Publications by Topic

Alternatively, see my publications by year.

Software Engineering

  1. Natural Language to Code Generation in Interactive Data Science Notebooks. Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov and Charles Sutton. In Proceedings of the Association for Computational Linguistics (ACL). 2023.

    [ arXiv | bib | abstract | source code ]

  2. Can Large Language Models Reason about Program Invariants? Kexin Pei, David Bieber, Kensen Shi, Charles Sutton and Pengcheng Yin. In International Conference on Machine Learning. 2023.

    [ .pdf | bib | abstract ]

  3. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. Rafael-Michael Karampatsis and Charles Sutton. In Working Conference on Mining Software Repositories (MSR; Data Showcase). 2020.

    [ arXiv | bib ]

  4. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. In ICSE Workshop on Automated Program Repair. 2020.

    [ arXiv | bib ]

  5. Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code. Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton and Andrea Janes. In International Conference on Software Engineering (ICSE). 2020.

    [ .pdf | bib ]

  6. Global Relational Models of Source Code. Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis and David Bieber. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

  7. Where should I comment my code? A dataset and model for predicting locations that need comments. Annie Louis, Santanu Kumar Dash, Earl T Barr, Michael D Ernst and Charles Sutton. In International Conference on Software Engineering (ICSE; NIER track). 2020.

    [ .pdf | bib | source code ]

  8. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. ArXiv e-prints. 2019.

    [ arXiv | bib ]

  9. Summarizing Software API Usage Examples using Clustering Techniques. Nikolaos Katirtzis, Themistoklis Diamantopoulos and Charles Sutton. In International Conference on Fundamental Approaches to Software Engineering (FASE). 2018.

    [ .pdf | bib | source code ]

  10. Mining Semantic Loop Idioms. Miltiadis Allamanis, Earl T. Barr, Christian Bird, Premkumar Devanbu, Mark Marron and Charles Sutton. IEEE Transactions on Software Engineering 44 (7). 2018.

    [ .pdf | bib ]

  11. Autofolding for Source Code Summarization. Jaroslav Fowkes, Razvan Ranca, Miltiadis Allamanis, Mirella Lapata and Charles Sutton. IEEE Transactions on Software Engineering 43 (12). 2017.

    [ .pdf | bib ]

  12. Parameter-Free Probabilistic API Mining across GitHub. Jaroslav Fowkes and Charles Sutton. In Foundations of Software Engineering (FSE). 2016.

    [ .pdf | bib | code and data ]

  13. A Convolutional Attention Network for Extreme Summarization of Source Code. Miltiadis Allamanis, Hao Peng and Charles Sutton. In International Conference on Machine Learning (ICML). 2016.

    [ .pdf | bib ]

  14. Suggesting Accurate Method and Class Names. Miltiadis Allamanis, Earl T. Barr, Christian Bird and Charles Sutton. In Foundations of Software Engineering (FSE). 2015. (Neural network model that can suggest a name for a method or class, given the method’s body and signature.)

    [ .pdf | bib | abstract | source code ]

  15. Mining idioms from source code. Miltos Allamanis and Charles Sutton. In Symposium on the Foundations of Software Engineering (FSE). 2014.

    [ .pdf | bib ]

  16. Learning Natural Coding Conventions. Miltiadis Allamanis, Earl T Barr, Christian Bird and Charles Sutton. In Symposium on the Foundations of Software Engineering (FSE). 2014.

    (Winner, ACM SIGSOFT Distinguished Paper Award.)

    [ .pdf | bib | source code ]

  17. Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code. Miltos Allamanis and Charles Sutton. In Working Conference on Mining Software Repositories (MSR). 2013.

    [ .pdf | bib ]

  18. Mining Source Code Repositories at Massive Scale using Language Modeling. Miltos Allamanis and Charles Sutton. In Working Conference on Mining Software Repositories (MSR). 2013.

    [ .pdf | bib ]

Natural Language Processing

  1. Natural Language to Code Generation in Interactive Data Science Notebooks. Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov and Charles Sutton. In Proceedings of the Association for Computational Linguistics (ACL). 2023.

    [ arXiv | bib | abstract | source code ]

  2. Unsupervised Deduplication using Cross-field Dependencies. Robert Hall, Charles Sutton and Andrew McCallum. In Conference on Knowledge Discovery and Data Mining (KDD). 2008. (Hierarchical DP model that jointly clusters citation venue strings based on both string-edit distance and title information.)

    [ .pdf | bib | abstract ]

  3. Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton and Andrew McCallum. In ICML Workshop on Prior Knowledge for Text and Language Processing. 2008. (Two Bayesian dependency parsing models: 1. Model with Pitman-Yor prior that significantly improves Eisner’s classic model; 2. Latent-variable model that learns "syntactic" topics.)

    [ .pdf | bib ]

  4. Joint Parsing and Semantic Role Labeling. Charles Sutton and Andrew McCallum. In Conference on Natural Language Learning (CoNLL). 2005.

    [ .pdf | bib ]

  5. Composition of Conditional Random Fields for Transfer Learning. Charles Sutton and Andrew McCallum. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP). 2005.

    [ .pdf | bib ]

  6. Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. In ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields. 2004.

    [ .pdf | bib ]

Markov Chain Monte Carlo

  1. Any-scale Balanced Samplers for Discrete Space. Haoran Sun, Bo Dai, Charles Sutton, Dale Schuurmans and Hanjun Dai. In International Conference on Learning Representations. 2023.

    [ .pdf | bib | abstract ]

  2. Couplings for Multinomial Hamiltonian Monte Carlo. Kai Xu, Tor Erlend Fjelde, Charles Sutton and Hong Ge. In International Conference on Artificial Intelligence and Statistics (AISTATS). 2021.

    [ arXiv | bib | source code ]

  3. Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models. Yichuan Zhang and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2014.

    [ .pdf | bib ]

  4. Continuous Relaxations for Discrete Hamiltonian Monte Carlo. Yichuan Zhang, Charles Sutton, Amos Storkey and Zoubin Ghahramani. In Advances in Neural Information Processing Systems (NIPS). 2012.

    [ .pdf | bib ]

  5. Quasi-Newton Markov chain Monte Carlo. Yichuan Zhang and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2011.

    [ .pdf | bib ]

Programming Languages

  1. Can Large Language Models Reason about Program Invariants? Kexin Pei, David Bieber, Kensen Shi, Charles Sutton and Pengcheng Yin. In International Conference on Machine Learning. 2023.

    [ .pdf | bib | abstract ]

  2. Mining Semantic Loop Idioms. Miltiadis Allamanis, Earl T. Barr, Christian Bird, Premkumar Devanbu, Mark Marron and Charles Sutton. IEEE Transactions on Software Engineering 44 (7). 2018.

    [ .pdf | bib ]

  3. Mining Semantic Loop Idioms from Big Code. Miltiadis Allamanis, Earl T. Barr, Christian Bird, Premkumar Devanbu, Mark Marron and Charles Sutton. Microsoft Research Technical Report, MSR-TR-2016-1116, 2016.

    [ .pdf | bib | abstract ]

Data Science

  1. Natural Language to Code Generation in Interactive Data Science Notebooks. Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov and Charles Sutton. In Proceedings of the Association for Computational Linguistics (ACL). 2023.

    [ arXiv | bib | abstract | source code ]

Program Synthesis

  1. CrossBeam: Learning to Search in Bottom-Up Program Synthesis. Kensen Shi, Hanjun Dai, Kevin Ellis and Charles Sutton. In International Conference on Learning Representations (ICLR). 2022.

    [ arXiv | bib ]

  2. SpreadsheetCoder: Formula Prediction from Semi-structured Context. Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin and Denny Zhou. In International Conference on Machine Learning (ICML). 2021.

    [ to appear | bib ]

  3. Latent Programmer: Discrete Latent Codes for Program Synthesis. Joey Hong, David Dohan, Rishabh Singh, Charles Sutton and Manzil Zaheer. In International Conference on Machine Learning (ICML). 2021.

    [ to appear | bib ]

  4. Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le and Charles Sutton. arXiv:2108.07732. 2021.

    [ arXiv | bib ]

  5. Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks. David Bieber, Charles Sutton, Hugo Larochelle and Daniel Tarlow. In Advances in Neural Information Processing Systems (NeurIPS). 2020.

    [ .pdf | bib | abstract ]

  6. Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration. Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton and Dale Schuurmans. In Advances in Neural Information Processing Systems (NeurIPS). 2020.

    [ .pdf | bib | abstract ]

  7. Learning to Represent Programs with Property Signatures. Augustus Odena and Charles Sutton. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

Language Modelling

  1. PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov and Noah Fiedel. arXiv:2204.02311. 2022.

    [ arXiv | bib ]

Sustainable Energy

  1. The IDEAL household energy dataset, electricity, gas, contextual sensor data and survey data for 255 UK homes. Martin Pullinger, Jonathan Kilgour, Nigel Goddard, Niklas Berliner, Lynda Webb, Myroslava Dzikovska, Heather Lovell, Janek Mann, Charles Sutton, Janette Webb and Mingjun Zhong. Scientific Data 8 (1). 2021.

    [ .pdf | bib | abstract | data ]

  2. Sequence-to-Point Learning with Neural Networks for Non-intrusive Load Monitoring. Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard and Charles Sutton. In National Conference on Artificial Intelligence (AAAI). 2018.

    [ .pdf | bib ]

  3. Latent Bayesian melding for integrating individual and population models. Mingjun Zhong, Nigel Goddard and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2015.

    [ .pdf | bib ]

  4. Signal Aggregate Constraints in Additive Factorial HMMs, with Application to Energy Disaggregation. Mingjun Zhong, Nigel Goddard and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2014.

    [ .pdf | bib ]

Data Sets

  1. The IDEAL household energy dataset, electricity, gas, contextual sensor data and survey data for 255 UK homes. Martin Pullinger, Jonathan Kilgour, Nigel Goddard, Niklas Berliner, Lynda Webb, Myroslava Dzikovska, Heather Lovell, Janek Mann, Charles Sutton, Janette Webb and Mingjun Zhong. Scientific Data 8 (1). 2021.

    [ .pdf | bib | abstract | data ]

Hardware Design Automation

  1. Learning Semantic Representations to Verify Hardware Designs. Shobha Vasudevan, Wenjie Jiang, David Bieber, Rishabh Singh, Hamid Shojaei, Richard Ho and Charles Sutton. In Advances in Neural Information Processing Systems (NeurIPS). 2021.

    [ to appear | bib ]

Deep Learning

  1. Generative Ratio Matching Networks. Akash Srivastava, Kai Xu, Michael U. Gutmann and Charles Sutton. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

  2. Global Relational Models of Source Code. Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis and David Bieber. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

  3. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. In ICSE Workshop on Automated Program Repair. 2020.

    [ arXiv | bib ]

  4. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. ArXiv e-prints. 2019.

    [ arXiv | bib ]

  5. Sequence-to-Point Learning with Neural Networks for Non-intrusive Load Monitoring. Chaoyun Zhang, Mingjun Zhong, Zongzuo Wang, Nigel Goddard and Charles Sutton. In National Conference on Artificial Intelligence (AAAI). 2018.

    [ .pdf | bib ]

  6. Autoencoding Variational Inference for Topic Models. Akash Srivastava and Charles Sutton. In International Conference on Learning Representations (ICLR). 2017.

    [ .pdf | arXiv | bib | discussion | source code ]

  7. Blending LSTMs into CNNs. Krzysztof J. Geras, Abdel-rahman Mohamed, Rich Caruana, Gregor Urban, Shengjie Wang, Ozlem Aslan, Matthai Philipose, Matthew Richardson and Charles Sutton. In International Conference on Learning Representations (ICLR Workshop). 2016.

    [ .pdf | bib ]

  8. Composite denoising autoencoders. Krzysztof Geras and Charles Sutton. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML-PKDD). 2016.

    [ .pdf | bib ]

  9. Scheduled Denoising Autoencoders. Krzysztof Geras and Charles Sutton. In International Conference on Learning Representations (ICLR). 2015.

    [ .pdf | bib ]

  10. Suggesting Accurate Method and Class Names. Miltiadis Allamanis, Earl T. Barr, Christian Bird and Charles Sutton. In Foundations of Software Engineering (FSE). 2015. (Neural network model that can suggest a name for a method or class, given the method’s body and signature.)

    [ .pdf | bib | abstract | source code ]

Program Repair

  1. Global Relational Models of Source Code. Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis and David Bieber. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

  2. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. Rafael-Michael Karampatsis and Charles Sutton. In Working Conference on Mining Software Repositories (MSR; Data Showcase). 2020.

    [ arXiv | bib ]

  3. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. In ICSE Workshop on Automated Program Repair. 2020.

    [ arXiv | bib ]

Graph Neural Networks

  1. Global Relational Models of Source Code. Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis and David Bieber. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

  2. Learning to Fix Build Errors with Graph2Diff Neural Networks. Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton and Edward Aftandilian. In ICSE Workshop on Automated Program Repair. 2020.

    [ arXiv | bib ]

Deep Learning

  1. Incremental Sampling Without Replacement for Sequence Models. Kensen Shi, David Bieber and Charles Sutton. In International Conference on Machine Learning (ICML). 2020.

    [ arXiv | bib ]

  2. A Convolutional Attention Network for Extreme Summarization of Source Code. Miltiadis Allamanis, Hao Peng and Charles Sutton. In International Conference on Machine Learning (ICML). 2016.

    [ .pdf | bib ]

Data Cleaning

  1. Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data. Simao Eduardo, Alfredo Nazabal, Christopher K. I. Williams and Charles Sutton. In Conference on Artificial Intelligence and Statistics (AISTATS). 2020.

    [ arXiv | bib ]

Deep Generative Models

  1. Generative Ratio Matching Networks. Akash Srivastava, Kai Xu, Michael U. Gutmann and Charles Sutton. In International Conference on Learning Representations. 2020.

    [ .pdf | bib ]

Energy-based Models

  1. Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration. Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton and Dale Schuurmans. In Advances in Neural Information Processing Systems (NeurIPS). 2020.

    [ .pdf | bib | abstract ]

Data Mining

  1. GEMSEC: Graph Embedding with Self Clustering. Benedek Rozemberczki, Ryan Davies, Rik Sarkar and Charles Sutton. ArXiv e-prints. 2018.

    [ arXiv | bib | source code | data ]

  2. A Subsequence Interleaving Model for Sequential Pattern Mining. Jaroslav Fowkes and Charles Sutton. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

    [ .pdf | bib | code and data ]

  3. A Bayesian Network Model for Interesting Itemsets. Jaroslav Fowkes and Charles Sutton. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML-PKDD). 2016.

    [ .pdf | bib | source code ]

Probabilistic Programming

  1. SlicStan: Improving Probabilistic Programming using Information Flow Analysis. Maria I. Gorinova, Andrew D. Gordon and Charles Sutton. In Probabilistic Programming Languages, Semantics, and Systems Workshop at the Symposium on Principles of Programming Languages (PPS 2018). 2018.

    [ .pdf | bib ]

  2. SlicStan: A Blockless Stan-like Language. Maria I. Gorinova, Andrew D. Gordon and Charles Sutton. In StanCon. 2018.

    [ .pdf | bib ]

Big Code

  1. A Survey of Machine Learning for Big Code and Naturalness. Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu and Charles Sutton. ACM Computing Surveys 51 (4). 2018.

    [ arXiv | bib ]

  2. Mining Semantic Loop Idioms from Big Code. Miltiadis Allamanis, Earl T. Barr, Christian Bird, Premkumar Devanbu, Mark Marron and Charles Sutton. Microsoft Research Technical Report, MSR-TR-2016-1116, 2016.

    [ .pdf | bib | abstract ]

Interpretability

  1. Interpreting Deep Classifier by Visual Distillation of Dark Knowledge. Kai Xu, Dae Hoon Park, Yi Chang and Charles Sutton. ArXiv e-prints. 2018.

    [ .pdf | bib ]

Data Wrangling

  1. Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling. Charles Sutton, Timothy Hobson, James Geddes and Rich Caruana. In Conference on Knowledge Discovery and Data Mining (KDD). 2018.

    [ .pdf | bib ]

API Mining

  1. Summarizing Software API Usage Examples using Clustering Techniques. Nikolaos Katirtzis, Themistoklis Diamantopoulos and Charles Sutton. In International Conference on Fundamental Approaches to Software Engineering (FASE). 2018.

    [ .pdf | bib | source code ]

Topic Models

  1. Autoencoding Variational Inference for Topic Models. Akash Srivastava and Charles Sutton. In International Conference on Learning Representations (ICLR). 2017.

    [ .pdf | arXiv | bib | discussion | source code ]

  2. Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code. Miltos Allamanis and Charles Sutton. In Working Conference on Mining Software Repositories (MSR). 2013.

    [ .pdf | bib ]

  3. Unsupervised Deduplication using Cross-field Dependencies. Robert Hall, Charles Sutton and Andrew McCallum. In Conference on Knowledge Discovery and Data Mining (KDD). 2008. (Hierarchical DP model that jointly clusters citation venue strings based on both string-edit distance and title information.)

    [ .pdf | bib | abstract ]

  4. Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton and Andrew McCallum. In ICML Workshop on Prior Knowledge for Text and Language Processing. 2008. (Two Bayesian dependency parsing models: 1. Model with Pitman-Yor prior that significantly improves Eisner’s classic model; 2. Latent-variable model that learns "syntactic" topics.)

    [ .pdf | bib ]

Computer Security

  1. Learning and Verifying Unwanted Behaviours. Wei Chen, David Aspinall, Andrew Gordon, Charles Sutton and Igor Muttik. In Workshop on Hot Issues in Security Principles and Trust (HotSpot 2016). 2016.

    [ .pdf | bib ]

  2. More Semantics More Robust: Improving Android Malware Classifiers. Wei Chen, David Aspinall, Andrew D Gordon, Charles Sutton and Igor Muttik. In ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec). 2016.

    [ to appear | bib ]

  3. Misleading learners: Co-opting your spam filter. Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D. Joseph, Benjamin I. P. Rubinstein, Udam Saini, Charles Sutton, J. D. Tygar and Kai Xia. In Tsai, Jeffrey J. P. and Yu, Philip S., editors. Machine Learning in Cyber Trust: Security, Privacy, Reliability. Springer. 2009.

    [ .pdf | bib ]

  4. Exploiting Machine Learning to Subvert your Spam Filter. Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D. Joseph, Benjamin I. P. Rubinstein, Udam Saini, Charles Sutton, J. D. Tygar and Kai Xia. In Proceedings of the First USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET). 2008. (Send crafted email to a spam filter to cause it to misclassify your normal email as spam. Initial experiments on defenses against this attack.)

    [ .pdf | bib ]

Queueing Networks

  1. A Bayesian Approach to Parameter Inference in Queueing Networks. Weikun Wang, Giuliano Casale and Charles Sutton. ACM Transactions on Modeling and Computer Simulation 27 (1). 2016.

    [ .pdf | bib ]

  2. Bayesian Inference in Queueing Networks. Charles Sutton and Michael I. Jordan. Annals of Applied Statistics 5 (1). 2011.

    [ .pdf | bib | source code ]

  3. Learning and Inference in Queueing Networks. Charles Sutton and Michael I. Jordan. In Conference on Artificial Intelligence and Statistics (AISTATS). 2010. (Conference version of the longer paper "Bayesian Inference in Queueing Networks".)

    [ .pdf | bib ]

  4. Probabilistic inference in queueing networks. Charles Sutton and Michael I. Jordan. In Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SYSML). 2008.

    [ .pdf | bib ]

Interactive Machine Learning

  1. Clustering with a Reject Option: Interactive Clustering as Bayesian Prior Elicitation. Akash Srivastava, James Zou, Ryan P. Adams and Charles Sutton. In Workshop on Human Interpretability in Machine Learning (co-located with ICML). 2016.

    [ .pdf | bib ]

Weak Supervision

  1. Latent Bayesian melding for integrating individual and population models. Mingjun Zhong, Nigel Goddard and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2015.

    [ .pdf | bib ]

  2. Signal Aggregate Constraints in Additive Factorial HMMs, with Application to Energy Disaggregation. Mingjun Zhong, Nigel Goddard and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2014.

    [ .pdf | bib ]

Approximate Inference

  1. Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models. Yichuan Zhang and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2014.

    [ .pdf | bib ]

  2. Continuous Relaxations for Discrete Hamiltonian Monte Carlo. Yichuan Zhang, Charles Sutton, Amos Storkey and Zoubin Ghahramani. In Advances in Neural Information Processing Systems (NIPS). 2012.

    [ .pdf | bib ]

  3. Quasi-Newton Markov chain Monte Carlo. Yichuan Zhang and Charles Sutton. In Advances in Neural Information Processing Systems (NIPS). 2011.

    [ .pdf | bib ]

  4. Improved Dynamic Schedules for Belief Propagation. Charles Sutton and Andrew McCallum. In Conference on Uncertainty in Artificial Intelligence (UAI). 2007. (Significantly faster version of loopy BP by selecting which messages to send based on an approximation to their residual.)

    [ .pdf | bib | abstract ]

  5. Sparse Forward-Backward using Minimum Divergence Beams for Fast Training of Conditional Random Fields. Chris Pal, Charles Sutton and Andrew McCallum. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2006. (New criterion for adaptive beam size within forward-backward, suggested by a variational perspective. Works well within CRF training.)

    [ .pdf | bib | abstract ]

Visualization

  1. Word Storms: Multiples of Word Clouds for Visual Comparison of Documents. Quim Castella and Charles Sutton. In International World Wide Web Conference (WWW). 2014.

    [ .pdf | bib ]

Databases

  1. Supporting User-Defined Functions on Uncertain Data. Thanh T. L. Tran, Yanlei Diao, Charles Sutton and Anna Liu. Proceedings of the VLDB Endowment (PVLDB). 2013.

    [ .pdf | bib ]

  2. Distributed Inference and Query Processing for RFID Tracking and Monitoring. Zhao Cao, Charles Sutton, Yanlei Diao and Prashant Shenoy. Proceedings of the VLDB Endowment (PVLDB) 4 (5). 2011.

    [ .pdf | bib ]

  3. Capturing Data Uncertainty in High-Volume Stream Processing. Yanlei Diao, Boduo Li, Anna Liu, Liping Peng, Charles Sutton, Thanh Tran and Michael Zink. In Conference on Innovative Data Systems Research (CIDR). 2009.

    [ .pdf | bib ]

  4. Probabilistic Inference over RFID Streams in Mobile Environments. Thanh Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao and Prashant Shenoy. In International Conference on Data Engineering (ICDE). 2009.

    [ .pdf | bib ]

Evaluation of Machine Learning

  1. Multiple-source Cross Validation. Krzysztof Geras and Charles Sutton. In International Conference on Machine Learning (ICML). 2013.

    [ .pdf | bib ]

Conditional Random Fields

  1. An Introduction to Conditional Random Fields. Charles Sutton and Andrew McCallum. Foundations and Trends in Machine Learning 4 (4). 2012.

    [ .pdf | bib | abstract ]

  2. Piecewise Training for Structured Prediction. Charles Sutton and Andrew McCallum. Machine Learning 77 (2–3). 2009. (Train undirected graphical model by splitting into overlapping parts that are trained independently. Connections to pseudolikelihood and Bethe free energy. Journal version of UAI and ICML papers below.)

    [ .pdf | bib | abstract ]

  3. Efficient Training Methods for Conditional Random Fields. Charles Sutton. Ph.D. Dissertation, University of Massachusetts, 2008.

    [ .pdf | bib ]

  4. Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Andrew McCallum and Khashayar Rohanimanesh. Journal of Machine Learning Research 8. 2007. (Combination of dynamic Bayesian networks and conditional random fields. Also considers latent-variable model and cascaded training. Journal version of ICML and EMNLP papers below.)

    [ .pdf | bib | abstract ]

  5. An Introduction to Conditional Random Fields for Relational Learning. Charles Sutton and Andrew McCallum. In Getoor, Lise and Taskar, Ben, editors. Introduction to Statistical Relational Learning. MIT Press. 2007. (Detailed tutorial on conditional random fields. Includes motivation, background, mathematical foundations, linear-chain form, general-structure form, inference, parameter estimation, and tips and tricks. NOTE: In Equation (1.22), there is a small error. There should not be a summation over k in the final term, just lambda_k / sigma^2. A sketch of the corrected term appears after this list.)

    [ .pdf | bib ]

  6. Piecewise Pseudolikelihood for Efficient CRF Training. Charles Sutton and Andrew McCallum. In International Conference on Machine Learning (ICML). 2007. (Train a large CRF five times faster by dividing it into separate pieces and reducing the number of predicted variable combinations with pseudolikelihood. Analysis in terms of belief propagation and Bethe energy.)

    [ .pdf | bib | abstract ]

  7. Sparse Forward-Backward using Minimum Divergence Beams for Fast Training of Conditional Random Fields. Chris Pal, Charles Sutton and Andrew McCallum. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2006. (New criterion for adaptive beam size within forward-backward, suggested by a variational perspective. Works well within CRF training.)

    [ .pdf | bib | abstract ]

  8. Reducing Weight Undertraining in Structured Discriminative Learning. Charles Sutton, Michael Sindelar and Andrew McCallum. In Conference on Human Language Technology and North American Association for Computational Linguistics (HLT-NAACL). 2006. (Trains multiple linear-chain CRFs with different subsets of features, in order to force dependent sets of features to be able to separately model the class label.)

    (This is the published version. An early version had an error in Section 4, under Per-Sequence Mixtures.)

    [ .pdf | bib | abstract ]

  9. Local Training and Belief Propagation. Charles Sutton and Tom Minka. Microsoft Research Technical Report, TR-2006-121, 2006.

    [ .pdf | bib ]

  10. Learning in Markov Random Fields with Contrastive Free Energies. Max Welling and Charles Sutton. In Conference on Artificial Intelligence and Statistics (AISTATS). 2005.

    [ .pdf | bib | abstract ]

  11. Fast, Piecewise Training for Discriminative Finite-state and Parsing Models. Charles Sutton and Andrew McCallum. Center for Intelligent Information Retrieval Technical Report, IR-403, 2005.

    [ .pdf | bib ]

  12. Piecewise Training of Undirected Models. Charles Sutton and Andrew McCallum. In Conference on Uncertainty in Artificial Intelligence (UAI). 2005. (Train large CRF by dividing into pieces and training independently. The explanation in this paper for why it works is somewhat unsatisfying. Consult the journal version, "Piecewise Training for Structured Prediction" (Machine Learning, 2009), for a better story.)

    [ .pdf | bib | abstract ]

  13. Composition of Conditional Random Fields for Transfer Learning. Charles Sutton and Andrew McCallum. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP). 2005.

    [ .pdf | bib ]

  14. Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. In ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields. 2004.

    [ .pdf | bib ]

  15. Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. In International Conference on Machine Learning (ICML). 2004. (Combination of dynamic Bayesian networks and conditional random fields, with experiments in noun-phrase chunking.)

    [ .pdf | bib | abstract ]

  16. Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs. Andrew McCallum and Charles Sutton. In NIPS Workshop on Learning with Structured Outputs. 2004.

    [ .pdf | bib | abstract ]

  17. Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. Andrew McCallum, Khashayar Rohanimanesh and Charles Sutton. In NIPS Workshop on Syntax, Semantics, and Statistics. 2003.

    [ .pdf | bib | abstract ]
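
    A note on the erratum in entry 5 above: the following is a sketch of the corrected term, assuming the standard linear-chain CRF penalized log-likelihood with a Gaussian prior on the weights, \ell(\theta) = \sum_i \log p(y^{(i)} \mid \mathbf{x}^{(i)}) - \sum_k \lambda_k^2 / (2\sigma^2); the notation here is illustrative and may differ slightly from the chapter's.

    \frac{\partial \ell}{\partial \lambda_k}
      = \sum_{i,t} f_k\big(y^{(i)}_t, y^{(i)}_{t-1}, \mathbf{x}^{(i)}_t\big)
      - \sum_{i,t} \sum_{y, y'} p\big(y_t = y,\, y_{t-1} = y' \mid \mathbf{x}^{(i)}\big)\, f_k\big(y, y', \mathbf{x}^{(i)}_t\big)
      - \frac{\lambda_k}{\sigma^2}

    The erratum concerns only the last term: for a single weight \lambda_k it is \lambda_k / \sigma^2, with no summation over k.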

Systems / Machine Learning

  1. Automatic Exploration of Datacenter Performance Regimes. Peter Bodik, Rean Griffith, Charles Sutton, Armando Fox, Michael I. Jordan and David A. Patterson. In First Workshop on Automated Control for Datacenters and Clouds (ACDC ’09). 2009.

    [ .pdf | bib ]

  2. Statistical Machine Learning Makes Automatic Control Practical for Internet Datacenters. Peter Bodik, Rean Griffith, Charles Sutton, Armando Fox, Michael I. Jordan and David A. Patterson. In Workshop on Hot Topics in Cloud Computing (HotCloud ’09). 2009.

    [ .pdf | bib ]

  3. Response-Time Modeling for Resource Allocation and Energy-Informed SLAs. Peter Bodik, Charles Sutton, Armando Fox, David Patterson and Michael I. Jordan. In NIPS Workshop on Statistical Learning Techniques for Solving Systems Problems (MLSys 07). 2007. (Quantile regression (both parametric and nonparametric) for predicting the performance of a web service as a function of workload and power consumption. Much better for voltage control than built-in frequency scaling. A sketch of the quantile loss appears after this list.)

    [ .pdf | bib ]
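
    As background for the quantile regression mentioned in entry 3, a standard formulation (stated here as general background, not taken from the paper): a model of the \tau-th conditional quantile is fit by minimizing the pinball (check) loss over the training data,

    \ell_\tau(y, \hat{y}) = \max\big(\tau\,(y - \hat{y}),\; (\tau - 1)\,(y - \hat{y})\big),

    so that, for example, \tau = 0.95 gives a predictor of the 95th percentile of response time rather than its mean.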

Nonparametric Bayesian Models

  1. Unsupervised Deduplication using Cross-field Dependencies. Robert Hall, Charles Sutton and Andrew McCallum. In Conference on Knowledge Discovery and Data Mining (KDD). 2008. (Hierarchical DP model that jointly clusters citation venue strings based on both string-edit distance and title information.)

    [ .pdf | bib | abstract ]

  2. Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton and Andrew McCallum. In ICML Workshop on Prior Knowledge for Text and Language Processing. 2008. (Two Bayesian dependency parsing models: 1. Model with Pitman-Yor prior that significantly improves Eisner’s classic model; 2. Latent-variable model that learns "syntactic" topics.)

    [ .pdf | bib ]

Information Extraction

  1. Unsupervised Deduplication using Cross-field Dependencies. Robert Hall, Charles Sutton and Andrew McCallum. In Conference on Knowledge Discovery and Data Mining (KDD). 2008. (Hierarchical DP model that jointly clusters citation venue strings based on both string-edit distance and title information.)

    [ .pdf | bib | abstract ]

  2. Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. In ICML Workshop on Statistical Relational Learning and Its Connections to Other Fields. 2004.

    [ .pdf | bib ]