News
- September 2018: I have moved to Mountain View as a
research scientist at Google Brain! I maintain
an affiliation with the University of Edinburgh,
and my research group there is still doing amazing work!
- August 2017: We are starting an exciting new project on
Artificial Intelligence for Data Analytics at the Alan Turing Institute. Co-investigators are Chris Williams (Edinburgh), Zoubin Ghahramani (Cambridge), and Ian Horrocks (Oxford). Please get in touch if you would like to know more or to collaborate.
- July 2017: I have finally released my old code
for probabilistic inference in queueing networks
on GitHub. This is the code from Sutton and Jordan (2011), Annals of Applied Statistics.
Research
My research concerns a broad range of applications of probabilistic methods
for machine learning, including software engineering, natural language processing,
computer security, queueing theory, and sustainable energy.
Although these applications are disparate, they are connected by
an underlying statistical methodology in probabilistic modelling
and techniques for approximate inference in graphical models.
My research strategy is based on the idea that sufficiently difficult applications
motivate the development of new methodology. I aim to develop new machine learning
methods based on this interplay of theory and practice.
I am part of a large machine learning group at Edinburgh.
Here is some information for prospective students in the group.
My position is funded through the Scottish Informatics and Computer Science Alliance.
Recent Publications
Please see my full list of publications,
or my publications sorted by topic.
Here are a few recent highlights:
- A Survey of Machine Learning for Big Code and Naturalness. Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu and Charles Sutton.
ACM Computing Surveys 51 (4). 2018.
[ arXiv | bib ]
@article{big-code-survey,
author = {Allamanis, Miltiadis and Barr, Earl T. and Devanbu, Premkumar and Sutton, Charles},
journal = {ACM Computing Surveys},
month = {sep},
number = {4},
title = {A Survey of Machine Learning for Big Code and Naturalness},
volume = {51},
year = {2018}
}
- Autoencoding Variational Inference for Topic Models. Akash Srivastava and Charles Sutton.
In International Conference on Learning Representations (ICLR). 2017.
[ .pdf | arXiv | bib | discussion | source code ]
@inproceedings{srivastava17lda,
author = {Srivastava, Akash and Sutton, Charles},
booktitle = {International Conference on Learning Representations (ICLR)},
title = {Autoencoding Variational Inference for Topic Models},
year = {2017}
}
- VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning. Akash Srivastava, Lazar Valkov, Chris Russell, Michael Gutmann and Charles Sutton.
In Advances in Neural Information Processing Systems (NIPS). 2017.
[ .pdf | bib | abstract | code and data ]
Deep generative models provide powerful tools for distributions over complicated manifolds, such as those of natural images. But many of these methods, including generative adversarial networks (GANs), can be difficult to train, in part because they are prone to mode collapse, which means that they characterize only a few modes of the true distribution. To address this, we introduce VEEGAN, which features a reconstructor network, reversing the action of the generator by mapping from data to noise. Our training objective retains the original asymptotic consistency guarantee of GANs, and can be interpreted as a novel autoencoder loss over the noise. In sharp contrast to a traditional autoencoder over data points, VEEGAN does not require specifying a loss function over the data, but rather only over the representations, which are standard normal by assumption. On an extensive set of synthetic and real world image datasets, VEEGAN indeed resists mode collapsing to a far greater extent than other recent GAN variants, and produces more realistic samples.
@inproceedings{srivastava17veegan,
author = {Srivastava, Akash and Valkov, Lazar and Russell, Chris and Gutmann, Michael and Sutton, Charles},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
title = {VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning},
year = {2017}
}
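The abstract above carries the key idea: the reconstruction penalty lives in noise space, not data space. As a rough illustration, one training step in this style might look like the sketch below. This is a minimal sketch with toy MLP sizes, illustrative variable names, and made-up hyperparameters, not the released implementation; for the real thing, see the code and data link above.

# Minimal sketch of a VEEGAN-style training step (illustrative only;
# architectures, dimensions, and hyperparameters are toy assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_z, dim_x = 8, 2  # toy latent and data dimensions (assumption)

# Generator G: noise -> data; reconstructor R: data -> noise
# (the paper's reconstructor network; named R here to avoid
# clashing with torch.nn.functional's conventional alias F).
G = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))
R = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))
# Discriminator D acts on joint (z, x) pairs.
D = nn.Sequential(nn.Linear(dim_z + dim_x, 64), nn.ReLU(), nn.Linear(64, 1))

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(list(G.parameters()) + list(R.parameters()), lr=1e-4)

def train_step(x_real):
    n = x_real.size(0)
    z = torch.randn(n, dim_z)  # noise prior: standard normal

    # Discriminator: separate (z, G(z)) pairs from (R(x), x) pairs.
    d_fake = D(torch.cat([z, G(z).detach()], dim=1))
    d_real = D(torch.cat([R(x_real).detach(), x_real], dim=1))
    loss_d = F.binary_cross_entropy_with_logits(d_fake, torch.ones(n, 1)) \
           + F.binary_cross_entropy_with_logits(d_real, torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator + reconstructor: fool D, plus the autoencoder loss in
    # *noise space*, ||z - R(G(z))||^2, instead of a pixel-space loss.
    x_fake = G(z)
    loss_g = D(torch.cat([z, x_fake], dim=1)).mean() + F.mse_loss(R(x_fake), z)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

Because z is standard normal by construction, the squared error between R(G(z)) and z is a sensible loss without hand-designing any distance over images.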
- An Introduction to Conditional Random Fields. Charles Sutton and Andrew McCallum.
Foundations and Trends in Machine Learning 4 (4). 2012.
[ .pdf | bib | abstract ]
Often we wish to predict a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling, combining the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This tutorial describes conditional random fields, a popular probabilistic method for structured prediction. CRFs have seen wide application in natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large scale CRFs. We do not assume previous knowledge of graphical modeling, so this tutorial is intended to be useful to practitioners in a wide variety of fields.
@article{crftut:fnt,
author = {Sutton, Charles and McCallum, Andrew},
journal = {Foundations and Trends in Machine Learning},
number = {4},
pages = {267--373},
title = {An Introduction to Conditional Random Fields},
volume = {4},
year = {2012}
}
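Since the abstract stops short of a formula, here is the one-line version of the model the tutorial is about: a linear-chain CRF defines a conditional distribution over a label sequence y given an observation sequence x, in standard notation, where the f_k are feature functions with weights \theta_k and Z(x) is the input-dependent partition function:

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big( \sum_{k} \theta_k\, f_k(y_t, y_{t-1}, \mathbf{x}_t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\Big( \sum_{k} \theta_k\, f_k(y'_t, y'_{t-1}, \mathbf{x}_t) \Big).

Inference (for example, the forward-backward algorithm) and parameter estimation for this family are exactly what the tutorial walks through.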
Finally, I have a collection of brief, tutorial-style research
notes (very old).
Research Group
I collaborate with a wonderful group of students and researchers
who have, for whatever reason, chosen to go under the
name CUP: Charles's Uncertain People.
We have a weekly reading group,
to which all are welcome.
A subgroup of CUP, called MAST (Machine learning for the Analysis of Source code Text), focuses
on machine learning for software engineering and programming languages.
Our software in this area is available via the MAST GitHub group.
Members of my research group: current and former members of my group
are listed at the CUP group web site.
Projects
Some of my research projects have dedicated pages.
But not all of my research fits into one of these web sites. To get the whole story, read all of my papers!
Advisors, Mentors, Collaborators
- My graduate advisor was Andrew McCallum at the University of Massachusetts Amherst.
- I did a postdoc at the
University of California, Berkeley
working with Michael I. Jordan. I also collaborated with Dave Patterson,
Randy Katz, Armando Fox, and Anthony Joseph in networking and systems.
I participated in the RAD Lab, which focused on issues in the design and
management of data center applications.
- I worked as an intern at Microsoft Research with Tom Minka.
- Other collaborators include Earl Barr (UCL), Zoubin Ghahramani (Cambridge), Max Welling (University of Amsterdam), Chris Pal (École Polytechnique de Montréal), Khashayar Rohanimanesh (UMass), Yanlei Diao (École Polytechnique), Prashant Shenoy (UMass), Hanna Wallach (Microsoft Research), Peter Bodik (Microsoft Research), Rob Hall (TripAdvisor), Michael Sindelar (Uber).
Personal
Hobbies: I live with cats and fish, who don't interact as much as you
might think. I've played a few computer games,
mostly adventure games and RPGs. I play Go (圍棋, 囲碁, 바둑).
If you would like to know where to play Go in person, try
the American Go Association
or the British Go Association.
I enjoy cooking.
When I was in university, I was a bit sillier than I am now, so I
created a silly web site called al.oysi.us.
The URL is easy to remember, because as I'm sure you're aware, Aloysius is my middle name.
Warning: May not be suitable for the silliness-challenged.
Does this page seem a bit boring? That's because you haven't cracked the Easter egg yet.