People often ask me what they can read to learn more about recent Bayesian modeling techniques and their applications to language learning. Here is a list of the papers I have found to be most useful and relevant to my own research. I try to emphasize the papers aimed at a slightly less technical/more cognitively inclined audience. This is not intended to be a complete list, only a starting point.

**Note:** This list has not been updated since 2008, in part because the
area has now expanded considerably, and keeping it up-to-date would be difficult. But
I've decided to keep this list up in case it's still useful to people.

** General introductory material **

Thomas L. Griffiths and Alan Yuille (2006). A primer on probabilistic inference. Trends in Cognitive Sciences. Supplement to special issue on Probabilistic Models of Cognition (volume 10, issue 7).

- Reviews many of the basic concepts underlying probabilistic (especially Bayesian) modeling and inference, using simple examples.

Sharon Goldwater (2006). Nonparametric Bayesian Models of Lexical Acquisition. Unpublished doctoral dissertation, Brown University, 2006.

- Aimed primarily at computational linguists, but should (I hope) be accessible to anyone who has a basic familiarity with generative probabilistic models. Chapters 2 and 3 cover many useful topics, including Bayesian integration in finite and infinite models (i.e., Dirichlet distribution, Dirichlet process, Chinese restaurant process) and a brief introduction to sampling techniques (Gibbs sampling and Metropolis-Hastings sampling).

- A very nice introduction to Dirichlet processes aimed at cognitive scientists. Slightly more in-depth, covers the stick-breaking construction for the Dirichlet process (which is not in my thesis) as well as the Chinese restaurant process.
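For readers who prefer to see the two constructions mentioned above in code before reading about them, here is a minimal sketch in Python. It is not taken from any of the works listed; the function names `crp_seating` and `stick_breaking` and the parameter name `alpha` (the concentration parameter) are my own choices for illustration. The first function seats customers one at a time under the Chinese restaurant process, and the second draws the first few mixture weights of a Dirichlet process by stick-breaking.

```python
import random
from typing import List, Optional


def crp_seating(n_customers: int, alpha: float,
                rng: Optional[random.Random] = None) -> List[int]:
    """Seat customers by the Chinese restaurant process; return table sizes.

    With i customers already seated, the next one joins existing table k
    with probability size_k / (i + alpha) and starts a new table with
    probability alpha / (i + alpha).
    """
    rng = rng or random.Random()
    tables: List[int] = []
    for i in range(n_customers):
        r = rng.uniform(0, i + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[k] += 1  # join an existing table
                break
        else:
            tables.append(1)  # no table chosen: open a new one
    return tables


def stick_breaking(n_sticks: int, alpha: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the first n_sticks weights of a stick-breaking construction.

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick,
    so the returned weights sum to slightly less than 1.
    """
    rng = rng or random.Random()
    remaining = 1.0
    weights: List[float] = []
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights


if __name__ == "__main__":
    print(crp_seating(100, alpha=1.0, rng=random.Random(0)))
    print(stick_breaking(10, alpha=1.0, rng=random.Random(0)))
```

Larger values of `alpha` tend to produce more tables (and more evenly spread weights); smaller values concentrate the mass on a few.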

** Bayesian language models for learning **

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson (2007). Distributional Cues to Word Segmentation: Context is Important. Proceedings of the 31st Boston University Conference on Language Development.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson (2006). Contextual Dependencies in Unsupervised Word Segmentation. Proceedings of Coling/ACL.

- These two papers apply the Dirichlet process and hierarchical Dirichlet process to word segmentation. The BUCLD paper is more conceptual; the Coling/ACL paper is more technical. For a more in-depth treatment, see also Chapter 5 of my thesis (above).

Sharon Goldwater and Thomas L. Griffiths (2007). A Fully Bayesian Approach to Unsupervised Part-of-Speech Tagging. Proceedings of the Association for Computational Linguistics.

- This paper provides a direct comparison between Bayesian methods (averaging over parameters and estimation using Gibbs sampling) and standard methods (estimating parameters directly using EM) using the same underlying model (a standard finite HMM).

Mark Johnson (2007). Why Doesn't EM Find Good HMM POS-Taggers? Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

- Includes Variational Bayes as well as Gibbs sampling and EM as estimation procedures. The results somewhat contradict those of Goldwater and Griffiths, possibly due to the combination of a simpler model and more training data.

Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein (2007). The infinite PCFG using hierarchical Dirichlet processes. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Jenny Rose Finkel, Trond Grenager and Christopher D. Manning (2007). The Infinite Tree. Proceedings of the Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater (2007). Adaptor Grammars: a Framework for Specifying Compositional Nonparametric Bayesian Models. Advances in Neural Information Processing Systems 19.

- These three papers all deal with nonparametric models of syntax (dependency or context-free grammars). They might be a bit tough for those with less background in nonparametrics, although the exposition in Liang et al. is very nice.

Thomas L. Griffiths, Michael Steyvers, and Joshua B. Tenenbaum (2007). Topics in semantic representation. Psychological Review, 114, 211-244.

Thomas L. Griffiths, Michael Steyvers, David M. Blei, and Joshua B. Tenenbaum (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems 17.

David Blei, Andrew Ng, and Michael Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022. (A shorter version appeared in NIPS 2002).

- These three papers are about Latent Dirichlet Allocation (a.k.a. topic models) for learning semantic structure. The Psych Review paper provides a less technical introduction and considers LDA as a cognitive model. The JMLR paper is the original one, suitable if you want more technical details. The NIPS paper is just cool.

Fei Xu and Joshua B. Tenenbaum (2007). Word learning as Bayesian inference. Psychological Review, 114, 245-272.

- Develops a Bayesian model to explain how children learn words at different levels of specificity (basic-level categories versus subordinate or superordinate).

** Bayesian models of language processing **

This isn't really my area, but here are a couple of interesting papers I know of:

Dennis Norris (2006). The Bayesian reader: explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327-357.

Naomi Feldman and Thomas L. Griffiths (2007). A rational account of the perceptual magnet effect. Proceedings of the Twenty-Ninth Annual Conference of the Cognitive Science Society.

** Inference **

A bunch of the papers mentioned above have descriptions of sampling algorithms and/or variational inference procedures for specific models. For more general information on these topics, consider reading some of the following:

Sharon Goldwater (2006). Nonparametric Bayesian Models of Lexical Acquisition. Unpublished doctoral dissertation, Brown University, 2006.

- As I mentioned above, there is a brief overview of Markov chain Monte Carlo methods (Gibbs sampling and Metropolis-Hastings) in Chapter 2. Examples of Gibbs sampling algorithms are described in chapters 4 and 5.
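To give a feel for what a Gibbs sampler does before diving into those chapters, here is a toy sketch in Python. It is not one of the algorithms from the thesis: it samples from a standard bivariate normal with correlation `rho`, a textbook case where each conditional distribution is known exactly (x given y is normal with mean `rho * y` and variance `1 - rho**2`, and symmetrically for y). The function name `gibbs_bivariate_normal` and its parameters are assumptions made for this example.

```python
import random
from typing import List, Tuple


def gibbs_bivariate_normal(n_samples: int, rho: float,
                           burn_in: int = 500,
                           seed: int = 0) -> List[Tuple[float, float]]:
    """Gibbs sampling from a standard bivariate normal with correlation rho.

    Alternately resamples x from p(x | y) and y from p(y | x), discarding
    the first burn_in sweeps.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5  # std. dev. of each conditional
    x, y = 0.0, 0.0                    # arbitrary starting point
    samples: List[Tuple[float, float]] = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # sample x given current y
        y = rng.gauss(rho * x, cond_sd)  # sample y given the new x
        if t >= burn_in:
            samples.append((x, y))
    return samples


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(20000, rho=0.8)
    mean_xy = sum(x * y for x, y in draws) / len(draws)
    print(f"estimated E[xy] (should be near rho): {mean_xy:.3f}")
```

The samplers in the papers above follow the same pattern, except that the variables being resampled are discrete (word boundaries, tags, table assignments) and the conditionals come from the model rather than a closed-form normal.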

Julian Besag (2000). Markov chain Monte Carlo for statistical inference. Working paper no. 9. University of Washington Center for Statistics and the Social Sciences.

- A longer and more technical introduction to Markov chain Monte Carlo methods.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater (2007). Bayesian Inference for PCFGs via Markov Chain Monte Carlo. Proceedings of the North American Association for Computational Linguistics.

- How to do efficient sampling for PCFGs.

- I don't know much about variational methods myself, but I've been told this is a good place to start.

** Further Reading **

Yee Whye Teh, Michael Jordan, Matthew Beal, and David Blei (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581.

- The original HDP paper. Comprehensive, but I would suggest getting familiar with the ideas using some of the resources above before reading this one.

Radford Neal (1993). Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical report CRG-TR-93-1. University of Toronto Department of Computer Science.

- Even more information about Markov chain Monte Carlo methods.