Computational Learning of Probabilistic Grammars in the Unsupervised Setting
Shay Cohen, 2011

Abstract

With the rising amount of available multilingual text data, computational linguistics faces both an opportunity and a challenge. This text can enrich the domains of NLP applications and improve their performance. Inducing natural language structure from such data with traditional supervised learning would require annotating part of this text, and for such large amounts of text, annotation is daunting and expensive. Unsupervised learning of natural language structure obviates the need for such annotation.

Natural language structure can be modeled with probabilistic grammars, generative statistical models that are well suited to compositional and sequential structure. Probabilistic grammars are widely used in natural language processing, and they also appear in other fields such as computer vision, computational biology, and cognitive science. This dissertation presents a theoretical and empirical analysis of learning these widely used grammars in the unsupervised setting.
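To make this concrete, below is a minimal sketch (not taken from the thesis) of a probabilistic context-free grammar used as a generative model: each nonterminal is rewritten by a rule drawn according to its rule probabilities, and the leaves of the resulting derivation form a sentence. The toy grammar and its probabilities are hypothetical.

import random

# Toy PCFG: each nonterminal expands by one of its rules, chosen with the
# given probability. The grammar and probabilities are illustrative only.
RULES = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.7), (("N",), 0.3)],
    "VP":  [(("V", "NP"), 0.6), (("V",), 0.4)],
    "Det": [(("the",), 1.0)],
    "N":   [(("dog",), 0.5), (("cat",), 0.5)],
    "V":   [(("saw",), 0.5), (("chased",), 0.5)],
}

def generate(symbol="S"):
    """Sample a sentence from the PCFG by expanding it top-down."""
    if symbol not in RULES:            # terminal symbol: emit the word
        return [symbol]
    rhss, probs = zip(*RULES[symbol])  # possible right-hand sides and their probabilities
    rhs = random.choices(rhss, weights=probs, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child))
    return words

print(" ".join(generate()))            # e.g. "the dog chased the cat"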

We analyze the computational properties involved in estimating probabilistic grammars: the computational complexity of the inference problem and the sample complexity of the learning problem. We show that the common inference problems for probabilistic grammars are computationally hard, even though a sample of polynomial size is sufficient for accurate estimation. We also give a variational inference framework for estimating probabilistic grammars in the empirical Bayesian setting, which permits the use of non-conjugate priors with probabilistic grammars as well as parallelizable inference. The estimation techniques we use include two types of priors on probabilistic grammars: logistic normal priors and adaptor grammars. We further extend logistic normal priors to shared logistic normal priors, which define a distribution over the collection of multinomials that represents a probabilistic grammar.
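As an illustration of how a logistic normal prior produces one multinomial of grammar rule probabilities: a Gaussian vector is drawn and mapped onto the probability simplex with the softmax transformation, so the Gaussian covariance can encode correlations between rules. The sketch below is illustrative only; the dimension, mean, and covariance are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

# Logistic normal draw for one multinomial of K grammar rules:
# sample a Gaussian vector, then map it to the simplex with a softmax.
K = 3                                     # number of rules sharing a left-hand side (hypothetical)
mu = np.zeros(K)                          # mean of the underlying Gaussian
Sigma = np.eye(K)                         # covariance; off-diagonal terms would encode rule correlations

eta = rng.multivariate_normal(mu, Sigma)  # Gaussian draw in R^K
theta = np.exp(eta) / np.exp(eta).sum()   # softmax: a point on the probability simplex

print(theta, theta.sum())                 # rule probabilities, summing to 1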

We test our estimation techniques on treebanks in eleven languages. Our empirical evaluation shows that these techniques outperform several Bayesian and non-Bayesian baselines.


Download thesis: [pdf] [pdf - doublespace]


This material is based upon work supported by the National Science Foundation under Grant No. 0915187.