Lapata, Maria. 2001. The Acquisition and Modeling of Lexical Knowledge: A Corpus-based Investigation of Systematic Polysemy. PhD Thesis, University of Edinburgh.

This thesis deals with the acquisition and probabilistic modeling of lexical knowledge. A considerable body of work in lexical semantics concentrates on describing and representing systematic polysemy, i.e., the regular and predictable meaning alternations to which certain classes of words are subject. Although the prevalence of the phenomenon has long been recognized, systematic empirical studies of regular polysemy are largely absent, both with respect to the acquisition of systematically polysemous lexical units and the disambiguation of their meaning.

The present thesis addresses both tasks. First, we use insights from linguistic theory to guide and structure the acquisition of systematically polysemous units from domain-independent, wide-coverage text. Second, we constrain ambiguity by developing a probabilistic framework which provides a ranking over the range of meanings for systematically polysemous words in the absence of discourse context.

We focus on meaning alternations with syntactic effects and exploit the correspondence between meaning and syntax to inform the acquisition process. The acquired information is useful for empirically testing and validating linguistic generalizations, extending their coverage and quantifying the degree to which they are productive. We acquire lexical semantic information automatically using partial parsing and a heuristic approach which exploits fixed correspondences between surface syntactic cues and lexical meaning. We demonstrate the generality of our proposal by applying it to verbs and their complements, adjective-noun combinations, and noun-noun compounds. For each phenomenon we rely on insights from linguistic theory: for verbs we exploit Levin's (1993) influential classification of verbs on the basis of their meaning and syntactic behavior; for compound nouns we make use of Levi's (1978) classification of semantic relations; and for adjective-noun combinations we draw on Vendler's (1968) and Pustejovsky's (1995) generalizations about adjectival meaning.
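
As an illustration of this kind of heuristic, cue-based acquisition, the following minimal sketch (not the implementation used in the thesis; the shallow-parsed input format and the verb-object extraction rule are assumptions made for exposition) pairs each verb with the head noun of the noun chunk that follows it and aggregates the resulting tuples into a frequency distribution:

from collections import Counter

# Toy shallow-parsed input: each sentence is a list of (token, POS, chunk) triples,
# standing in for the output of a partial parser (the format is assumed here).
sentences = [
    [("she", "PRP", "NP"), ("finished", "VBD", "VP"),
     ("the", "DT", "NP"), ("novel", "NN", "NP")],
    [("he", "PRP", "NP"), ("finished", "VBD", "VP"),
     ("his", "PRP$", "NP"), ("beer", "NN", "NP")],
]

def verb_object_pairs(sentence):
    """Heuristic cue: a verb immediately followed by a noun chunk yields a
    (verb, head noun) pair, the head being the last noun of that chunk."""
    pairs = []
    for i, (tok, pos, chunk) in enumerate(sentence):
        if pos.startswith("VB"):
            head = None
            for tok2, pos2, chunk2 in sentence[i + 1:]:
                if chunk2 != "NP":
                    break
                if pos2.startswith("NN"):
                    head = tok2
            if head is not None:
                pairs.append((tok, head))
    return pairs

counts = Counter(p for s in sentences for p in verb_object_pairs(s))
print(counts)  # each extracted (verb, object) pair occurs once

In the thesis the cues are tailored to each phenomenon (verb complements, adjective-noun combinations, noun-noun compounds); the sketch only shows the general shape of mapping surface syntax to co-occurrence counts.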

We present a simple probabilistic model that uses the acquired distributions to select the dominant meaning from a set of meanings arising from syntactically related word combinations. Default meaning, that is, the dominant meaning of polysemous words in the absence of explicit contextual information to the contrary, is modeled probabilistically in a Bayesian framework which combines observed linguistic dependencies (in the form of conditional probabilities) with linguistic generalizations (in the form of prior probabilities derived from classifications such as Levin's (1993)). Our studies explore a range of model properties: (a) its generality, (b) the representation of the phenomenon under consideration (i.e., the choice of the model variables), (c) the simplification of its parameter space through independence assumptions, and (d) the estimation of the model parameters. Our findings show that the model is general enough to account for different types of lexical units (verbs and their complements, adjective-noun combinations, and noun-noun compounds) under varying assumptions about data requirements (sufficient versus sparse data) and meaning representations (corpus internal or corpus external).
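
To make the ranking concrete, here is a minimal sketch of a prior-times-likelihood scorer of this general shape (the candidate meanings, counts, priors, and smoothing scheme are invented for illustration and are not the model or data reported in the thesis):

from collections import Counter

# Hypothetical candidate meanings for the covert event in "finished the novel".
candidates = ["read", "write"]

# Prior probabilities over meanings, standing in for class-based linguistic
# generalizations (e.g. as might be derived from a Levin-style classification).
prior = {"read": 0.7, "write": 0.3}

# Observed (meaning, object) co-occurrence counts, standing in for the
# distributions acquired from the parsed corpus.
pair_counts = Counter({("read", "novel"): 40, ("write", "novel"): 10})
meaning_counts = Counter({"read": 200, "write": 120})
vocab_size = 5000  # assumed object vocabulary size, used for Laplace smoothing

def p_object_given_meaning(obj, meaning):
    """Laplace-smoothed conditional probability P(object | meaning)."""
    return (pair_counts[(meaning, obj)] + 1) / (meaning_counts[meaning] + vocab_size)

def default_meaning(obj):
    """Rank candidate meanings by prior * likelihood and return the best one."""
    scores = {m: prior[m] * p_object_given_meaning(obj, m) for m in candidates}
    return max(scores, key=scores.get), scores

best, scores = default_meaning("novel")
print(best, scores)  # 'read' wins: higher prior and higher co-occurrence count

The design point the sketch illustrates is the one made above: corpus-derived conditional probabilities supply the observed linguistic dependencies, while the prior encodes class-based linguistic generalizations, and the highest-scoring meaning is taken as the default reading.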


@PhdThesis{Lapata:thesis:01,
  author = {Maria Lapata},
  title  = {The Acquisition and Modeling of Lexical Knowledge:
            A Corpus-based Investigation of Systematic Polysemy},
  school = {University of Edinburgh},
  year   = 2001
}