Ranking Word Senses for Disambiguation: Models and Applications, Research Project Grant, funded by the EPSRC, 09/2005-09/2008.

Principal Investigators: John Carroll, Diana McCarthy, and Mirella Lapata

Research Fellow: Rob Koeling

PhD Student: Sam Brody

When faced with the question "Which plants thrive in chalky soil?" humans have no trouble understanding that the plants are floral rather than industrial. Furthermore, humans recognise that the answers "Sweetcorn and cabbage family vegetables do well on chalky soil", "Sweetcorn and cabbage grow well on chalky ground", and "Maize and cabbage-like vegetables grow well on chalky soil" are all paraphrases and mean more or less the same thing. Semantic interpretation and disambiguation are performed effortlessly by humans but pose great difficulties for computer-based applications that extract, filter and manipulate information from textual data, such as Question Answering and Information Retrieval. As the amount of text stored by businesses and available over the Internet grows rapidly, such applications are becoming increasingly important and timely, and improved methods for identifying the intended meaning of words (word senses) are a key technology for them.

The most accurate techniques for word sense disambiguation (WSD) to date are those trained on text in which each word has been manually annotated with its intended sense. A major shortcoming of these methods, though, is that accuracy is strongly correlated with the quantity of training data available, and such data is in short supply because producing it is very labour-intensive. For many words the distribution of senses is highly skewed, and WSD systems work best when they take the most frequent sense into account. However, the most frequent sense of a word is often not known, particularly in domains (subject areas) in which no text has ever been manually annotated. In this project we will develop novel ways of estimating the frequency distributions of word senses from raw (unannotated) text. We will exploit these distributions in WSD systems that do not rely on the availability of hand-labelled resources, and will demonstrate the benefits of our methods by applying them to Question Answering.