USA phone: +1-858-356-4124
UK phone: +44-131-516-7940
PhD in language processing for information retrieval
I recently completed my PhD, working with Bruce Croft, Victor Lavrenko and Jon Oberlander on applying syntactic and semantic language processing to improve open-domain, ad hoc information retrieval performance for natural language queries. I have demonstrated significantly improved performance over state-of-the-art search techniques using graph-based methods to rank and select syntactic/semantic word dependencies. Selected terms are applied in a linear feature model using a combined language modelling/inference network system (Indri). Retrieval effectiveness is as good as, or better than, the best published results using complex optimisation methods for descriptive queries, but requires only 1-5 well-chosen phrases in addition to a unigram query representation. These 'killer phrases' are identified using semantic relations from a dependency parse, supplemented with distributional constraints from a local affinity graph. The improvements in retrieval performance appear to be linked to shallow capture of natural inference about long-distance semantic word relations.
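To give a rough idea of what a query made of unigrams plus a handful of phrases looks like, here is a minimal sketch that assembles an Indri-style query string. It is an illustration only, not the formulation used in the thesis: the function name, the 0.8/0.2 weighting and the unordered window of 8 are all assumptions chosen for the example. The #weight, #combine and #uwN operators are standard Indri query operators.

```python
def build_indri_query(terms, phrases, unigram_weight=0.8, window=8):
    """Combine a unigram query with a few phrases into one Indri-style query.

    Hypothetical helper: the weights and window size are illustrative
    assumptions, not values taken from the thesis.
    """
    if not terms:
        raise ValueError("at least one query term is required")

    # Unigram representation of the whole query.
    unigrams = "#combine(" + " ".join(terms) + ")"
    if not phrases:
        return unigrams

    # Each selected phrase becomes an unordered window (#uwN) over its words.
    windows = " ".join(
        "#uw{}({})".format(window, " ".join(phrase)) for phrase in phrases
    )
    phrase_part = "#combine(" + windows + ")"

    # Linear combination of the unigram and phrase components.
    return "#weight({:.2f} {} {:.2f} {})".format(
        unigram_weight, unigrams, 1.0 - unigram_weight, phrase_part
    )


query = build_indri_query(
    ["hubble", "telescope", "achievements"],
    [("hubble", "telescope")],
)
print(query)
```

In this sketch the phrase list is supplied by hand; selecting which 1-5 phrases to include is the part that the dependency parse and affinity graph address.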
From May 2011 to July 2014, I worked at the Centre for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, with Bruce Croft. I was also a visiting researcher at the CIIR in summer 2010, and researched linguistic event extraction at the Nara Institute of Science and Technology (NAIST) in summer 2009. Early PhD work was in legal retrieval, working with Burkhard Schafer in the Edinburgh School of Law.
Notice on NTCIR-6 patent retrieval corpora: My research on legal retrieval using the NTCIR patent dataset showed that, of ~33,000 patents identified as relevant to 1,000 sample queries, 1,396 are missing from the source collection. This error has been confirmed by the NTCIR organisers, and the list of missing documents can be found here.
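If you want to run the same kind of check against your own copy of the collection, it comes down to a set difference between the judged-relevant document IDs and the IDs actually present. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, and the judgment lines are assumed to be whitespace-separated with the document ID in a field you specify, which may not match the NTCIR file layout exactly.

```python
def load_relevant_ids(judgment_lines, doc_field):
    """Collect document IDs from whitespace-separated judgment lines.

    Assumption: each non-empty line holds the document ID in column
    `doc_field` (0-based). Adjust this to the real file layout.
    """
    ids = set()
    for line in judgment_lines:
        fields = line.split()
        if len(fields) > doc_field:
            ids.add(fields[doc_field])
    return ids


def find_missing_documents(relevant_ids, collection_ids):
    """Return the judged-relevant IDs that are absent from the collection."""
    return sorted(set(relevant_ids) - set(collection_ids))


# Toy data for illustration only; these are not real NTCIR identifiers.
judgments = ["001 A PAT-1", "001 A PAT-2", "002 B PAT-3", ""]
collection = {"PAT-1", "PAT-3"}

relevant = load_relevant_ids(judgments, doc_field=2)
print(find_missing_documents(relevant, collection))
```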
Language engineering and classification
My dissertation, supervised by Jon Oberlander, explores clustering song genres using lyrics. Text processing and sentiment analysis techniques were used to extract 140 language features for analysis using Kohonen self-organising maps (SOMs). These maps were evaluated against the clustering of eight hand-selected song pairs. Other projects included predicting web queries using query log language models and building a named entity recognition system using a maximum entropy classifier.