Data Intensive Acquisition of Rhetorical and Temporal Structure (DART)

I was PI on DART, funded by EPSRC, from 2001 to 2004.

Together with Mirella Lapata and Caroline Sporleder, I have designed, implemented and evaluated statistical models of the discourse structure of narrative text and the temporal order of its events.

To overcome the sparse-data problem that supervised learning faces, we combine unsupervised learning with supervised learning over automatically labelled training examples, which are harvested from massive online resources such as the Web and the British National Corpus (BNC). Our models exploit the relationship between rhetorical relations and discourse cue phrases (e.g., "but" indicates Contrast, and "because" indicates Explanation). We use probabilistic modelling to combine multiple sources of linguistic knowledge for estimating discourse structure; the features are combined automatically by training on large corpora such as the BNC. The approach therefore applies to domain-independent narrative text.
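
The idea of automatically labelled training examples can be illustrated with a minimal Python sketch. This is not the DART implementation: the function name label_example, the regular expression, and the two-entry cue table are hypothetical, and only the "but"/Contrast and "because"/Explanation pairings come from the description above. The sketch assumes the cue phrase sits between two spans of a single sentence; it labels the pair with the relation the cue signals and removes the cue, so that a model trained on such pairs has to rely on other features.

```python
import re

# Hypothetical cue-phrase table. Only these two pairings are taken from the
# text above; a real system would need a much larger, vetted inventory.
CUE_TO_RELATION = {
    "but": "Contrast",
    "because": "Explanation",
}

# Match "<left span>[,] <cue> <right span>[.!?]" with the cue as a whole word.
_CUE_PATTERN = re.compile(
    r"^(?P<left>.+?),?\s+\b(?P<cue>"
    + "|".join(re.escape(cue) for cue in CUE_TO_RELATION)
    + r")\b\s+(?P<right>.+?)[.!?]?$",
    re.IGNORECASE,
)


def label_example(sentence):
    """Return (left_span, right_span, relation) with the cue removed.

    Returns None when the sentence has no sentence-internal cue phrase.
    """
    match = _CUE_PATTERN.match(sentence.strip())
    if match is None:
        return None
    relation = CUE_TO_RELATION[match.group("cue").lower()]
    return match.group("left").strip(), match.group("right").strip(), relation


if __name__ == "__main__":
    for text in [
        "The road was icy, but she drove on.",
        "He stayed home because it rained.",
        "She sang.",
    ]:
        print(label_example(text))
```

Dropping the cue from the extracted spans is a deliberate choice in this sketch: the labelled pairs are meant to stand in for cases where the relation is not explicitly marked.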

Segmented Discourse Representation Theory (SDRT) informs this work in a number of ways. For instance, it provides the basis for more accurate smoothing over sparse data. It has also guided the selection and motivation of the features included in the model.