Software and Data Sets
For much of my career, I've needed to develop new ways of annotating language corpora and searching the annotations. I release software and data openly wherever possible. This page provides links to some of the more popular software and data sets that were the result.
AMI Meeting Corpus
The AMI Meeting Corpus contains 100 hours of multimodal meeting recordings, 70% of which use a standardized design team role-play to obtain uniform data, with the rest being naturally occurring meetings. The role-play data is annotated for a very wide range of verbal and non-verbal behaviours. The corpus was released under a Creative Commons ShareAlike License in June 2006. It is aimed at a wide audience, from speech and video processing researchers to organizational psychologists and linguists. There is an auxiliary data set, the AMIDA Meeting Corpus, that contains a smaller amount of similar data in which one person collaborates with a face-to-face group from a remote location. The projects that produced them are described on the AMI Project website. In addition to numerous individual projects, the new European "network of excellence" in social signal processing plans to use the corpus for some of its work.
Switchboard Corpus in NXT Format
As part of supporting a set of projects at Edinburgh that were all using the Switchboard Corpus, we've pulled together as many Switchboard annotations as we could find and put them into NXT format, as well as authoring a few of our own. The Linguistic Data Consortium is distributing them under a Creative Commons Share-Alike license, which means that anyone who has them is free to distribute them under the same terms. Local Informatics users can find them at /group/corpora/public/switchboard/nxt/. There is a website describing the annotations and how to use them, with contact details for anyone with questions. There is also a recent (2010) journal paper about the data release. Please get in touch if you have other annotations you want to contribute, especially if they arise under the "Share-Alike" license condition. Where possible, we intend to add them to the current data set so that everything can be distributed together.
The NITE XML Toolkit
The NITE XML Toolkit is open source software that supports the development and analysis of multimodal language corpora. Using a data model that allows annotations to relate structurally and temporally, it provides library functions (in Java) for data handling, query (using a language designed to match the data model), and interface components. It comes with a number of configurable end-user interfaces for common tasks like dialogue act and named entity annotation. Although many of its features relate to signals, some people use it on text corpora to support unusual annotations or several kinds of annotation at once.
HCRC Map Task Corpus
The HCRC Map Task Corpus is quite old now, but people still find it useful because it's one of the few dialogue corpora to have a wide range of annotations all available in one place. The website includes an NXT format release of all of the existing annotations and the audio, which was previously available only on CD. It's impossible to get a full sense of its users. Most are engaged in research and in teaching linguistics, but I've even seen it used as a resource for teaching English as a foreign language.