Annie Louis: Software and Resources


A corpus of general and specific sentences

This corpus contains the annotations obtained for our paper:

Annie Louis and Ani Nenkova, Automatic identification of general and specific sentences by leveraging discourse annotations, Proceedings of IJCNLP, 2011.

Overview

The files below contain human annotations and automatic classifier predictions indicating whether each sentence is general or specific.

The annotations were obtained using Amazon Mechanical Turk for sentences from three different corpora: the Wall Street Journal, the Associated Press and the science section of the New York Times. We also developed an automatic classifier that makes the binary distinction between general and specific sentences with 75% accuracy. The predictions obtained using the best set of features are also included in these data files.

More details about the annotations and classifier can be found in our paper.

Data

There is one file below for each corpus in our annotation set. Each file contains approximately 300 sentences, and each sentence was annotated by 5 judges. The files are tab-separated, with the following fields:

  1. A global identifier number
  2. Filename
  3. Sentence number in the file (starts from 0)
  4. Sentence enclosed within quotes
  5. Binary prediction from the classifier (1 indicates general, -1 specific). The nonlexical features from our IJCNLP paper were used in the classifier.
  6. Predicted classifier confidence for the general class
  7. Predicted classifier confidence for the specific class
  8. Majority class given by annotators (gen - general, spec - specific)
  9. Number of annotators who agreed on the majority class. (A value of 2 indicates that there was no majority.)
  10. Number of annotators who assigned the class "general"
  11. Number of annotators who assigned the class "specific"
  12. Number of annotators who assigned the class "unknown"
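
The field layout above can be read with a few lines of Python. The following is a minimal sketch, not code distributed with the corpus. It assumes that the files have no header row, that the sentence in field 4 is wrapped in plain double quotes, and that the numeric fields parse as written (integers for fields 3, 5 and 9 to 12, decimals for fields 6 and 7). The names Record, parse_line, read_records and accuracy are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Record(NamedTuple):
    """One row of a data file; fields follow the 12-column list above."""
    global_id: str        # 1. global identifier (kept as text)
    filename: str         # 2. source filename
    sentence_index: int   # 3. sentence number in the file, starting at 0
    sentence: str         # 4. sentence text, surrounding quotes removed
    prediction: int       # 5. classifier output: 1 general, -1 specific
    conf_general: float   # 6. classifier confidence for "general"
    conf_specific: float  # 7. classifier confidence for "specific"
    majority: str         # 8. annotator majority: "gen" or "spec"
    n_agree: int          # 9. annotators agreeing on the majority class
    n_general: int        # 10. annotators who chose "general"
    n_specific: int       # 11. annotators who chose "specific"
    n_unknown: int        # 12. annotators who chose "unknown"


def parse_line(line: str) -> Record:
    """Split one tab-separated line into a Record."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, found {len(cols)}")
    sentence = cols[3]
    # Assumption: the enclosing quotes are plain ASCII double quotes.
    if len(sentence) >= 2 and sentence[0] == '"' and sentence[-1] == '"':
        sentence = sentence[1:-1]
    return Record(
        cols[0], cols[1], int(cols[2]), sentence,
        int(cols[4]), float(cols[5]), float(cols[6]), cols[7],
        int(cols[8]), int(cols[9]), int(cols[10]), int(cols[11]),
    )


def read_records(path: str, encoding: str = "utf-8") -> Iterator[Record]:
    """Yield one Record per non-empty line (assumes no header row)."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def accuracy(records: Iterable[Record]) -> Optional[float]:
    """Fraction of sentences where the prediction matches the majority label.

    Sentences whose agreement count is 2 (no majority) are skipped.
    Returns None if no sentence can be scored.
    """
    label_to_prediction = {"gen": 1, "spec": -1}
    scored = [r for r in records
              if r.n_agree > 2 and r.majority in label_to_prediction]
    if not scored:
        return None
    correct = sum(1 for r in scored
                  if r.prediction == label_to_prediction[r.majority])
    return correct / len(scored)
```

Calling accuracy(read_records(path)) on one of the data files would then give the agreement between the classifier predictions in field 5 and the annotator majority in field 8. Sentences without a majority are skipped here as a design choice of this sketch; the paper may treat them differently.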

Contact

Please send any questions or suggestions to:

Annie Louis
alouis at inf dot ed dot ac dot uk