This README file describes the data used in Woodsend and Lapata (ACL, 2010). There are two archives: training.tgz and test.tgz.

The training.tgz archive contains CNN articles and their corresponding highlights. Each document and its highlights file share the same prefix but have different suffixes. For example, the highlights for the article tiger.conservation2008.doc.txt are in tiger.conservation2008.hlights.txt. Highlights and documents are in different directories (training/hlights and training/docs, respectively).

The training files have been annotated as follows. At the beginning of each line there is a number: 1, 2, or 3. The number denotes whether the document sentence corresponds to a highlight. Specifically, label (1) means that the sentence must be in the highlights, label (2) that the sentence could be in the highlights, and label (3) that the sentence is not in the highlights.

The structure of the test.tgz archive is analogous to that of training.tgz. Again it has two directories (test/doc and test/hlights), each containing files corresponding to the documents and their highlights. The test files have no alignment annotations.

All files are tokenized, with one sentence per line.

If you use the data please cite:

@InProceedings{woodsend-lapata:2010:ACL,
  author    = {Woodsend, Kristian and Lapata, Mirella},
  title     = {Automatic Generation of Story Highlights},
  booktitle = {Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2010},
  address   = {Uppsala, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {565--574},
  url       = {http://www.aclweb.org/anthology/P10-1058}
}
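
As an illustration of the file naming and the label annotation described in this README, the following Python sketch maps a document file name to its highlights file name and splits one annotated training line into its label and sentence. It is not part of the released data. The function names are hypothetical, and the sketch assumes that the label is separated from the sentence by whitespace, which this README does not state explicitly.

```python
from typing import Tuple

DOC_SUFFIX = ".doc.txt"
HLIGHTS_SUFFIX = ".hlights.txt"

# Label meanings as given in this README.
LABELS = {
    1: "must be in the highlights",
    2: "could be in the highlights",
    3: "is not in the highlights",
}


def highlights_name(doc_name: str) -> str:
    """Return the highlights file name that shares a prefix with doc_name."""
    if not doc_name.endswith(DOC_SUFFIX):
        raise ValueError(f"not a document file name: {doc_name!r}")
    return doc_name[: -len(DOC_SUFFIX)] + HLIGHTS_SUFFIX


def parse_annotated_line(line: str) -> Tuple[int, str]:
    """Split one annotated training line into (label, tokenized sentence).

    Assumption: the label and the sentence are separated by whitespace.
    """
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected '<label> <sentence>': {line!r}")
    label = int(parts[0])
    if label not in LABELS:
        raise ValueError(f"unexpected label {label} in line: {line!r}")
    return label, parts[1]
```

For instance, highlights_name("tiger.conservation2008.doc.txt") returns "tiger.conservation2008.hlights.txt". The test files carry no labels, so parse_annotated_line applies only to the training documents.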