Simple English Wikipedia data sets

Wikipedia simplification data set using revisions

This data set contains revision histories made to Simple English Wikipedia, where the author of the revision has marked it as a simplification or grammatical correction. The archive contains 14,831 pairs of revision edits, totalling over 80,000 sentences. The data was taken from the Simple English Wikipedia database dated 2010-09-13. A complete description of the extraction process can be found in Woodsend and Lapata (2011, EMNLP).

Wikipedia revisions data (tar.gz)

Inside the archive, each revision is stored as a pair of files: *.old and *.new. The revised sentences are stored in the .new file (only the new sentences, not the complete article). We have identified sentences in the previous revision of that article that have been modified, and these are stored in the .old file with the same filename.

Sentence splitting has been done, so each sentence is a separate line. The text has also been tokenized to separate punctuation from words, and the output should be compatible with the Stanford parser.

In addition, many of the revisions have a third file *.lines. We have aligned individual sentences in the .old and .new files where there is significant content overlap. The .lines file shows the sentence alignments: each line gives an alignment, in the format:

<.old line number> \t <.new line number>
Line numbers start from zero. Note that a sentence can align to more than one revised sentence, since it is common when simplifying to split a long sentence into several shorter ones.
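As a sketch, one revision pair could be loaded like this. The function and the file-name stem are hypothetical; the only assumptions taken from the description above are the .old/.new/.lines extensions, one sentence per line, and tab-separated zero-based line numbers in the .lines file.

```python
# Sketch: read one revision pair from the archive.
# Each line of the .lines file is "<old line>\t<new line>", zero-based.

def read_alignments(stem):
    """Return (old_sentence, new_sentence) pairs for one revision,
    given a path stem shared by the .old, .new and .lines files."""
    with open(stem + ".old") as f:
        old_sents = [line.rstrip("\n") for line in f]
    with open(stem + ".new") as f:
        new_sents = [line.rstrip("\n") for line in f]
    pairs = []
    with open(stem + ".lines") as f:
        for line in f:
            i, j = map(int, line.split("\t"))
            pairs.append((old_sents[i], new_sents[j]))
    return pairs
```

Because a source sentence may align to several revised sentences, the same .old line number can appear on multiple lines of the .lines file, and the same old sentence will then appear in multiple pairs.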

Wikipedia simplification data set using alignments

This data set contains alignments made between Simple English Wikipedia and corresponding articles in the Main English Wikipedia. The archive contains aligned sentences from 15,000 articles, totalling over 3,250,000 sentences. The data was taken from the Simple English Wikipedia database dated 2010-09-13, and the Main English Wikipedia database of 2010-09-16. This does not represent all the alignments that could be made from these databases; we chose to truncate the data set to make it comparable (in terms of source data) to the revisions data set above. To align the sentences within the paired articles, we performed macro alignment (at paragraph level) followed by micro alignment (at sentence level), using tf.idf scores to measure similarity (Barzilay and Elhadad 2003; Nelken and Shieber 2006). A complete description of the extraction process can be found in Woodsend and Lapata (2011, EMNLP).

Wikipedia alignments data (tar.gz)

Inside the archive, there is a pair of files, aligned.main and aligned.simple, containing sentences from MainEW and SimpleEW articles respectively. Sentence splitting has been done and the text tokenized as with the above data set, so each sentence is a separate line. The two files have the same number of lines, and the n-th lines of the two files correspond to each other.
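Since the two files are line-aligned, the corpus can be iterated in parallel with a simple generator. This is a minimal sketch; the function name and paths are illustrative, while the file names aligned.main and aligned.simple come from the archive description.

```python
# Sketch: iterate the sentence-aligned corpus in parallel.
# The n-th line of aligned.main corresponds to the n-th line of aligned.simple.

def read_parallel(main_path, simple_path):
    """Yield (main_sentence, simple_sentence) pairs, one per line."""
    with open(main_path) as main_f, open(simple_path) as simple_f:
        for main_line, simple_line in zip(main_f, simple_f):
            yield main_line.rstrip("\n"), simple_line.rstrip("\n")
```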

Wikipedia QTSG simplification rules

This set of files contains the quasi-synchronous tree substitution grammar (QTSG) rules that we learned from the revision histories dataset above. Again, please see Woodsend and Lapata (2011, EMNLP) for a complete description of how we extracted these rules.

Wikipedia QTSG simplification rules (tar.gz)

The rules have been written out to text files as synchronous pairs of productions. A line contains a source production and a target production, separated by a tab character. The root non-terminal within a rule is given in this format:
< non-terminal symbol > / < dependency label >
where the symbols and dependency labels are those produced by the Stanford parser. In our implementation, we actually ignore the dependency labels.

For the produced non-terminals in the rule, we augment the format to:
< non-terminal symbol > / < dependency label > # < link number or terminal >
The link numbers link up non-terminal symbols on the source side with non-terminal symbols on the target side, with indices starting from 1. A link number of 0 indicates no link; generally this means that the node on the source side will be deleted. If a word or punctuation terminal follows the # character, the terminal must match exactly for the rule to apply.
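The rule-line and node formats described above could be parsed as follows. This is a sketch under the stated format only; the function names and the example strings in the comments and tests are hypothetical, not taken from the archive.

```python
# Sketch: parse a QTSG rule line and its produced nodes.
# A rule line is "<source production>\t<target production>".
# A produced node is "<non-terminal>/<dependency label>#<link or terminal>",
# e.g. the hypothetical nodes "NP/nsubj#1" (linked) or "DT/det#the" (terminal).

def parse_rule(line):
    """Split a rule line into (source, target) productions."""
    source, target = line.rstrip("\n").split("\t")
    return source, target

def parse_node(node):
    """Split a produced node into (symbol, dep_label, link), where link is
    an int link number (0 = unlinked, i.e. likely deleted) or a terminal."""
    head, _, link = node.partition("#")
    symbol, _, dep_label = head.partition("/")
    if link.isdigit():
        return symbol, dep_label, int(link)   # link number into the other side
    return symbol, dep_label, link            # exact-match terminal
```

Root non-terminals, which carry no # part, would need only the symbol/label split; and as noted above, the dependency labels can be ignored.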

There are three files inside the archive:

Please contact me (Kristian Woodsend) if you have questions regarding these data sets.