Annie Louis


Wikipedia Biography Translation Corpus

This corpus contains the sentence pairs used in the machine translation experiments reported in our paper:

Annie Louis and Bonnie Webber, Structured and Unstructured Cache Models for SMT Domain Adaptation, Proceedings of EACL, 2014. [pdf]


Overview

The corpus contains French-to-English translations of biographies compiled from Wikipedia. We collected articles which are marked with a “Translation template” in Wikipedia metadata. This template marks a page as translated from a corresponding page in a different language and links to the source article. (Note that these article pairs are not articles written separately on the same topic in the two languages.)

For example, the English Wikipedia biography of the French actor "Jocelyn Quivrin" is a translation of the corresponding French article. The metadata indicating that the English version is a translation appears on the Talk page of the English article.

We collected pairs of French-English pages with this template and discarded those which do not belong to the Biography topic (using Wikipedia metadata). Note, however, that these article pairs are not very close translations. During translation an editor may omit or add information and also reorganize parts of the article. We therefore filtered out the paired documents which differ significantly in length. We used LFAligner to create sentence alignments for the remaining document pairs. We constrained the alignments to be within documents, but since section headings were not maintained in translations, we did not further constrain alignments to be within sections. We manually corrected the resulting alignments and kept only documents which have good alignments and manually marked topic segments (Wikipedia section headings). Unaligned sentences were filtered out. The articles are 12 to 87 sentences long and contain 5 topic segments on average.
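The length-based filtering step can be illustrated with a minimal Python sketch. The length measure (number of sentences per document) and the threshold `max_ratio=1.5` are assumptions made for illustration only; the exact criterion used to build the corpus is not stated here. The function names are hypothetical.

```python
def length_ratio(n_french, n_english):
    """Ratio of the longer document to the shorter one (always >= 1)."""
    shorter = min(n_french, n_english)
    if shorter == 0:
        return float("inf")  # an empty document can never be a close match
    return max(n_french, n_english) / shorter


def filter_by_length(doc_pairs, max_ratio=1.5):
    """Keep document pairs whose lengths are comparable.

    doc_pairs: iterable of (french_sentences, english_sentences) tuples,
               where each element is a list of sentences.
    max_ratio: hypothetical threshold, not the value used for the corpus.
    """
    return [
        (french, english)
        for french, english in doc_pairs
        if length_ratio(len(french), len(english)) <= max_ratio
    ]
```

A ratio of the longer to the shorter document treats both translation directions symmetrically, so a pair is dropped whether the editor shortened or expanded the article.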


Data

Our EACL paper splits this data into a tuning corpus containing 430 sentence pairs and a test corpus containing 1008 sentence pairs.

Download them here: tuningCorpus.txt and testCorpus.txt

Both files use the same format. Each line has 5 tab-separated columns:

  1. A document identifier. Sentences with the same identifier belong to the same document.
  2. A sentence number within the document which can be used to obtain the ordering of sentences. Note that the numbers are not contiguous since sentences that did not align were filtered out.
  3. The Wikipedia section heading under which the sentence appears.
  4. The French sentence.
  5. The English translation of the French sentence.
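
As an illustration of the format above, the following Python sketch reads a corpus file and groups the sentence pairs by document, ordered by sentence number. It assumes the files are UTF-8 encoded, have no header row, and use integer sentence numbers; none of these details are stated above. The names `SentencePair`, `parse_corpus`, and `read_corpus` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class SentencePair(NamedTuple):
    doc_id: str    # column 1: document identifier
    sent_num: int  # column 2: sentence number (may have gaps)
    section: str   # column 3: Wikipedia section heading
    french: str    # column 4: French sentence
    english: str   # column 5: English translation


def parse_corpus(lines):
    """Group tab-separated lines into documents keyed by identifier."""
    docs = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 columns, got {len(fields)}"
            )
        doc_id, sent_num, section, french, english = fields
        docs[doc_id].append(
            SentencePair(doc_id, int(sent_num), section, french, english)
        )
    # Sentence numbers give the ordering within each document.
    for pairs in docs.values():
        pairs.sort(key=lambda pair: pair.sent_num)
    return dict(docs)


def read_corpus(path):
    """Read a corpus file (assumed UTF-8) into per-document lists."""
    with open(path, encoding="utf-8") as handle:
        return parse_corpus(handle)
```

For example, `read_corpus("tuningCorpus.txt")` would return a dictionary mapping each document identifier to its ordered list of sentence pairs. Because the sentence numbers are not contiguous, the sketch sorts by them rather than using them as list indices.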


Contact

Please send any questions or comments to:

Annie Louis
alouis@inf.ed.ac.uk