2000, 15MB, download v0.91, public domain.
This corpus has been adapted from the de-news web site. Volunteers collected about five to ten news items per day from German radio broadcasts and translated them into English. The translation quality varies, but it is overall very good. We processed the corpus into a format that is more suitable for machine translation research.
The goal of the processing was to generate sentence-aligned text. For this purpose we extracted matching news items and labeled them with corresponding document IDs. Using a preprocessor, we separated out punctuation and identified sentence boundaries. Sentences were then aligned using a tool based on the Gale and Church algorithm.
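To illustrate the idea behind length-based sentence alignment, here is a minimal Python sketch in the spirit of the Gale and Church approach. It is not the tool used to build this corpus. The function names, the example sentences, and the use of the mean sentence length in the variance term are assumptions made for illustration; the priors and the variance constant are the values commonly quoted for that algorithm and should be treated as tunable. The sketch scores each candidate grouping of sentences by how well the character lengths on both sides agree, then finds the cheapest sequence of groupings with dynamic programming.

```python
import math

# Assumed priors for each alignment type (source count, target count),
# as commonly quoted for the Gale and Church algorithm.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
MEAN_RATIO = 1.0  # assumed expected target characters per source character
VARIANCE = 6.8    # assumed variance of the length difference per character


def match_cost(src_len, tgt_len, prior):
    """Negative log probability of aligning two spans with these lengths."""
    mean = (src_len + tgt_len / MEAN_RATIO) / 2.0
    if mean == 0:
        delta = 0.0
    else:
        delta = (tgt_len - src_len * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-tailed probability of a standard normal deviate at least this large.
    prob = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior) - math.log(prob)


def align(src, tgt):
    """Return a list of (source sentences, target sentences) groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try extending the best path to (i, j) by each alignment type.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + match_cost(src_len, tgt_len, prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    groups.reverse()
    return groups


if __name__ == "__main__":
    # Made-up sentences for illustration, not taken from the corpus.
    german = ["Der Bundestag hat heute getagt .", "Das Wetter bleibt kalt ."]
    english = ["Parliament met today .", "The weather stays cold ."]
    for src_group, tgt_group in align(german, english):
        print(" ".join(src_group), "|||", " ".join(tgt_group))
```

Because the method looks only at sentence lengths, it cannot detect a translation that has the right length but the wrong content, which is one reason manual checking remains useful.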
We spent only about one hour per month of data manually editing the corpus to check that news items match up. There are certainly still misaligned news items and inadequate translations, so additional editing would be helpful. Please contact us if you would be willing to assist with this.