This website contains supplementary material for our paper "Robust cross-lingual genre classification through comparable corpora", which was presented at the BUCC Workshop at LREC 2012 in Istanbul, Turkey.

The experiments in the paper use data from the Reuters Corpus Volume 1 and Volume 2 (RCV1 and RCV2), the Europarl corpus, and the JRC-ACQUIS corpus.

The scripts we used to extract, select, and clean the data can be found here. In addition, we used the unsupervised Punkt sentence tokenizer implemented in the Natural Language Toolkit (NLTK).
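
For readers unfamiliar with Punkt, the following is a minimal sketch of how the tokenizer can be trained and applied with NLTK. It is not the code used for the paper: the training text and the example document are made up for illustration, and in practice the model would be trained on a much larger sample of raw text from the target corpus and language.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical raw text, for illustration only. Punkt needs no labelled
# sentence boundaries: it learns abbreviations, collocations, and frequent
# sentence starters from unannotated text.
training_text = (
    "The committee met on Monday. It discussed the budget for next year. "
    "Several members raised objections. The vote was postponed until Friday."
)

# Passing raw text to the constructor trains the model on that text.
tokenizer = PunktSentenceTokenizer(training_text)

# Hypothetical document to split into sentences.
document = "The report was published today. It covers three languages."
sentences = tokenizer.tokenize(document)

for sentence in sentences:
    print(sentence)
```

Each returned sentence is a slice of the input string, so the original characters are preserved and only the boundaries are decided by the model.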

The texts we used in our experiments were sampled randomly from the source corpora using the above scripts. The identifiers can be found here: [Europarl] [Reuters] [JRC-ACQUIS]
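
As an illustration of what a repeatable random selection of documents can look like, here is a small sketch using only the Python standard library. It is an assumption about the general approach, not the script linked above: the function name `sample_identifiers`, the seed value, and the `doc-0000` style identifiers are all hypothetical.

```python
import random


def sample_identifiers(identifiers, k, seed=0):
    """Draw k identifiers uniformly at random, without replacement."""
    # A dedicated generator with a fixed seed makes the draw repeatable.
    rng = random.Random(seed)
    # Sorting the input first makes the result independent of input order.
    return sorted(rng.sample(sorted(identifiers), k))


# Hypothetical document identifiers, for illustration only.
all_ids = [f"doc-{i:04d}" for i in range(1000)]
selected = sample_identifiers(all_ids, k=10, seed=42)

for doc_id in selected:
    print(doc_id)
```

Publishing the resulting identifier lists, as done here, lets others rebuild the same samples without needing the seed or the sampling code.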

Authors: Philipp Petrenz and Bonnie Webber