
Here you can find a multilingual parallel corpus created from translations of the Bible. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices, the corpus is aligned (almost) at the sentence level; there are cases where two verses in one language are translated as one in another.


Following a similar effort by Philip Resnik and Mari Broman Olsen at the University of Maryland (website), I have encoded the text of each language in XML files using the Corpus Encoding Standard.
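As an illustration of the verse-level alignment, here is a minimal Python sketch that pairs verses from two languages by a shared identifier. The element and attribute names (`seg`, `type="verse"`, ids of the form `b.GEN.1.1`) and the two inline samples are assumptions made for this example; check them against the actual files before relying on them.

```python
import xml.etree.ElementTree as ET

# Toy samples. The markup (seg elements, type="verse", ids such as
# "b.GEN.1.1" for book.chapter.verse) is an ASSUMPTION for illustration.
ENGLISH = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div></div></body></text>"""

FRENCH = """<text><body>
<div id="b.GEN" type="book"><div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">Au commencement, Dieu créa les cieux et la terre.</seg>
</div></div></body></text>"""


def verses(xml_text):
    """Map verse id -> whitespace-normalized text for each verse element."""
    root = ET.fromstring(xml_text)
    return {
        seg.get("id"): " ".join("".join(seg.itertext()).split())
        for seg in root.iter("seg")
        if seg.get("type") == "verse"
    }


def align(source_xml, target_xml):
    """Pair verses present in both texts, keeping the source order."""
    src, tgt = verses(source_xml), verses(target_xml)
    return [(vid, text, tgt[vid]) for vid, text in src.items() if vid in tgt]


pairs = align(ENGLISH, FRENCH)
for verse_id, english, french in pairs:
    print(verse_id, "|", english, "|", french)
```

Verses that appear in only one of the two texts (here, the second English verse) are skipped, which is one simple way to handle partial texts and merged verses.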


The following table contains the XML Bibles in 56 languages (all the languages for which an electronic version was freely available online), along with information about each language from Ethnologue.

You can get all the files here (XML_Bibles.tar).
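Once the archive is downloaded, its contents can be inspected without unpacking it. The sketch below uses Python's standard `tarfile` module; the helper names are hypothetical, and both the `.xml` file extension and the UTF-8 encoding are assumptions about the archive rather than facts stated on this page.

```python
import tarfile


def xml_members(tar_path):
    """Return the sorted names of the .xml files inside the archive."""
    with tarfile.open(tar_path) as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".xml")
        )


def read_member(tar_path, name):
    """Read one file from the archive as text (UTF-8 is an assumption)."""
    with tarfile.open(tar_path) as tar:
        with tar.extractfile(name) as handle:
            return handle.read().decode("utf-8")
```

For example, `xml_members("XML_Bibles.tar")` would list the XML files in the downloaded archive, and `read_member` returns the text of one of them, ready to be parsed.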


Corpus Statistics
Total languages: 56
Non-Latin script: 15
<1M speakers: 14
Non-Indo-European: 30
Partial texts: 17

Click here for a map of the geographical distribution of the languages
| ISO 639-3 | Language | Family | Genus | Subgenus | Speakers | Script | Full | Parts |
|---|---|---|---|---|---|---|---|---|
| afr | Afrikaans | Indo-European | Germanic | West | 5,000,000 | Latin | Y | |
| als | Albanian | Indo-European | Albanian | Tosk | 3,000,000 | Latin | Y | |
| amh | Amharic | Afro-Asiatic | Semitic | South | 17,500,000 | Ethiopic | N | New Testament |
| arb | Arabic | Afro-Asiatic | Semitic | Central | 206,000,000 | Arabic | Y | |
| hye | Armenian | Indo-European | Armenian | | 6,400,000 | Armenian | N | Gen., Exod., Gosp. |
| eus | Basque | Basque | | | 700,000 | Latin | N | New Testament |
| bul | Bulgarian | Indo-European | Slavic | South | 9,000,000 | Cyrillic | Y | |
| ceb | Cebuano | Austronesian | Malayo-Polynesian | Philippine | 15,800,000 | Latin | Y | |
| cha | Chamorro | Austronesian | Malayo-Polynesian | Chamorro | 92,000 | Latin | N | Psalm, Gosp., Acts |
| cmn | Chinese | Sino-Tibetan | Sinitic | Chinese | 840,000,000 | Chinese | Y | |
| cop | Coptic | Afro-Asiatic | Egyptian | | Extinct | Coptic | N | New Testament |
| hrv | Croatian | Indo-European | Slavic | South | 5,500,000 | Latin | Y | |
| ces | Czech | Indo-European | Slavic | West | 9,500,000 | Latin | Y | |
| dan | Danish | Indo-European | Germanic | North | 5,500,000 | Latin | Y | |
| eng | English | Indo-European | Germanic | West | 328,000,000 | Latin | Y | |
| epo | Esperanto | Constructed | | | 1,000 | Latin | Y | |
| est | Estonian | Uralic | Finno-Ugric | Finno-Permic | 1,000,000 | Latin | Y | |
| fin | Finnish | Uralic | Finno-Ugric | Finno-Permic | 5,000,000 | Latin | Y | |
| fra | French | Indo-European | Italic | Romance | 58,000,000 | Latin | Y | |
| gla | Gaelic (Scottish) | Indo-European | Celtic | Insular | 67,000 | Latin | N | Gospel of Mark |
| deu | German | Indo-European | Germanic | West | 90,300,000 | Latin | Y | |
| ell | Greek | Indo-European | Greek | Attic | 13,000,000 | Greek | Y | |
| hat | Haitian Creole | Creole | | | 7,700,000 | Latin | Y | |
| heb | Hebrew | Afro-Asiatic | Semitic | Central | 5,300,000 | Hebrew | Y | |
| hun | Hungarian | Uralic | Finno-Ugric | Ugric | 12,500,000 | Latin | Y | |
| ind | Indonesian | Austronesian | Malayo-Polynesian | Malayo-Sumbawan | 23,100,000 | Latin | Y | |
| ita | Italian | Indo-European | Italic | Romance | 61,700,000 | Latin | Y | |
| jpn | Japanese | Japonic | | | 122,000,000 | Kanji | Y | |
| kab | Kabyle | Afro-Asiatic | Berber | Northern | 3,100,000 | Latin | N | New Testament |
| kor | Korean | Altaic(?) | | | 66,300,000 | Hangul | Y | |
| lat | Latin | Indo-European | Italic | Latino-Faliscan | Extinct | Latin | Y | |
| lav | Latvian | Indo-European | Baltic | Eastern | 1,500,000 | Latin | N | New Testament |
| lit | Lithuanian | Indo-European | Baltic | Eastern | 3,100,000 | Latin | Y | |
| glv | Manx | Indo-European | Celtic | Insular | 77,000 | Latin | N | Esth., Jonah, Gosp. |
| mri | Maori | Austronesian | Malayo-Polynesian | Central-Eastern | 60,000 | Latin | Y | |
| mya | Myanmar (Burmese) | Sino-Tibetan | Tibeto-Burman | Lolo-Burmese | 32,300,000 | Myanmar | Y | |
| nor | Norwegian | Indo-European | Germanic | North | 4,600,000 | Latin | Y | |
| por | Portuguese | Indo-European | Italic | Romance | 178,000,000 | Latin | Y | |
| pot | Potawatomi | Algic | Algonquian | Central | 1,300,000 | Latin | N | Matthew, Acts |
| rmn | Romani | Indo-European | Indo-Iranian | Indo-Aryan | 710,000 | Latin | N | New Testament |
| ron | Romanian | Indo-European | Italic | Romance | 23,400,000 | Latin | Y | |
| rus | Russian | Indo-European | Slavic | East | 143,000,000 | Cyrillic | Y | |
| srp | Serbian | Indo-European | Slavic | South | 7,000,000 | Latin | Y | |
| spa | Spanish | Indo-European | Italic | Romance | 328,000,000 | Latin | Y | |
| swh | Swahili | Niger-Congo | Atlantic-Congo | Volta-Congo | 788,000 | Latin | N | New Testament |
| swe | Swedish | Indo-European | Germanic | North | 8,300,000 | Latin | Y | |
| arc | Syriac | Afro-Asiatic | Semitic | Central | Extinct | Syriac | N | New Testament |
| tgl | Tagalog | Austronesian | Malayo-Polynesian | Philippine | 23,900,000 | Latin | Y | |
| ttq | Tamajaq (Tuareg) | Afro-Asiatic | Berber | Tamasheq | 640,000 | Latin | N | Portions |
| tha | Thai | Tai-Kadai | Kam-Tai | Be-Tai | 20,300,000 | Thai | Y | |
| tur | Turkish | Altaic | Turkic | Southern | 50,000,000 | Latin | Y | |
| ukr | Ukrainian | Indo-European | Slavic | East | 37,000,000 | Cyrillic | N | New Testament |
| ppk | Uma | Austronesian | Malayo-Polynesian | Celebic | 20,000 | Latin | N | New Testament |
| vie | Vietnamese | Austro-Asiatic | Mon-Khmer | Viet-Muong | 68,600,000 | Latin | Y | |
| wol | Wolof | Niger-Congo | Atlantic-Congo | Atlantic | 4,000,000 | Latin | N | New Testament |
| xho | Xhosa | Niger-Congo | Atlantic-Congo | Volta-Congo | 7,800,000 | Latin | Y | |
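
Summary counts like those under Corpus Statistics can be derived from the table's columns. The following Python sketch shows one way to do that on a handful of rows copied from the table; the tuple layout and the `summarize` helper are hypothetical. How extinct languages were treated in the "<1M speakers" figure is not stated here, so this sketch simply leaves them out of that count.

```python
# (ISO 639-3, language, family, speakers, script, full text available)
# A few rows copied from the table; None marks an extinct language.
ROWS = [
    ("afr", "Afrikaans", "Indo-European", 5_000_000, "Latin", True),
    ("amh", "Amharic", "Afro-Asiatic", 17_500_000, "Ethiopic", False),
    ("cha", "Chamorro", "Austronesian", 92_000, "Latin", False),
    ("cop", "Coptic", "Afro-Asiatic", None, "Coptic", False),
    ("rus", "Russian", "Indo-European", 143_000_000, "Cyrillic", True),
]


def summarize(rows):
    """Count languages per category, mirroring the Corpus Statistics labels."""
    return {
        "total": len(rows),
        "non_latin_script": sum(r[4] != "Latin" for r in rows),
        # ASSUMPTION: extinct languages (None) are not counted here.
        "under_1m_speakers": sum(
            r[3] is not None and r[3] < 1_000_000 for r in rows
        ),
        "non_indo_european": sum(r[2] != "Indo-European" for r in rows),
        "partial_texts": sum(not r[5] for r in rows),
    }


stats = summarize(ROWS)
print(stats)
```

Running the same function over all 56 rows would give a way to cross-check the published figures, subject to the assumption about extinct languages noted above.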