
Here you can find a multilingual parallel corpus created from translations of the Bible. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices, the corpus is aligned (almost) at the sentence level. (There are cases where two verses in one language are translated as one in another.)
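
As a rough illustration of the alignment idea, the sketch below pairs verses from two languages by a shared (book, chapter, verse) key. The key layout, the `align_verses` helper and the placeholder strings are assumptions made for this example and are not part of the corpus itself.

```python
from typing import Dict, List, Tuple

# Hypothetical key type: (book, chapter, verse), e.g. ("GEN", 1, 1).
VerseKey = Tuple[str, int, int]


def align_verses(
    source: Dict[VerseKey, str], target: Dict[VerseKey, str]
) -> List[Tuple[VerseKey, str, str]]:
    """Pair up verses that carry the same index in both languages.

    Keys missing from either side (for example, when two verses were merged
    into one in a translation) are skipped rather than guessed at.
    """
    return [(key, source[key], target[key]) for key in source if key in target]


# Placeholder strings stand in for real verse text.
lang_a = {
    ("GEN", 1, 1): "A: verse 1",
    ("GEN", 1, 2): "A: verse 2",
    ("GEN", 1, 3): "A: verse 3",
}
lang_b = {
    ("GEN", 1, 1): "B: verse 1",
    ("GEN", 1, 2): "B: verses 2 and 3 merged",
}

pairs = align_verses(lang_a, lang_b)
for key, text_a, text_b in pairs:
    print(key, "|", text_a, "|", text_b)
```

Note that the merged verse still pairs with only one verse on the other side, which is why the alignment is described as "almost" sentence-level. A stricter pipeline might drop or re-join such pairs.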


Following a similar effort by Philip Resnik and Mari Broman Olsen at the University of Maryland (website), I have encoded the text of each language in XML files using the Corpus Encoding Standard.
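
A minimal sketch of how such a file might be read is shown below. The element and attribute names (`seg`, `type="verse"`, ids such as `b.GEN.1.1`) and the absence of an XML namespace are assumptions for illustration only; check them against the actual files before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names, attributes and id scheme are assumed,
# and the verse text is placeholder content.
SAMPLE = """<cesDoc>
  <text><body>
    <div id="b.GEN" type="book">
      <div id="b.GEN.1" type="chapter">
        <seg id="b.GEN.1.1" type="verse">placeholder verse text one</seg>
        <seg id="b.GEN.1.2" type="verse">placeholder verse text two</seg>
      </div>
    </div>
  </body></text>
</cesDoc>"""


def read_verses(xml_text: str) -> dict:
    """Return a mapping from verse id to verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        verse_id = seg.get("id")
        if seg.get("type") == "verse" and verse_id:
            # itertext() also collects text nested inside child elements.
            verses[verse_id] = "".join(seg.itertext()).strip()
    return verses


verses = read_verses(SAMPLE)
for verse_id, text in verses.items():
    print(verse_id, text)
```

Keying each verse by its id gives a dictionary per language, and two such dictionaries can then be matched on their shared ids.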


The following table contains the XML Bibles in 100 languages (all the languages for which an electronic version was freely available online), along with information about each language from Ethnologue.


ISO 639-3 Language Family Genus Subgenus Speakers Script Full Parts
acu Achuar-Shiwiar Jivaroan 5,000 Latin N New Testament
afr Afrikaans Indo-European Germanic West 5,000,000 Latin Y
agr Aguaruna Jivaroan 38,300 Latin N New Testament
ake Akawaio Carib Northern East-West Guiana 4,500 Latin N New Testament
als Albanian Indo-European Albanian Tosk 3,000,000 Latin Y
amh Amharic Afro-Asiatic Semitic South 17,500,000 Ethiopic N New Testament
amu Amuzgo Oto-Manguean Amuzgoan 23,000 Latin N New Testament
arb Arabic Afro-Asiatic Semitic Central 206,000,000 Arabic Y
hye Armenian Indo-European Armenian 6,400,000 Armenian N Gen. Exod. Gosp.
djk Aukan Creole English based Atlantic 15,500 Latin N New Testament
bsn Barasana-Eduria Tucanoan Eastern Tucanoan Central 1,890 Latin N New Testament
eus Basque Basque 700,000 Latin N New Testament
bul Bulgarian Indo-European Slavic South 9,000,000 Cyrillic Y
cjp Cabécar Chibchan Talamanca 8,840 Latin N New Testament
cak Cakchiquel Mayan Quichean Greater Quichean 132,000 Latin N New Testament
cni Campa (Asháninka) Arawakan Maipuran Southern Maipuran 26,100 Latin N New Testament
kbh Camsá Equatorial (?) 4,770 Latin N New Testament
ceb Cebuano Austronesian Malayo-Polynesian Philippine 15,800,000 Latin Y
cha Chamorro Austronesian Malayo-Polynesian Chamorro 92,000 Latin N Psalm Gosp. Acts
chr Cherokee Iroquoian Southern Iroquoian 16,400 Cherokee N New Testament
chq Chinantec (Quiotepec) Oto-Manguean Chinantecan 8,000 Latin N New Testament
cmn Chinese Sino-Tibetan Sinitic Chinese 840,000,000 Chinese Y
cop Coptic Afro-Asiatic Egyptian Extinct Coptic N New Testament
hrv Croatian Indo-European Slavic South 5,500,000 Latin Y
ces Czech Indo-European Slavic West 9,500,000 Latin Y
dan Danish Indo-European Germanic North 5,500,000 Latin Y
dik Dinka Nilo-Saharan Eastern Sudanic Nilotic 450,000 Latin N New Testament
eng English Indo-European Germanic West 328,000,000 Latin Y
epo Esperanto Constructed 1,000 Latin Y
est Estonian Uralic Finno-Ugric Finno-Permic 1,000,000 Latin Y
ewe Ewe Niger-Congo Atlantic-Congo Volta-Congo 2,250,000 Latin N New Testament
pes Farsi (Persian) Indo-European Indo-Iranian Iranian 22,000,000 Arabic Y
fin Finnish Uralic Finno-Ugric Finno-Permic 5,000,000 Latin Y
fra French Indo-European Italic Romance 58,000,000 Latin Y
gla Gaelic (Scottish) Indo-European Celtic Insular 67,000 Latin N Gospel of Mark
gbi Galela West Papuan North Halmahera Galela-Loloda 79,000 Latin N New Testament
deu German Indo-European Germanic West 90,300,000 Latin Y
ell Greek Indo-European Greek Attic 13,000,000 Greek Y
guj Gujarati Indo-European Indo-Iranian Indo-Aryan 45,500,000 Gujarati N New Testament
hat Haitian Creole Creole 7,700,000 Latin Y
heb Hebrew Afro-Asiatic Semitic Central 5,300,000 Hebrew Y
hin Hindi Indo-European Indo-Iranian Indo-Aryan 180,000,000 Devanagari Y
hun Hungarian Uralic Finno-Ugric Ugric 12,500,000 Latin Y
isl Icelandic Indo-European Germanic North 230,000 Latin Y
ind Indonesian Austronesian Malayo-Polynesian Malayo-Sumbawan 23,100,000 Latin Y
ita Italian Indo-European Italic Romance 61,700,000 Latin Y
jai Jakalteko Mayan Kanjobalan-Chujean Kanjobalan 77,700 Latin N New Testament
jpn Japanese Japonic 122,000,000 Kanji Y
quc K'iche' Mayan Quichean-Mamean Greater Quichean 1,900,000 Latin N New Testament
kab Kabyle Afro-Asiatic Berber Northern 3,100,000 Latin N New Testament
kan Kannada Dravidian Southern Tamil-Kannada 35,300,000 Kannada Y
kor Korean Altaic(?) 66,300,000 Hangul Y
lat Latin Indo-European Italic Latino-Faliscan Extinct Latin Y
lav Latvian Indo-European Baltic Eastern 1,500,000 Latin N New Testament
lit Lithuanian Indo-European Baltic Eastern 3,100,000 Latin Y
dop Lukpa Niger-Congo Atlantic-Congo Volta-Congo 50,000 Latin N New Testament
plt Malagasy Austronesian Malayo-Polynesian Greater Barito 7,520,000 Latin Y
mal Malayalam Dravidian Southern Tamil-Kannada 35,400,000 Malayalam Y
mam Mam Mayan Quichean-Mamean Greater Mamean 200,000 Latin N New Testament
glv Manx Indo-European Celtic Insular 77,000 Latin N Esth. Jonah Gosp.
mri Maori Austronesian Malayo-Polynesian Central-Eastern 60,000 Latin Y
mar Marathi Indo-European Indo-Iranian Indo-Aryan 68,000,000 Devanagari Y
mya Myanmar (Burmese) Sino-Tibetan Tibeto-Burman Lolo-Burmese 32,300,000 Myanmar Y
nhg Nahuatl (Tetelcingo) Uto-Aztecan Southern Uto-Aztecan Aztecan 3,500 Latin N New Testament
nep Nepali Indo-European Indo-Iranian Indo-Aryan 11,100,000 Devanagari Y
nor Norwegian Indo-European Germanic North 4,600,000 Latin Y
ojb Ojibwa Algic Algonquian Central 20,000 Aboriginal Syllabics N New Testament
pck Paite (Chin) Sino-Tibetan Tibeto-Burman Kuki-Chin-Naga 78,800 Latin Y
pol Polish Indo-European Slavic West 36,600,000 Latin Y
por Portuguese Indo-European Italic Romance 178,000,000 Latin Y
pot Potawatomi Algic Algonquian Central 1,300,000 Latin N Matthew Acts
kek Q'eqchi' Mayan Quichean-Mamean Greater Quichean 400,000 Latin Y
quw Quichua Quechuan Quechua II B 20,000 Latin N New Testament
rmn Romani Indo-European Indo-Iranian Indo-Aryan 710,000 Latin N New Testament
ron Romanian Indo-European Italic Romance 23,400,000 Latin Y
rus Russian Indo-European Slavic East 143,000,000 Cyrillic Y
srp Serbian Indo-European Slavic South 7,000,000 Latin Y
jiv Shuar (Jivaro) Jivaroan 46,700 Latin N New Testament
slk Slovak Indo-European Slavic West 4,610,000 Latin Y
slv Slovene Indo-European Slavic South 1,730,000 Latin Y
som Somali Afro-Asiatic Cushitic East 8,340,000 Latin Y
spa Spanish Indo-European Italic Romance 328,000,000 Latin Y
swh Swahili Niger-Congo Atlantic-Congo Volta-Congo 788,000 Latin N New Testament
swe Swedish Indo-European Germanic North 8,300,000 Latin Y
arc Syriac Afro-Asiatic Semitic Central Extinct Syriac N New Testament
shi Tachelhit Afro-Asiatic Berber Northern 3,000,000 Arabic N New Testament
tgl Tagalog Austronesian Malayo-Polynesian Philippine 23,900,000 Latin Y
ttq Tamajaq (Tuareg) Afro-Asiatic Berber Tamasheq 640,000 Latin N Portions
tel Telugu Dravidian South-Central Telugu 69,600,000 Telugu Y
tha Thai Tai-Kadai Kam-Tai Be-Tai 20,300,000 Thai Y
tur Turkish Altaic Turkic Southern 50,000,000 Latin Y
ukr Ukrainian Indo-European Slavic East 37,000,000 Cyrillic N New Testament
ppk Uma Austronesian Malayo-Polynesian Celebic 20,000 Latin N New Testament
usp Uspanteco Mayan Quichean-Mamean Greater Quichean 3,000 Latin N New Testament
vie Vietnamese Austro-Asiatic Mon-Khmer Viet-Muong 68,600,000 Latin Y
wal Wolaytta Afro-Asiatic Omotic North 1,230,000 Ethiopic N New Testament
wol Wolof Niger-Congo Atlantic-Congo Atlantic 4,000,000 Latin N New Testament
xho Xhosa Niger-Congo Atlantic-Congo Volta-Congo 7,800,000 Latin Y
dje Zarma Nilo-Saharan Songhai Southern 2,350,000 Latin Y
zul Zulu Niger-Congo Atlantic-Congo Volta-Congo 9,980,000 Latin N New Testament
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.