Nadir Durrani


 

University of Edinburgh, School of Informatics, 10 Crichton Street, Informatics Forum, EH8 9AB, Edinburgh, United Kingdom, Office: Informatics Forum 3.51, Phone: +44 (0) 131 650 5193

Email: dnadir@inf.ed.ac.uk


(Photo: Interlaken, 2012)

 

 

 

Research Associate

School of Informatics

University of Edinburgh

I was a Research Associate at the Institute for Language, Cognition and Computation at the University of Edinburgh, where I worked with Philipp Koehn. I worked on different aspects of statistical machine translation, such as integrating an unsupervised transliteration module into SMT, and on incorporating Markov-based translation models into phrasal MT.

IMPORTANT: This web page will no longer be updated. Please see my new web page here.

 

My research interests include:

 

Natural Language Processing: statistical machine translation, pivoting and triangulation in MT, machine transliteration, word segmentation, and the study of different writing scripts and their complexities in terms of font development, normalization, and language collation.

 

Localization: localization of open-source desktop environments (KDE/GNOME), word-processing applications (OpenOffice), chat clients (Psi), web browsers (SeaMonkey, Mozilla), and web development tools (Nvu); localization of domain names.

 

My PhD dissertation was selected as the best language technology/computational linguistics dissertation for the years 2011-2013 by the German Society for Computational Linguistics. The award was presented at KONVENS 2014, where I was invited to give a short presentation.

 

 

Publications

 

What I do

 

Transliteration Modeling

 

Transliteration has been shown to be quite useful in MT for translating OOV words, for disambiguation, and for translating closely related languages. Despite this utility, a transliteration module does not exist in commonly used MT toolkits such as Moses. The main bottleneck is that training corpora (transliteration pairs) are not readily available. Secondly, even where such data are available, these toolkits lack mechanisms to integrate transliterated words into the MT pipeline. Typically, a supervised transliteration system is trained separately outside the MT pipeline, and a naive approach is used: OOV words are replaced with their 1-best transliterations in a pre- or post-processing step around decoding. We have integrated a transliteration module into Moses that is fully unsupervised and language independent. It mines a transliteration corpus from the parallel data and builds a transliteration model from it, which is then used to translate OOV words. We provide two ways to integrate transliterations into the decoder. Read more here. Add the following to the config file and you are done:

 
transliteration-module = "yes"
post-decoding-transliteration = "yes"
language-model-file = /path to language model file/
 

This will enable Method 2 (as defined in the paper).

 

To use the in-decoding method (Method 3, as described in the paper), add the following lines:

 

in-decoding-transliteration = "yes"

transliteration-file = /file containing list of words to be transliterated/

 

The post-decoding method obtains the list of OOV words automatically by running the decoder. The in-decoding method requires the user to provide the list of words to be transliterated. This gives the freedom to transliterate additional words that are known to the translation model but should nevertheless be transliterated in some scenarios. For example, "Little" can be translated to "چھوٹی" in Urdu when it is used as an adjective, but should be transliterated to "لٹل" when it is a name, as in "Stuart Little". If you do not want to add any other words, you can simply use the OOV list as obtained from Method 2.
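
For intuition, here is a toy Python sketch of what a post-decoding transliteration step conceptually does: replace each OOV token in the decoder output with its 1-best transliteration. This is not the Moses implementation (which also scores transliteration candidates with the language model); the transliterate_1best table is a hypothetical stand-in for the learned transliteration model.

# Toy sketch of post-decoding transliteration (in the spirit of Method 2),
# not the actual Moses code. Assumes OOV source words are passed through
# by the decoder and that we know which tokens they are.

# Hypothetical 1-best transliteration table mined from parallel data.
transliterate_1best = {
    "سٹورٹ": "Stuart",
    "لٹل": "Little",
}

def post_decoding_transliterate(output_tokens, oov_words):
    """Replace each OOV token with its 1-best transliteration, if any."""
    result = []
    for token in output_tokens:
        if token in oov_words and token in transliterate_1best:
            result.append(transliterate_1best[token])
        else:
            result.append(token)
    return result

# Usage: an Urdu-to-English hypothesis where two names were left untranslated.
hypothesis = ["سٹورٹ", "لٹل", "is", "a", "mouse"]
print(post_decoding_transliterate(hypothesis, oov_words={"سٹورٹ", "لٹل"}))
# ['Stuart', 'Little', 'is', 'a', 'mouse']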

 

Markov Based Translation Models

 

Another line of my work explores Markov-based translation models within other SMT frameworks. We have a pair (1, 2) of papers showing how Markov-based translation models over minimal translation units (MTUs), combined with the phrase-based paradigm, can significantly improve translation quality. By integrating the Operation Sequence Model (OSM) into the phrase-based decoder, we significantly improved a submission-quality baseline across 8 European language pairs. Read more here. And while you are at it, also read Hui Zhang's paper at NAACL-13, which explores multiple decomposition orders when learning Markov models over MTUs.
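
To give a feel for the modeling idea, here is a heavily simplified Python sketch: instead of an n-gram model over words, the OSM estimates an n-gram model over a sequence of translation operations (generate an MTU, insert a gap, jump back, and so on). The operation names, the toy 3-gram order, and the unsmoothed count estimates below are illustrative assumptions; the actual model uses a richer operation inventory and a smoothed higher-order model.

from collections import defaultdict

# Toy 3-gram model over operation sequences; only meant to show the structure.
ORDER = 3
BOS = "<s>"

def train(op_sequences):
    ngram, context = defaultdict(int), defaultdict(int)
    for ops in op_sequences:
        padded = [BOS] * (ORDER - 1) + list(ops)
        for i in range(ORDER - 1, len(padded)):
            hist = tuple(padded[i - ORDER + 1:i])
            ngram[hist + (padded[i],)] += 1
            context[hist] += 1
    return ngram, context

def sequence_prob(ops, ngram, context):
    prob = 1.0
    padded = [BOS] * (ORDER - 1) + list(ops)
    for i in range(ORDER - 1, len(padded)):
        hist = tuple(padded[i - ORDER + 1:i])
        total = context.get(hist, 0)
        prob *= ngram.get(hist + (padded[i],), 0) / total if total else 0.0
    return prob

# Hand-written operations for German "hat das Buch gelesen" -> English
# "has read the book": MTUs are generated in target order, with a gap
# opened over "das Buch" and a jump back to fill it.
ops = ["GEN(hat->has)", "INSERT_GAP", "GEN(gelesen->read)",
       "JUMP_BACK(1)", "GEN(das_Buch->the_book)"]
ngram, context = train([ops])
print(sequence_prob(ops, ngram, context))  # 1.0 on its own training data

The point is that reordering decisions (gaps and jumps) and lexical choices condition on each other within a single Markov chain.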

 

The OSM model has been integrated into Moses and can be used by downloading the latest version. Just put the following in the config file:

 

operation-sequence-model = "yes"

operation-sequence-model-order = 5

operation-sequence-model-settings = ""

 

 

Tips (if you have time for extra experiments):

 

1) Using a higher stack size sometimes helps the OSM model; keep increasing the beam.

2) Using a higher order such as 9 sometimes helps, but try 5 first.

3) Also try using a higher distortion limit with OSM; I got some gains in my DE-EN experiments. A config sketch combining these settings is given below.
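
Putting these tips together, here is a hedged config sketch in the same style as the snippets above. The decoder-settings values are just hypothetical starting points I would try, not recommendations from the paper; in Moses, -s sets the decoder stack size and -dl the distortion limit.

operation-sequence-model = "yes"
# a higher order sometimes helps; try 5 first
operation-sequence-model-order = 9
# hypothetical starting points: larger beam and distortion limit
decoder-settings = "-s 1000 -dl 12"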

 

I like to participate in machine translation competitions. Click here to see how we did in the automatic and human evaluations at WMT-13, IWSLT-13, and WMT-14.

 

OSM was acknowledged as one of the prominent approaches that led to actual improvements in systems in the evaluation campaign. I was invited to briefly present the model at the Ninth Workshop on Statistical Machine Translation. Here are my slides, which also cover the other models acknowledged in the same capacity.

 

Hinrich Schütze presented the OSM model in his invited talk at MT Summit 2013.

 

Class Based and Generalized OSM Models

 

Automatically clustering the training data into word classes in order to obtain smoother distributions and better generalization is a well-known and widely applied technique in natural language processing. Class-based models have been shown to be useful when translating into morphologically rich languages. We use GIZA++'s "mkcls" utility to cluster the data into classes. POS and morphological tags have also been found useful in the OSM model. Find details in this paper. Instructions on how to enable generalized representations in the OSM model are given on this page:

 

http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc4
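
As a toy illustration of why classes help, here is a Python sketch that maps words to automatically learned classes before estimating a model over the class sequences. The two-column word/class file format is an assumption based on typical mkcls output, and the example class IDs are made up.

# Toy sketch: map words to automatically learned classes so that a model
# estimated over class sequences generalizes across unseen word combinations.
# Assumes a whitespace-separated "word class" file, as mkcls typically emits.

def load_classes(path):
    word2class = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                word, cls = parts
                word2class[word] = cls
    return word2class

def to_class_sequence(tokens, word2class, unk="UNK"):
    return [word2class.get(tok, unk) for tok in tokens]

# Usage: two sentences that differ at the word level can share the same
# class sequence, which is where the smoother distributions come from.
word2class = {"he": "17", "she": "17", "reads": "4", "writes": "4", "books": "9"}
print(to_class_sequence("he reads books".split(), word2class))   # ['17', '4', '9']
print(to_class_sequence("she writes books".split(), word2class)) # ['17', '4', '9']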

 

History

 

I pursued my doctoral studies at the Institute for Natural Language Processing (IfNLP), University of Stuttgart, and graduated in November 2012. I was advised by Alex Fraser, Helmut Schmid, and Hinrich Schütze. We proposed a novel model for statistical machine translation, the operation sequence model. You can grab a copy of my doctoral thesis here, or you can read a shorter version.

 

I did an internship with Yaser Al-Onaizan and Abe Ittycheriah at the IBM T.J. Watson Research Center, where I worked on front-end processing of Egyptian blogs. The overall goal was to improve an Egyptian-to-English machine translation system by mapping Egyptian Arabic words to MSA words. Here's more on this story.

 

At Stuttgart, I also worked on integrating transliteration into the SMT framework. Transliteration has much more to offer than translating OOVs and NEs when it comes to closely related language pairs such as Hindi-Urdu and Thai-Lao, which are written differently but spoken with a huge vocabulary overlap. Here's more on this story.

 

 

Pre-Stuttgart

 

Before coming to Stuttgart, I was at the Center for Research in Urdu Language Processing (CRULP), now known as the Center for Language Engineering. I did my Master's and Bachelor's at the National University of Computer and Emerging Sciences. My thesis, on Urdu word segmentation, was supervised by Prof. Sarmad Hussain. Here's a copy of my Master's thesis.

 

I worked for the PAN Localization Project for 3 years. The overall objective of this project was to examine effective means of developing digital literacy through the use of local-language computing. We localized commonly used desktop utilities such as word processors, browsers, and chat tools into Urdu and 7 other languages of Asia.

 

I worked for STEA (Science, Technology and Environment Agency) in Laos for 5 months on basic language processing tasks such as line breaking, word collation, and spell checking. Find more on this in my publication list.

 

Here's a periodically updated CV and a list of my publications, along with slides and posters. I also have a Google Citations page.

 

I sometimes play chess here under the id durrani.