University of Edinburgh, School of Informatics, 10 Crichton Street, Informatics Forum, EH8 9AB, Edinburgh, United Kingdom, Office: Informatics Forum 3.51, Phone: +44 (0) 131 650 5193
Email: dnadir@ Domain: inf.ed.ac.uk
IMPORTANT: this web page will no longer be updated - please see my new web page here
My research interests include:
Natural Language Processing: statistical machine translation, pivoting and triangulation in MT, machine transliteration, word segmentation, and the study of different writing scripts and their complexities in terms of font development, normalization, and collation.
Localization: localization of open source operating systems (the KDE/GNOME desktop environments), word processing applications (OpenOffice), chat tools (Psi), web browsers (SeaMonkey, Mozilla) and web development tools (NVU), and localization of domain names
My PhD dissertation was selected as the best language technology/computational linguistics dissertation of the years 2011-2013 by the German Society for Computational Linguistics. The award was presented at KONVENS 2014, where I was invited to give a short presentation.
What do I do?
Transliteration has been shown to be quite useful in MT for translating OOV words, for disambiguation, and for translating closely related languages. Despite this utility, however, a transliteration module does not exist in commonly used MT toolkits such as Moses. The main bottleneck is that the training corpus (transliteration pairs) is not readily available. Secondly, even when such data is available, these toolkits provide no mechanism to integrate transliterated words into the MT pipeline. A supervised transliteration system is generally trained separately, outside the MT pipeline, and a naive approach, replacing OOV words with their 1-best transliterations in a pre/post-processing step of decoding, is commonly used. We have integrated a transliteration module into Moses; it is fully unsupervised and language independent. It extracts a transliteration corpus from the parallel data and builds a transliteration model from it, which is then used to translate OOV words. We provide two ways to integrate transliterations into the decoder. Read more here. Add the following to the config file and you are done.
transliteration-module = "yes"
post-decoding-transliteration = "yes"
language-model-file = /path to language model file/
This will enable Method 2 (as defined in the paper).
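The unsupervised extraction step described above can be illustrated with a toy sketch. This is not the actual miner in Moses, just a simple similarity heuristic under the assumption that transliteration pairs look alike character by character; the character map and word pairs are made up for illustration:

```python
# Toy transliteration-pair mining from word-aligned parallel data.
# NOT the unsupervised miner in Moses -- just a heuristic to show the
# idea: an aligned word pair whose characters correspond closely is
# likely a transliteration pair and can be kept as training data.

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]

def mine_pairs(aligned_pairs, romanize, threshold=0.5):
    """Keep aligned word pairs whose romanized forms are close."""
    mined = []
    for src, tgt in aligned_pairs:
        r = romanize(tgt)
        dist = edit_distance(src.lower(), r)
        if dist / max(len(src), len(r), 1) <= threshold:
            mined.append((src, tgt))
    return mined

# Hypothetical character map standing in for a real romanizer.
CHAR_MAP = {"л": "l", "и": "i", "т": "t", "о": "o", "н": "n", "д": "d"}
romanize = lambda w: "".join(CHAR_MAP.get(c, c) for c in w)

pairs = [("london", "лондон"), ("little", "литл"), ("city", "город")]
print(mine_pairs(pairs, romanize))  # keeps only the transliterated pairs
```

The real module learns this jointly with the transliteration model rather than relying on a hand-built character map, but the filtering intuition is the same.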
To use the in-decoding method (Method 3 as described in the paper) add the following lines
in-decoding-transliteration = "yes"
transliteration-file = /file containing list of words to be transliterated/
The post-decoding method obtains the list of OOV words automatically by running the decoder. The in-decoding method requires the user to provide the list of words to be transliterated. This gives the freedom to also transliterate additional words that are known to the translation model but should be transliterated in some scenarios. For example, "Little" is translated to "چھوٹی" in Urdu when used as an adjective, but transliterated to "لٹل" when it is a name, as in "Stuart Little". If you do not want to add any other words, you can simply use the OOV list as obtained from Method 2.
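For instance, a minimal transliteration-file could look like this (hypothetical entries, assuming one surface word per line):

```
stuart
little
edinburgh
```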
Markov Based Translation Models
Another line of my work focuses on exploring Markov-based translation models within other SMT frameworks. We have a pair (1, 2) of papers showing how Markov-based translation models over minimal translation units (MTUs), combined with the phrase-based paradigm, can significantly improve translation quality. By integrating the Operation Sequence Model (OSM) into the phrase-based decoder, we significantly improved a submission-quality baseline across 8 European language pairs. Read more here. And while you are at it, also read Hui Zhang's paper at NAACL-13, which explores multiple decomposition orders when learning Markov models over MTUs.
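As a toy illustration of the underlying idea (not the Moses implementation), a Markov model over MTUs treats translation as a sequence of operations and scores it like an n-gram language model. The operation names below are invented for the example:

```python
# Toy bigram Markov model over "operation" sequences, illustrating the
# OSM idea: translation is a sequence of operations (generate a word
# pair, jump, ...) scored with smoothed n-gram probabilities.
import math
from collections import Counter

def train_bigram(sequences):
    """Count unigrams and bigrams over operation sequences."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        ops = ["<s>"] + seq
        uni.update(ops)
        bi.update(zip(ops, ops[1:]))
    return uni, bi

def log_prob(seq, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed log probability of an operation sequence."""
    ops = ["<s>"] + seq
    lp = 0.0
    for prev, cur in zip(ops, ops[1:]):
        lp += math.log((bi[(prev, cur)] + alpha) /
                       (uni[prev] + alpha * vocab_size))
    return lp

corpus = [
    ["GEN(little,لٹل)", "GEN(stuart,سٹورٹ)"],
    ["GEN(little,چھوٹی)", "JUMP_BACK", "GEN(house,گھر)"],
]
uni, bi = train_bigram(corpus)
V = len(uni)
seen = log_prob(["GEN(little,لٹل)", "GEN(stuart,سٹورٹ)"], uni, bi, V)
novel = log_prob(["JUMP_BACK", "JUMP_BACK"], uni, bi, V)
print(seen > novel)  # the attested operation sequence scores higher
```

The real OSM is a higher-order model with Kneser-Ney style smoothing and interleaves reordering operations with lexical generation; this sketch only shows the chaining of operation n-gram probabilities.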
The OSM model has been integrated into Moses and can be used by downloading the latest version. Just put the following in the config file:
operation-sequence-model = "yes"
operation-sequence-model-order = 5
operation-sequence-model-settings = ""
Tips – if you have time for extra experiments:
1) Using a higher stack size sometimes helps the OSM model; keep increasing the beam.
2) A higher order such as 9 sometimes helps, but try 5 first.
3) Also try a higher distortion limit with OSM; I got some gains in my DE-EN experiments.
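A sketch of how these tips might look in an EMS config. The decoder flags for stack size and distortion limit below follow the standard Moses decoder options; check them against your Moses version:

```
# higher OSM order (try 5 first)
operation-sequence-model-order = 9
# larger beam and distortion limit, passed through to the decoder
decoder-settings = "-stack 500 -distortion-limit 12"
```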
I like to participate in machine translation competitions. Click here to see how we did in the automatic and human evaluations of WMT-13, IWSLT-13 and WMT-14.
OSM was acknowledged as one of the prominent approaches that led to actual improvements in systems in the evaluation campaign. I was invited to briefly present the model at the Ninth Workshop on Statistical Machine Translation. Here are my slides, along with those for the other models acknowledged in the same capacity.
Hinrich Schütze presented the OSM model in his invited talk at MT Summit 2013.
Class Based and Generalized OSM Models
Automatically clustering the training data into word classes in order to obtain smoother distributions and better generalizations is a widely known and applied technique in natural language processing. Class-based models have been shown to be useful when translating into morphologically rich languages. We use
I pursued my doctoral studies at the
I did an internship with Yaser Al-Onaizan and Abe Ittycheriah
Before coming to
I worked for the PAN Localization Project for 3 years. The overall objective of this project was to examine effective means of developing digital literacy through the use of local language computing. We localized commonly used desktop utilities such as word processors, browsers and chat tools into Urdu and 7 other languages. I also worked for STEA (Science, Technology and Environment Agency) in
I sometimes play chess here with an id called durrani.