Combilex

Combilex speech technology lexicon

Combilex is an advanced multi-accent lexicon for English, developed specifically for speech technology purposes - primarily text-to-speech synthesis (TTS) and automatic speech recognition (ASR), though the information Combilex provides has also been employed in many other natural language processing (NLP) tasks.

Combilex was developed jointly by Sue Fitt and me, with "Proof-of-Concept" commercialisation funding from Scottish Enterprise. Combilex inherited some of its aims from the older Unisyn lexicon, and in simple terms it could be thought of as the next generation of the Unisyn technology. In fact, however, Combilex was designed and developed entirely from scratch; it goes further in a number of respects, and the implementation details are very different. Here, I'll mention some of the most important features and advantages of Combilex.

Multiple accents of English

Like Unisyn, Combilex is an accent-independent lexicon. Pronunciations are entered as a "base form" transcription, using metaphone symbols which are a superset of the phones found in multiple accents of English. A transcription in the base form lexicon is a model of how that word will be pronounced in all accents of English simultaneously.

For most (though not all...) speech technology purposes, a "surface form" lexicon is required which is tailored to a specific regional accent, or even to the accent of an individual speaker (e.g. for a high-quality TTS voice). Therefore, a set of rules is used to map from the base form transcriptions to an arbitrary surface form lexicon. In Unisyn, this was implemented using a sequence of Perl regular expression text manipulations. In Combilex, this is implemented using finite state transducers (FSTs), compiled from human-devised context-dependent rewrite rules. There are multiple technical advantages to using FSTs. To give just one example, rules can be optional, which gives a lot of flexibility and control in forced-alignment labelling.

So, to give an example, the Combilex baseform transcription for "bath" is:

baseform: { b " A#1_a T_th }

From this, we can generate pronunciations for different accents automatically, for example:

British RP: { b " A_a T_th }
Gen. American: { b " a T_th }
West Country: { b " a f_th }
etc.

Currently there are five general regional accents available: Received Pronunciation (RP, or "standard" British English), General American, Edinburgh, Cardiff Welsh and Leeds Northern English. An Indian-accented version is also just about to be developed. In principle, though, it is possible to customise the lexicon to suit any accent.
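As a rough illustration of the base-form-to-surface-form step, the sketch below reproduces the three "bath" outputs above with a plain per-symbol lookup. It is not how Combilex does it: the real system uses FSTs compiled from context-dependent rewrite rules, whereas this is a context-free substitution. The rule tables (ACCENT_RULES) and the function name to_surface are invented here and inferred only from the "bath" example.

```python
# Hypothetical per-accent substitutions, inferred solely from the "bath"
# example; real Combilex rules are context-dependent and compiled to FSTs.
ACCENT_RULES = {
    "RP": {"A#1": "A"},
    "GenAm": {"A#1": "a"},
    "WestCountry": {"A#1": "a", "T": "f"},
}


def to_surface(baseform: str, accent: str) -> str:
    """Map a base-form transcription to a surface form for one accent."""
    rules = ACCENT_RULES[accent]
    out = []
    for token in baseform.split():
        # "T_th" links metaphone "T" to grapheme "th"; a bare token such as
        # "b" stands for both the symbol and the identical grapheme.
        phone, sep, graph = token.partition("_")
        if not sep:
            graph = token
        phone = rules.get(phone, phone)
        # Notational economy: write only one side when both are the same.
        out.append(phone if phone == graph else f"{phone}_{graph}")
    return " ".join(out)


BATH = '{ b " A#1_a T_th }'
for accent in ACCENT_RULES:
    print(accent, to_surface(BATH, accent))
```

Structural symbols (braces, the stress marks) pass through unchanged because no rule mentions them. An FST formulation would additionally allow context-sensitive and optional rules, which a lookup table like this cannot express.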

Morphological derivation

The aim in Combilex is to hand-code only that which is not predictable, and have everything else generated automatically. So, for example, the pronunciation for "run" was entered by hand:

run: { r " V_u n }

But, once we have that word, the rules of morphological derivation of English make it possible to predict a large number of related words automatically:

runs: { r " V_u n } z_s >
running: { r " V_u n_nn }. I_i N_ng >
rerun: < r " i_e .{ r % V_u n }
etc.

This brings at least three advantages:

it makes adding words to the lexicon far faster, as we only enter a transcription for "run" and then specify other words as derived from it, such as "{run}s>", "{run}ing>", "<re{run}", "<re{run}s>" etc.
it reduces scope for error and inconsistency, because the human only enters the bare minimum of data
it is easier to maintain the lexicon and make changes that retain consistency, since related words are explicitly linked.

To put some numbers on these advantages: from around 22k "core" pronunciations which were hand-transcribed, Combilex contains over 100k additional entries that have been derived automatically. In other words, adding one fully-specified core word is, on average, a bit like adding approximately 6 words at once with no extra transcription effort. (Incidentally, Combilex currently contains about 145k words in total.)
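To make the derivation specifications quoted above ("{run}s>", "<re{run}s>" and so on) a little more concrete, here is a minimal sketch of how such strings might be parsed and grouped by root. The grammar is an assumption read off those four examples (prefixes written "<xx", the root in braces, suffixes written "xx>"); the names parse_derivation and group_by_root are hypothetical, and the phonological side of derivation (stress shifts, consonant doubling as in "running") is not attempted.

```python
import re
from collections import defaultdict

# Assumed grammar: zero or more "<prefix", then "{root}", then zero or
# more "suffix>". Lower-case letters only, for simplicity.
SPEC = re.compile(r"^((?:<[a-z]+)*)\{([a-z]+)\}((?:[a-z]+>)*)$")


def parse_derivation(spec):
    """Split a derivation spec into (prefixes, root, suffixes)."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a derivation spec: {spec!r}")
    prefixes = re.findall(r"<([a-z]+)", match.group(1))
    suffixes = re.findall(r"([a-z]+)>", match.group(3))
    return prefixes, match.group(2), suffixes


def group_by_root(specs):
    """Index specs by their root, so related words stay explicitly linked."""
    index = defaultdict(list)
    for spec in specs:
        _, root, _ = parse_derivation(spec)
        index[root].append(spec)
    return dict(index)


print(group_by_root(["{run}s>", "{run}ing>", "<re{run}", "<re{run}s>"]))
```

Keeping the root as the key is what makes maintenance cheap in this picture: a correction to the hand-entered transcription for "run" would reach every spec listed under it.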

Explicit grapheme-phoneme linking

Combilex has explicit grapheme-phoneme correspondences for all pronunciations. Looking at the transcription for "bath" above, the links are indicated by the underscores, with the metaphones on the left and the associated graphemes on the right (e.g. "T_th" means a voiceless dental fricative which realises the written letters "th"). Wherever the metaphone and orthographic strings are the same, only one of them is included, as a notational economy (e.g. "b" means the voiced bilabial oral stop which manifests the letter "b" in the spelling). These explicit correspondences have four main uses:

training and/or testing letter-to-sound rules, without guessing the orthographic-phonemic alignment
the orthography can be used as context in the rules to generate a surface form lexicon
the orthography can be used as additional context features in statistical parametric synthesis
the orthography is indispensable for many of the morphological derivation rules mentioned above.
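The alignment notation can be read mechanically. The sketch below, which assumes only the conventions visible in the examples on this page (underscore links, bare symbols standing for an identical grapheme, stress marks, and the structural characters { } < > .), extracts the metaphone-grapheme pairs from a transcription and recovers the spelling from them. The function names are made up for illustration.

```python
STRUCTURAL = "{}<>."      # morpheme and syllable boundary characters
STRESS = {'"', "%"}       # primary and secondary stress marks


def aligned_pairs(transcription):
    """Return [(metaphone, grapheme), ...] for one transcription string."""
    pairs = []
    for token in transcription.split():
        core = token.strip(STRUCTURAL)
        if not core or core in STRESS:
            continue  # boundary or stress token, carries no segment
        phone, sep, graph = core.partition("_")
        # A bare symbol such as "b" is shorthand for "b_b".
        pairs.append((phone, graph if sep else phone))
    return pairs


def spelling(transcription):
    """Rebuild the orthography by concatenating the grapheme sides."""
    return "".join(graph for _, graph in aligned_pairs(transcription))


print(aligned_pairs('{ b " A#1_a T_th }'))
print(spelling('{ r " V_u n_nn }. I_i N_ng >'))
```

Pairs in this form are the kind of input a letter-to-sound trainer needs, with no separate alignment step.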

Rich lexical information

In addition to the phone sequence, Combilex lexical entries give rich additional information, including:

primary and secondary stress (denoted by " and % respectively)
syllabification (respecting word morphology and native speaker intuition, rather than based on a simple rule)
morphological structure (incl. morpheme boundaries in the phone string, as well as the list of constituent morphemes)
part of speech tags
free variant and headword ordering (e.g. multiple pronunciations of "clematis" ordered by frequency)
source domain and gender tags for names
source language tags for loanwords
geographical locations for pronunciations which vary by place (e.g. some town names in UK compared with the same name for a town in the US)
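One way to picture an entry carrying all of this information is as a small record type. The sketch below is purely illustrative: the class and field names are hypothetical and do not reflect the actual Combilex schema, and the convention that rank 1 means the most frequent variant is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pronunciation:
    transcription: str              # includes stress, syllable and morpheme marks
    variant_rank: int = 1           # assumed: 1 = most frequent free variant
    location: Optional[str] = None  # for place-dependent pronunciations


@dataclass
class LexicalEntry:
    headword: str
    pos: str                                # part of speech tag
    morphemes: List[str]                    # constituent morphemes
    pronunciations: List[Pronunciation] = field(default_factory=list)
    source_language: Optional[str] = None   # for loanwords
    name_domain: Optional[str] = None       # for names
    name_gender: Optional[str] = None       # for names

    def preferred(self) -> Pronunciation:
        """Return the highest-ranked (lowest rank number) variant."""
        return min(self.pronunciations, key=lambda p: p.variant_rank)


entry = LexicalEntry(
    headword="run",
    pos="verb",  # placeholder tag, not the real tag set
    morphemes=["run"],
    pronunciations=[Pronunciation('{ r " V_u n }')],
)
print(entry.preferred().transcription)
```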

Data integrity and consistency

Combilex pronunciations have a high level of consistency. Some of this stems from the design philosophy mentioned above, where we minimise the data entered and maintained by a human, and all that is predictable is generated automatically. There are several other factors, though, which help boost the consistency and reliability of Combilex entries. For example, unlike other lexicons such as CMUDict, which is composed of pronunciations from multiple sources, Combilex baseform transcriptions have been designed and managed by only two people, working in close cooperation. In addition, the Combilex system has been created as an SQL database, and a system of phonotactic constraints and automatic consistency-checking rules is applied before any pronunciation is added. This all helps guard against the introduction of mistakes and inconsistency due to human error.
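The idea of checking a pronunciation before it reaches the database can be sketched as follows. Everything specific here is an assumption made for illustration: the tiny symbol inventory, the three checks (known symbols, exactly one primary stress, balanced morpheme braces), the table layout and the use of SQLite are stand-ins, not the actual Combilex constraints or schema.

```python
import sqlite3

# Hypothetical, deliberately tiny symbol inventory for the demo.
INVENTORY = {"b", "r", "n", "A#1", "V", "T", "z", "I", "N", "i"}
STRUCTURAL = "{}<>."


def check(transcription):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    tokens = [t.strip(STRUCTURAL) for t in transcription.split()]
    tokens = [t for t in tokens if t]
    if tokens.count('"') != 1:
        problems.append("expected exactly one primary stress")
    for token in tokens:
        if token in ('"', "%"):
            continue
        phone = token.partition("_")[0]
        if phone not in INVENTORY:
            problems.append(f"unknown symbol: {phone}")
    if transcription.count("{") != transcription.count("}"):
        problems.append("unbalanced morpheme braces")
    return problems


def add_entry(conn, word, transcription):
    """Insert only if every check passes; otherwise raise ValueError."""
    problems = check(transcription)
    if problems:
        raise ValueError(f"{word}: " + "; ".join(problems))
    conn.execute(
        "INSERT INTO baseform (word, transcription) VALUES (?, ?)",
        (word, transcription),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baseform (word TEXT NOT NULL, transcription TEXT NOT NULL)")
add_entry(conn, "bath", '{ b " A#1_a T_th }')
```

Rejecting the entry at insertion time, rather than auditing the lexicon afterwards, is what keeps errors from propagating into automatically derived words.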

I could go on at greater length (I'm quite proud of many aspects of Combilex in fact!), but those are likely to be the main features of interest. Because Combilex was funded as a commercialisation project, it is only available for commercial licensing. This is unlike Unisyn, which is only available for non-commercial use, since it is jointly owned by Nuance. Customers and users include Google, Samsung, Toshiba, NICT, VoiceBase etc., and the Universities of Cambridge, Sheffield, IDIAP, Alberta, Ghent etc.

For more details, there are a few publications available on my lexicography publications page.