The goal of Statistical Language Modeling is to build a statistical language model that can estimate the distribution of natural language as accurately as possible. A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.

By expressing various language phenomena in terms of simple parameters in a statistical model, SLMs provide an easy way to deal with complex natural language on a computer.

The original (and still the most important) application of SLMs is speech recognition, but SLMs also play a vital role in other natural language applications as diverse as machine translation, part-of-speech tagging, intelligent input methods, and text-to-speech systems.

The N-gram model is the most widely used SLM today.

Without loss of generality, we can express the probability p(s) of a string $s = w_1 \cdots w_l$ as:

$$ p(s) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_l \mid w_1 \cdots w_{l-1}) = \prod_{i=1}^{l} p(w_i \mid w_1 \cdots w_{i-1}) $$

In bigram models, we make the approximation that the probability of a
word only depends on the identity of the immediately preceding word, hence we
can approximate p(s) as:

$$ p(s) \approx \prod_{i=1}^{l} p(w_i \mid w_{i-1}) $$

The parameters in a traditional N-gram model can be estimated with the Maximum Likelihood Estimation (MLE) technique:

$$ p(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} $$
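The MLE estimate above is simply a ratio of counts, which the following minimal Python sketch illustrates. The toy corpus, the function names, and the sentence-boundary markers `<s>` and `</s>` are assumptions made for this example; they are not taken from any particular toolkit.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return prob(word, prev), the MLE bigram estimate C(prev, word) / C(prev)."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Boundary markers are an assumption of this sketch.
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1  # times prev was seen as a context

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: MLE gives no estimate, use 0 here
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

def sentence_probability(prob, sentence):
    """Bigram approximation of p(s): product of p(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= prob(word, prev)
    return p

# Hypothetical toy corpus of tokenized sentences.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
prob = train_bigram_mle(corpus)

print(prob("cat", "the"))
print(sentence_probability(prob, ["the", "cat", "sat"]))
print(sentence_probability(prob, ["the", "dog", "ran"]))
```

In this toy corpus the bigram "dog ran" never occurs, so the last sentence receives probability zero even though it is perfectly plausible. This is the data sparseness problem that smoothing and the extended models try to address.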

N-gram models have been researched intensively since their invention, and several enhanced N-gram models have been proposed. Here are some typical extensions to the traditional N-gram model:

- Class-based N-gram model
- Grammatical Trigrams
- Sequence N-gram model

The class-based N-gram model was proposed to cope with the data
sparseness problem. Instead of dealing with individual words, a
class-based N-gram model estimates parameters for *word classes*. By
clustering words into classes, a class-based N-gram model can reduce
the model size significantly at the cost of slightly higher
perplexity.
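As a rough illustration, the sketch below uses one common class-based bigram decomposition, p(w_i | w_{i-1}) ≈ p(c_i | c_{i-1}) · p(w_i | c_i), where c_i is the class of w_i. This particular decomposition, the hand-written word-to-class map, and all names in the code are assumptions for the example; real systems usually induce the classes automatically by clustering.

```python
from collections import Counter

def train_class_bigram(sentences, word_to_class):
    """Return prob(word, prev_word) = p(class | prev_class) * p(word | class)."""
    class_bigram = Counter()   # C(c_{i-1}, c_i)
    class_context = Counter()  # times a class was seen as a context
    class_total = Counter()    # total occurrences of each class
    word_count = Counter()     # total occurrences of each word
    for sentence in sentences:
        classes = [word_to_class[w] for w in sentence]
        for w, c in zip(sentence, classes):
            word_count[w] += 1
            class_total[c] += 1
        for prev_c, c in zip(classes, classes[1:]):
            class_bigram[(prev_c, c)] += 1
            class_context[prev_c] += 1

    def prob(word, prev_word):
        c, prev_c = word_to_class[word], word_to_class[prev_word]
        if class_context[prev_c] == 0 or class_total[c] == 0:
            return 0.0
        p_class = class_bigram[(prev_c, c)] / class_context[prev_c]
        p_word = word_count[word] / class_total[c]
        return p_class * p_word

    return prob

# Hypothetical hand-made classes; a real model would cluster words.
word_to_class = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
                 "sat": "VERB", "ran": "VERB"}
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
class_prob = train_class_bigram(corpus, word_to_class)

# "dog ran" never occurs in the corpus, yet it gets a nonzero estimate
# because NOUN -> VERB transitions were observed.
print(class_prob("ran", "dog"))
```

With V words and K classes, a model of this shape needs on the order of K² + V parameters instead of V² for a word bigram model, which is where the reduction in model size comes from.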

Grammatical Trigrams are a new class of language models that
incorporate Link Grammar in a generative probabilistic model of the
form:

$$ \Pr(S, L) = \Pr(W_0, d_0) \prod \Pr(W, d, O \mid L, R, l, r) $$

where L = {(W, d, O, L, R, l, r)} is a set of links and d0 is an
initial disjunct. The MLE solution of the model can be estimated by a
variant of the EM algorithm.

The Sequence N-gram model is an attempt to extend N-gram models with
variable-length sequences. A sequence can be a sequence of words, word
classes, parts of speech, or any other sequence of *something* that the
modeler believes carries important grammatical information.

to be written

The Maximum Entropy (ME) model is an elegant and general statistical model that can freely incorporate features from different sources. A conditional ME model has the form:

$$ p(w \mid h) = \frac{1}{Z(h)} \exp\left[ \sum_i \lambda_i f_i(h, w) \right] $$

where the $\lambda_i$ are parameters, $Z(h)$ is a normalization factor, and the $f_i(h, w)$ are arbitrary functions of the word-history pair. An ME model that incorporates trigrams, distance N-grams, and trigger pairs was observed to reduce perplexity by more than 30% over a baseline trigram model (Rosenfeld, 1996). ME models can also be trained to incorporate topic information.
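To make the conditional form concrete, here is a small Python sketch that evaluates p(w|h) for a toy vocabulary. The two binary feature functions, their weights, and the vocabulary are hypothetical values chosen for illustration; in practice the weights would be learned from data.

```python
import math

def conditional_me_prob(word, history, vocab, features, lambdas):
    """Evaluate p(word | history) for a conditional ME model."""
    def unnormalized(w):
        # exp of the weighted sum of features for the pair (history, w)
        return math.exp(sum(lam * f(history, w)
                            for lam, f in zip(lambdas, features)))

    # Z(h): sum over the whole vocabulary for this particular history.
    z = sum(unnormalized(w) for w in vocab)
    return unnormalized(word) / z

# Hypothetical bigram-style feature: fires for "the" followed by "cat".
def f_bigram_the_cat(history, word):
    return 1.0 if history and history[-1] == "the" and word == "cat" else 0.0

# Hypothetical trigger-pair feature: "pet" anywhere in the history triggers "dog".
def f_trigger_pet_dog(history, word):
    return 1.0 if "pet" in history and word == "dog" else 0.0

vocab = ["cat", "dog", "sat"]
features = [f_bigram_the_cat, f_trigger_pet_dog]
lambdas = [math.log(2.0), math.log(3.0)]  # assumed weights, not trained
history = ["my", "pet", "likes", "the"]

for w in vocab:
    print(w, conditional_me_prob(w, history, vocab, features, lambdas))
```

Note that Z(h) must be recomputed for every distinct history by summing over the entire vocabulary, which hints at why such models are expensive with realistic vocabularies and corpora.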

The major drawback of conditional ME models is the huge amount of
computation involved, which makes it infeasible to handle large corpora.
To address this issue and to overcome the inherent disadvantage of the
*chain rule*, Rosenfeld and his group at CMU proposed a whole-sentence
exponential model:

$$ P(s) = \frac{1}{Z} P_0(s) \exp\left[ \sum_i \lambda_i f_i(s) \right] $$

Here $Z$ is a normalization constant and the $\lambda_i$ can be estimated via sampling.
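The role of sampling can be illustrated with the normalization constant itself: since Z is the sum over all sentences of P0(s) times the exponential term, it equals the expectation of that term under P0 and can be approximated by drawing sentences from P0. The sketch below does this for a deliberately tiny setting; the two-word vocabulary, the fixed sentence length, the uniform P0, the single feature, and its weight are all assumptions of the example, and this is not the training procedure of the original work.

```python
import itertools
import math
import random

VOCAB = ["a", "b"]   # hypothetical two-word vocabulary
SENT_LEN = 2         # every toy "sentence" has exactly two words

def p0(sentence):
    """Baseline model P0: words drawn independently and uniformly."""
    return (1.0 / len(VOCAB)) ** len(sentence)

def sample_p0(rng):
    """Draw one sentence from P0."""
    return tuple(rng.choice(VOCAB) for _ in range(SENT_LEN))

def f_repeat(sentence):
    """Hypothetical whole-sentence feature: both words are identical."""
    return 1.0 if sentence[0] == sentence[1] else 0.0

FEATURES = [f_repeat]
LAMBDAS = [math.log(2.0)]  # assumed weight, not trained

def weight(sentence):
    """The exponential term exp[sum_i lambda_i * f_i(s)]."""
    return math.exp(sum(lam * f(sentence)
                        for lam, f in zip(LAMBDAS, FEATURES)))

def exact_z():
    """Z by brute-force enumeration, feasible only for toy models."""
    return sum(p0(s) * weight(s)
               for s in itertools.product(VOCAB, repeat=SENT_LEN))

def estimate_z(num_samples, seed=0):
    """Monte Carlo estimate of Z as the mean weight of samples from P0."""
    rng = random.Random(seed)
    total = sum(weight(sample_p0(rng)) for _ in range(num_samples))
    return total / num_samples

def whole_sentence_prob(sentence, z):
    """P(s) = P0(s) * weight(s) / Z."""
    return p0(sentence) * weight(sentence) / z

print(exact_z())
print(estimate_z(20000))
print(whole_sentence_prob(("a", "a"), exact_z()))
```

Because features are defined on the whole sentence, no per-history normalization over the vocabulary is needed; the price is that sums over all sentences are intractable for real models, hence the reliance on sampling.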

- CMU-Cambridge Statistical Language Modeling toolkit (has not been updated for years)
- SRI Language Modeling Toolkit (contains up-to-date SLM techniques, well maintained)
- N-gram stat
- Trigger Toolkit
- My N-gram extraction tool

Some recommended papers on SLM techniques. Only papers that have an online electronic version are listed. (TODO: sort papers by category)

- Two Decades Of Statistical Language Modeling: Where Do We Go From Here?
- A Maximum Entropy Approach to Adaptive Statistical Language Modeling
- A Maximum Entropy Language Model Integrating N-Grams And Topic Dependencies For Conversational Speech Recognition
- A Structured Language Model
- Aggregate and mixed-order Markov models for statistical language processing
- Combining Nonlocal, Syntactic And N-Gram Dependencies In Language Modeling
- Exploiting Syntactic Structure for Language Modeling
- Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features
- Language Modeling By Variable Length Sequences : Theoretical Formulation And Evaluation Of Multigrams
- Structure And Performance Of A Dependency Language Model
- A Neural Probabilistic Language Model
- Factored Language Models and Generalized Parallel Backoff

To be written.