Intelligent Chinese Input Method

Unlike Western languages, which have small alphabets, Chinese is written with several thousand characters (Han Zi). To enter Chinese text with a standard keyboard originally designed for English, a coding scheme that maps sequences of Latin characters to Chinese text is required. An input method is a piece of software that sits between the normal user input interface and the low-level system I/O routines, transparently mapping key sequences to Chinese text.

There are several widely used coding schemes for entering Chinese text. The Pin Yin method is the most popular among Mandarin speakers. It maps a phonetic (Pin Yin) code to one or more Chinese characters. For instance, the following sample illustrates entering a phrase with the Pin Yin method:

                              Pin Yin Decoding
zhong hua ren min gong he guo ----------------> The People's Republic of China
    

One major deficiency of the Pin Yin method is translation ambiguity: one phonetic code can map to as many as 100 different Chinese characters. This is not surprising, since the coding scheme has to represent thousands of Chinese characters with only 417 distinct phonetic codes. Hence the user must select the intended character while typing, which greatly slows down input. A good input method should resolve translation ambiguity in some intelligent way and minimize the need for human intervention.
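To make the ambiguity concrete, here is a minimal Python sketch. The table `CANDIDATES` is a tiny hand-picked sample assumed for illustration (a real dictionary lists far more characters per syllable), and `count_readings` is a hypothetical helper that counts how many character strings a syllable sequence could denote if nothing disambiguated it.

```python
# Toy candidate table: each Pin Yin syllable maps to several characters.
# The entries are an illustrative sample, not a real dictionary.
CANDIDATES = {
    "zhong": ["中", "种", "重", "钟", "终"],
    "hua": ["华", "花", "化", "话", "画"],
    "ren": ["人", "任", "认", "仁"],
    "min": ["民", "敏", "皿"],
    "gong": ["共", "工", "公", "功"],
    "he": ["和", "合", "河", "喝"],
    "guo": ["国", "过", "果"],
}


def count_readings(syllables):
    """Return how many character strings the syllable sequence could denote."""
    total = 1
    for syllable in syllables:
        total *= len(CANDIDATES[syllable])
    return total


if __name__ == "__main__":
    phrase = "zhong hua ren min gong he guo".split()
    print(count_readings(phrase))
```

Even with this small table, the seven syllables of the sample phrase admit thousands of candidate strings, which is why choosing characters one by one is slow.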

Methodology

A good Chinese input method should have the following property:

Typically, the process of phonetic-to-Chinese conversion consists of two stages:

The first problem, segmentation, can be solved using standard Chinese segmentation methods. The second problem can be attacked with a Statistical Language Model (SLM). The system works much the same way as the lattice scoring component of a speech recognition system: a character lattice is built from the input phonetic sequence, and then an SLM is employed to choose the best sentence that can be generated from the lattice.
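The lattice search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a bigram model for brevity, and the candidate table, the probabilities in `BIGRAM`, the `UNSEEN` floor, and the function names are all invented for the example. The input is assumed to be already segmented into syllables.

```python
import math

# Hypothetical candidate characters per syllable (one lattice column each).
CANDIDATES = {
    "zhong": ["中", "种", "重"],
    "hua": ["华", "花", "话"],
    "ren": ["人", "任", "认"],
    "min": ["民", "敏"],
}

# Hypothetical bigram probabilities P(cur | prev); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "中"): 0.5, ("<s>", "种"): 0.2,
    ("中", "华"): 0.6, ("种", "花"): 0.7,
    ("华", "人"): 0.5, ("花", "人"): 0.1,
    ("人", "民"): 0.8,
}
UNSEEN = 1e-4  # assumed floor probability for pairs missing from BIGRAM


def logp(prev, cur):
    """Log-probability of cur following prev under the toy bigram model."""
    return math.log(BIGRAM.get((prev, cur), UNSEEN))


def decode(syllables):
    """Viterbi search: return the highest-scoring character string."""
    # best maps the last character of a partial path to (score, path).
    best = {"<s>": (0.0, [])}
    for syllable in syllables:
        column = {}
        for cur in CANDIDATES[syllable]:
            # Keep only the best-scoring way of reaching cur.
            score, path = max(
                ((s + logp(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            column[cur] = (score, path + [cur])
        best = column
    _, path = max(best.values(), key=lambda item: item[0])
    return "".join(path)


if __name__ == "__main__":
    print(decode("zhong hua ren min".split()))
```

Because each lattice column keeps only the best path ending in each character, the work grows linearly with sentence length instead of with the number of possible sentences.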

Currently, a trigram SLM is used to select the best path through the lattice. The adaptive part of the system is implemented with a Memory-based Learner that aims to adjust the model's parameters online according to the user's preferences. Both Pin Yin and Wu Bi are supported in whole-sentence input mode. The input method conforms to the XIM protocol and works as a standalone XIM server under the X Window System (Linux and FreeBSD). This software is still at an early stage and no code is available yet. However, you can view some fancy screenshots:
graphics/wubi.png

Wu Bi input method in action
graphics/pinyin.png

Pin Yin input method in action
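The text does not say how the Memory-based Learner adjusts the trigram model, so the following Python sketch shows just one simple possibility, not the method used in this project: a cache of the user's confirmed sentences whose relative frequencies are interpolated with the base trigram probabilities. The class name `AdaptiveTrigram`, the `weight` parameter, and the floor value `1e-6` are assumptions made for the example.

```python
from collections import defaultdict


class AdaptiveTrigram:
    """Base trigram model interpolated with counts from the user's own input."""

    def __init__(self, base, weight=0.5):
        self.base = base          # dict: (w1, w2, w3) -> base probability
        self.weight = weight      # assumed share given to the user cache
        self.user_counts = defaultdict(int)     # trigram counts from the user
        self.context_counts = defaultdict(int)  # (w1, w2) counts from the user

    def observe(self, sentence):
        """Record a sentence the user confirmed."""
        padded = ["<s>", "<s>"] + list(sentence)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            self.user_counts[context + (padded[i],)] += 1
            self.context_counts[context] += 1

    def prob(self, w1, w2, w3):
        """Return the adapted probability of w3 after the context (w1, w2)."""
        base_p = self.base.get((w1, w2, w3), 1e-6)  # assumed floor for unseen
        seen = self.context_counts.get((w1, w2), 0)
        if seen == 0:
            return base_p  # no user evidence for this context yet
        user_p = self.user_counts.get((w1, w2, w3), 0) / seen
        return (1 - self.weight) * base_p + self.weight * user_p


if __name__ == "__main__":
    model = AdaptiveTrigram({("<s>", "<s>", "中"): 0.6, ("<s>", "<s>", "重"): 0.1})
    for _ in range(3):
        model.observe("重")
    print(model.prob("<s>", "<s>", "重") > model.prob("<s>", "<s>", "中"))
```

In this sketch a character the user keeps choosing at the start of a sentence gradually outranks the base model's default, which is the kind of preference adaptation the learner is meant to provide.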

TODO List

This project is suspended as of 2004 and will probably not be developed for a long time. The main reason is that I have become an experienced WuBi user and am satisfied with my current input speed of 35 to 60 characters per minute. I have therefore lost interest in developing a PinYin solution that is actually much slower than WuBi (at least for me). If you are a PinYin user and have not tried WuBi before, I recommend giving it a try; you will be well rewarded in the end.