My Past Research at

This page summarizes my past research interests at

The Beginning

In the autumn of 2001, I began my life as a graduate student at Northeastern University with no idea what research was or what kind of research I would like to do. I consulted Prof. Lu Zhen-de, my undergraduate thesis supervisor, who thought Prof. Yao Tian-shun's lab might interest me. After I called Prof. Yao, we had a long talk in his office about issues such as current research directions, working attitude, and lifelong goals. That talk changed my life. From then on I was a master's student at the NLPLab, specializing in Computational Linguistics.

Research Interests

A look back and a look ahead

The past few decades have witnessed significant progress in computer science: today's desktop PC is much more powerful than most workstations of ten years ago. The boom of the Internet, the invention of electronic mail, and the increasing popularity of WWW search engines have changed the way millions of people live. Computer software has become so easy to use that even people who are not computer literate can enjoy the convenience computers bring without difficulty.

The advancement of science will never stop. Here at the beginning of the 21st century, I believe computer science will continue to influence the way people live, and that we will lead better lives tomorrow.

Natural Language Processing

Currently my research interest is Natural Language Processing (NLP). The main object of study in NLP is natural language, such as Chinese or English. The idea of using computers to process natural language took shape soon after the invention of the first computer. Yet only recently have computers become powerful enough to deal with some fundamental problems in natural language, such as assigning the correct part of speech to each word. The complex nature of natural language makes NLP a very challenging field. Perhaps the most successful large-scale NLP application today is Google, which is a large Information Retrieval system.

Statistical Language Modeling

There are many interesting fields under the name Natural Language Processing, such as Information Retrieval (IR), Machine Translation (MT), Text Categorization, and Statistical Language Modeling (SLM). I am now particularly interested in Statistical Language Modeling.

The goal of statistical language modeling is to build a statistical model that estimates the probability distribution of natural language as accurately as possible. By expressing various language phenomena in terms of simple parameters in a statistical model, SLMs provide an easy way to handle complex natural language on a computer.
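As a rough illustration of what "simple parameters" can mean, here is a minimal sketch of a bigram model with add-one smoothing. It is only one of the simplest possible SLMs, not the model used in my research, and the function names and the toy corpus are assumptions made up for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect history and bigram counts from tokenized sentences."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded[1:])          # words that can be predicted
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1            # how often prev is a context
            bigrams[(prev, cur)] += 1
    return history, bigrams, vocab

def bigram_prob(prev, cur, history, bigrams, vocab):
    """P(cur | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, cur)] + 1) / (history[prev] + len(vocab))

def sentence_prob(tokens, history, bigrams, vocab):
    """Probability of a sentence as a product of bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_prob(prev, cur, history, bigrams, vocab)
    return p

# Toy corpus, invented for illustration only.
corpus = [["i", "saw", "the", "sea"], ["i", "saw", "the", "ship"]]
model = train_bigram(corpus)
likely = sentence_prob(["i", "saw", "the", "sea"], *model)
unlikely = sentence_prob(["sea", "the", "saw", "i"], *model)
print(likely > unlikely)
```

The model's parameters are just the counts: a word order seen in the training data receives a higher probability than a scrambled one, while smoothing keeps unseen word pairs from getting zero probability.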

The original (and still the most important) application of SLMs is speech recognition, but SLMs also play a vital role in natural language applications as diverse as machine translation, part-of-speech tagging, intelligent input methods, and text-to-speech systems. You can find more on this topic on my Statistical Language Modeling page.

Statistical Machine Translation

MT is probably the most difficult problem in Natural Language Processing, yet it is also the most fascinating one.

Our lab has a solid background in Machine Translation research. I participated in research on Extended-Chunk based Machine Translation, a project funded by the National 973 program. The highlight of our approach is the use of multi-word translation pairs (called Extended-Chunks) to improve the output of a template-based MT system. While fighting the data sparseness problem in building a statistical MT system, we found that multi-word translation pairs can greatly improve the translation of some sentences. For example, in our prototype system the English sentence "I saw the heavy sea" is translated beautifully into "我看见了波涛汹涌的大海", because the pair "the heavy sea | 波涛汹涌的大海" can be found in our extended chunk base.
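The idea of preferring a stored multi-word pair over word-by-word translation can be sketched roughly as follows. This is only a toy illustration under my own assumptions, not the actual prototype: the function `translate_with_chunks`, the greedy longest-match strategy, and the tiny word dictionary are all hypothetical; only the pair "the heavy sea | 波涛汹涌的大海" comes from the example above.

```python
def translate_with_chunks(sentence, chunk_base, word_dict, max_len=4):
    """Greedy left-to-right translation that prefers multi-word chunks."""
    tokens = sentence.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest multi-word chunk starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in chunk_base:
                out.append(chunk_base[phrase])
                i += n
                break
        else:
            # No chunk matched: fall back to a single-word lookup,
            # keeping the source word if it is unknown.
            out.append(word_dict.get(tokens[i], tokens[i]))
            i += 1
    return "".join(out)

# The chunk pair is from the text; the word entries are toy assumptions.
chunk_base = {"the heavy sea": "波涛汹涌的大海"}
word_dict = {"i": "我", "saw": "看见了"}
print(translate_with_chunks("I saw the heavy sea", chunk_base, word_dict))
```

Without the chunk entry, "heavy" would have to be translated on its own, which is exactly where a word-level system tends to go wrong.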

Together with Dr. LV Xue-qiang, Ren Fei-liang and Wu Hong-lin, I devised several new algorithms and models for acquiring multi-word units from large monolingual corpora, performing sentence- and word-level alignment, and fast template matching based on different similarity measures. More information can be found on our MT project page.

There is also a screenshot, by Yang Er-bao.

Chinese Natural Language Processing

Chinese is quite different from Western languages. One characteristic of written Chinese is that it runs continuously, with no separators between words. As a result, Natural Language Processing for Chinese (and for other East Asian languages such as Japanese and Korean) requires a word segmentation step to split sentences into words (there is much controversy among native researchers over whether Chinese needs a segmentation procedure at all). The main question of interest in word segmentation is: what is a word in Chinese?
You can find more information on Chinese NLP on our lab's homepage.
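To give a feel for what a segmentation step does, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not necessarily the method used in our lab, and the function name and the tiny lexicon are assumptions made up for this example.

```python
def forward_max_match(text, lexicon, max_word_len=4):
    """Segment text by taking the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            # A single character is always accepted as a last resort.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

# Toy lexicon, invented for illustration only.
lexicon = {"看见", "波涛", "波涛汹涌", "大海"}
print(forward_max_match("我看见了波涛汹涌的大海", lexicon))
```

The result depends entirely on what the lexicon counts as a word, which is one reason the question of what a word is in Chinese matters so much in practice.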