N-Gram Extraction Tools
This is a set of tools for manipulating N-grams from a raw corpus. The
program implements [Nagao94]'s arbitrary N-gram extraction algorithm and can
extract both word N-grams (for languages with explicit word boundaries, such
as English) and character N-grams (for East Asian languages such as Chinese)
up to 255-grams.
Text is represented internally as Unicode (UCS-2), so in principle the
program can handle any text that can be converted to (or from) UCS-2. The
default input/output encoding is UTF-8. The current implementation focuses
on processing Chinese (GBK) and English (ISO-8859-1) text.
The tools provided here distinguish themselves from other N-Gram
extraction programs in that:
- Unicode code base
All text is represented internally as Unicode (UCS-2), which eases the
handling of East Asian languages such as Chinese, Korean or Japanese.
- Both word N-grams and character N-grams can be extracted
This feature is especially useful for East Asian language processing: in
those languages (like Chinese) there is no explicit boundary between
words, and the basic processing units are usually single characters.
- Statistical Substring Reduction (SSR) algorithms
Statistical Substring Reduction is a procedure that removes redundant
N-grams from an N-gram set using N-gram frequency information. For
example, if both "People's Republic of China" and "People's Republic" occur
10 times in the corpus, the latter is removed, since it is a substring of
the former and carries no extra statistical information. Details can be
found in the SSR paper listed in the references below.
The programs are all open source and are distributed under MIT license.
Three programs are provided: ``text2ngram'', ``extractngram'' and
``strreduction''. Run any of them with the ``-h'' option to see a brief
usage message.
Extracting N-Gram statistics from a raw corpus
text2ngram extracts word- or character-level N-Gram statistics from a raw
corpus. Here are some examples:
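As a rough illustration of what character-level extraction computes (a sketch of the idea only, not text2ngram's actual code or options):

```python
def char_ngrams(text, n):
    """Count every character n-gram in `text`.

    Illustrative only: text2ngram does this kind of counting at scale,
    for word- as well as character-level N-grams.
    """
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return counts


print(char_ngrams("abab", 2))  # {'ab': 2, 'ba': 1}
```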
Performing Statistical Substring Reduction on an acquired N-Gram set
strreduction implements four Statistical Substring Reduction
algorithms. Here are some examples:
Perform word-level SSR on the input stream using algorithm 2 (-a2), and
write the result to output. The input format is one "word frequency" pair
per line.
strreduction -a2 < input > output
The same as the example above, but this time we use character-level SSR on a
Chinese input stream encoded in GBK (-F gbk). The output is GBK too (-T gbk).
The input format is again one "word frequency" pair per line.
strreduction -a2 -c -F gbk -T gbk < input > output
The same as the example above, but we use a reduction threshold of 3 instead
of the default value of 1 (-f 3). The output is also sorted (-s), and the
SSR processing time is printed on stderr (-t).
strreduction -a2 -c -F gbk -T gbk -s -t -f 3 < input > output
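For illustration, input and output of the first example might look like the following. The exact separator and ordering are assumptions here, and the frequencies are invented; only the one-pair-per-line layout is stated by the tool's description:

```
# input (one "word frequency" pair per line)
People's Republic of China 10
People's Republic 10
Republic of 25

# output after SSR: the redundant substring entry is gone
People's Republic of China 10
Republic of 25
```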
- Statistical Substring Reduction in Linear Time and Space.
Xueqiang Lv, Le Zhang and Junfeng Hu. IJCNLP-04, Hainan Island,
P.R. China.
- LV Xue-qiang. Research of E-Chunk Acquisition and Application in
Machine Translation. Ph.D. dissertation (in Chinese), Northeastern
University, Shenyang, China, Jan. 2003.
- A Statistical Approach to Extract Chinese Chunk Candidates from Large
Corpora. Zhang Le, LV Xue-qiang, SHEN Yan-na, YAO Tian-shun. In
Proceedings of the 20th International Conference on Computer Processing
of Oriental Languages (ICCPOL 2003), Shenyang, P.R. China.
- [Nagao94] A New Method of N-gram Statistics for Large Number of n and
Automatic Extraction of Words and Phrases from Large Text Data of Japanese.
Makoto Nagao and Shinsuke Mori. The 15th International Conference on
Computational Linguistics (COLING 1994).