N-Gram Extraction Tools



This is a set of tools for extracting and manipulating N-grams from a raw corpus. The program implements the arbitrary N-gram extraction algorithm of [Nagao94] and can extract both word N-grams (for Latin-script languages such as English) and character N-grams (for East Asian languages such as Chinese), up to 255-grams.
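The idea behind this kind of extraction can be sketched in a few lines of Python. This is only an illustration of the sorted-suffix approach, not the code used by the tools: the function names build_tables and count_ngrams are hypothetical, and the max_n default of 255 is an assumption that simply mirrors the 255-gram limit mentioned above. All suffixes of the text are sorted, the length of the common prefix between neighbouring suffixes is recorded, and N-grams of any length are then counted in a single pass over the sorted table.

```python
def build_tables(tokens, max_n=255):
    """Sort suffix start positions and record shared-prefix lengths.

    `tokens` may be a list of words or a string of characters.
    """
    # Pointer table: suffix start positions, ordered by their first max_n tokens.
    pointers = sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_n])
    # coincidence[k]: number of leading tokens shared by sorted suffixes k-1 and k.
    coincidence = [0] * len(pointers)
    for k in range(1, len(pointers)):
        a, b = pointers[k - 1], pointers[k]
        length = 0
        while (length < max_n
               and a + length < len(tokens)
               and b + length < len(tokens)
               and tokens[a + length] == tokens[b + length]):
            length += 1
        coincidence[k] = length
    return pointers, coincidence


def count_ngrams(tokens, n, pointers, coincidence):
    """Count every n-gram of length n by scanning the sorted suffix table."""
    counts = {}
    last = None
    for k, start in enumerate(pointers):
        if start + n > len(tokens):
            continue  # this suffix is shorter than n tokens
        if coincidence[k] >= n:
            # Same first n tokens as the previous suffix: same n-gram.
            counts[last] += 1
        else:
            last = tuple(tokens[start:start + n])
            counts[last] = 1
    return counts


# Word-level use: split on whitespace first.
words = "the cat the cat sat".split()
pointers, coincidence = build_tables(words)
bigrams = count_ngrams(words, 2, pointers, coincidence)

# Character-level use: pass the string itself.
text = "abab"
char_pointers, char_coincidence = build_tables(text)
char_bigrams = count_ngrams(text, 2, char_pointers, char_coincidence)
```

Because the tables are built once, counts for several values of n can be read off the same sorted table without re-scanning the corpus.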

Text is represented internally as Unicode (UCS-2), so in theory the program can handle any text that can be converted to and from UCS-2. The default input/output encoding is UTF-8. The current implementation focuses on processing Chinese (GBK) and English (ISO-8859-1) text.
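As a rough illustration of what "convertible to UCS-2" means in practice, the Python sketch below checks whether a piece of text stays within the 16-bit range that UCS-2 can represent, and converts GBK bytes to UTF-8. The helper names fits_ucs2 and gbk_to_utf8 are hypothetical and are not part of the tools; the sketch only uses Python's standard codecs.

```python
def fits_ucs2(text: str) -> bool:
    """True if every character is a single 16-bit code unit in UCS-2.

    Characters above U+FFFF and lone surrogates are rejected.
    """
    return all(
        ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
        for ch in text
    )


def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8 by decoding to Unicode first."""
    return raw.decode("gbk").encode("utf-8")


sample = "中文 text"
gbk_bytes = sample.encode("gbk")
utf8_bytes = gbk_to_utf8(gbk_bytes)
```

Text that fails the fits_ucs2 check (for example, characters outside the Basic Multilingual Plane) would presumably need special handling before being fed to a UCS-2 based program.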

The tools provided here distinguish themselves from other N-Gram extraction programs in that:


The programs are all open source and are distributed under the MIT license.



Three programs are provided: "text2ngram", "extractngram" and "strreduction". Run any of them with the "-h" option to see a brief usage summary.

Extract N-Gram statistics from raw corpus

text2ngram extracts word-level or character-level N-gram statistics from a raw corpus. Here are some examples:

Performing Statistical Substring Reduction on acquired N-Gram statistics

strreduction implements four Statistical Substring Reduction algorithms. Here are some examples:
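To make the idea of the reduction concrete, here is a naive Python sketch. It is not one of the four algorithms implemented by strreduction and does not show the tool's command line. It assumes the basic form of the reduction, in which an N-gram is dropped when some longer N-gram in the set contains it and has the same frequency; the names contains and reduce_substrings are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def reduce_substrings(freq):
    """Drop each N-gram that is covered by a longer one of equal frequency.

    `freq` maps N-grams (tuples of tokens) to their counts.
    This is a quadratic-time illustration, not an efficient algorithm.
    """
    kept = {}
    for gram, count in freq.items():
        redundant = any(
            len(other) > len(gram)
            and other_count == count
            and contains(other, gram)
            for other, other_count in freq.items()
        )
        if not redundant:
            kept[gram] = count
    return kept


freq = {
    ("new",): 5,
    ("york",): 3,
    ("new", "york"): 3,
}
reduced = reduce_substrings(freq)
```

In this example ("york",) is removed because it only ever occurs inside ("new", "york"), while ("new",) is kept because it also occurs elsewhere.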