This page lists software I wrote or maintain. Some of it is really useful, some was written just for fun. All of it is freeware and placed here in the hope that it may be useful, but WITHOUT ANY WARRANTY.

Zhcon

As I became familiar with Linux, I found that it lacked support for displaying and inputting Chinese characters (Han Zi) on the console. Although several Chinese console environments were available at the time, none of them met my needs, so I developed a small program called zhcon.

Zhcon is a fast Chinese/Japanese/Korean (CJK) console environment for Linux. It adds the ability to display and input CJK multi-byte characters on the Linux console in a device-independent way by running on top of the Linux frame-buffer device. With the help of many warm-hearted people across the net, zhcon has gained a popularity in the Chinese Linux community that I never expected when I first released it.

Currently I'm looking for a way to add UTF-8 support to zhcon, so that the next version can display multi-language UTF-8 text on one console. Update: zhcon-0.2.5 now supports UTF-8.

Have a look at http://zhcon.sourceforge.net if you want to know more about zhcon.

Morphix-NLP Live CD

Morphix-NLP Live CD is a Linux distribution with a rich collection of NLP applications that runs directly off a CD. It is based on the Morphix project, a variant of the Knoppix live Linux distribution. The highlight of this CD is the variety of NLP software included, ranging from POS taggers, Chinese word segmenters and machine learning toolkits to head-driven parsers.

The idea of building such a CD came to my mind as I found more and more interesting NLP software around the net. Unfortunately, most of it cannot be found in mainstream Linux distributions and has a rather complex build procedure (some packages require patches to work on modern platforms). Thinking that such a CD might be useful for NLP students and researchers, as well as non-academic users who just want to see what NLP can do, I started this project.

libconfig

libconfig is an ISO C++ port of the KDE KConfig class. It has almost the same functionality as KConfig in the KDE libraries and runs on more platforms. You may need this library if you want a flexible config file parser in C++. You can download libconfig from here.

Maximum Entropy Modeling Toolkit for Python and C++

The Maximum Entropy (Maxent) model is a general-purpose machine learning technique that has been successfully applied to Computer Vision, Spatial Physics, Natural Language Processing and many other fields. Want to know more about maxent? See this page. This toolkit implements common algorithms and tools used in Maximum Entropy Modeling, including:

  • Conditional Maximum Entropy Model
  • L-BFGS Parameter Estimation
  • GIS Parameter Estimation (partly implemented)
  • Gaussian Prior Smoothing
  • C++ API
  • Python maxent module

A state-of-the-art Part-of-Speech tagger (in Python) is included as a sample application of the toolkit. When trained on sections 00-18 of the WSJ corpus and tested on sections 19-24, this tagger achieves an overall word accuracy of 96.64%.
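
To make the terms in the feature list concrete, here is a minimal sketch of a conditional maximum entropy model and the Gaussian-prior objective that a parameter estimator such as L-BFGS or GIS would maximize. It is an illustration only and does not use the toolkit's API: the function names, the feature strings and the weights are all made up for this example.

```python
import math

def conditional_probs(weights, context, labels):
    """Return p(y | x) for every label y.

    weights maps (feature, label) pairs to real-valued parameters;
    context is the list of features active for the input x.
    """
    scores = {y: sum(weights.get((f, y), 0.0) for f in context) for y in labels}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())    # normalization term Z(x)
    return {y: e / z for y, e in exps.items()}

def penalized_log_likelihood(weights, data, labels, sigma2=1.0):
    """Log-likelihood of (context, label) pairs minus a Gaussian prior penalty."""
    ll = sum(math.log(conditional_probs(weights, x, labels)[y]) for x, y in data)
    penalty = sum(w * w for w in weights.values()) / (2.0 * sigma2)
    return ll - penalty

# Hypothetical POS-tagging features and weights, chosen by hand.
labels = ["NN", "VBG"]
weights = {("suffix=ing", "VBG"): 1.5, ("prev=the", "NN"): 1.0}
probs = conditional_probs(weights, ["suffix=ing"], labels)
data = [(["suffix=ing"], "VBG"), (["prev=the"], "NN")]
objective = penalized_log_likelihood(weights, data, labels)
```

A smaller `sigma2` pulls the weights harder toward zero, which is what the smoothing amounts to in this formulation.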

N-gram Extraction Tool

This is a set of tools for extracting arbitrary N-grams from a raw corpus. It is written in C++ with full Unicode support (the default encoding is UTF-8). The extraction algorithm is a slightly modified version of the one described in [Nagao94]. Both word-level and character-level N-grams are supported. There is also a substring reduction tool that performs the statistical substring reduction described in (Zhang Le et al. 2003).
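
As a rough illustration of what the tools compute, the Python sketch below counts word-level and character-level N-grams and then applies a naive form of substring reduction, dropping an N-gram when a longer N-gram that contains it has the same frequency. This is not the [Nagao94] algorithm or the tool's actual implementation; the function names are hypothetical, and the reduction shown is a simple quadratic version written only to show the idea.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Count every contiguous run of n items in a sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def reduce_substrings(counts):
    """Drop an N-gram if a longer N-gram containing it has the same count."""
    def contains(longer, shorter):
        k = len(shorter)
        return any(longer[i:i + k] == shorter
                   for i in range(len(longer) - k + 1))
    return {
        g: c for g, c in counts.items()
        if not any(len(h) > len(g) and counts[h] == c and contains(h, g)
                   for h in counts)
    }

words = "to be or not to be".split()
word_bigrams = extract_ngrams(words, 2)                # word level
char_trigrams = extract_ngrams(list("自然语言处理"), 3)  # character level

# Merge unigram and bigram counts, then remove redundant substrings.
all_counts = extract_ngrams(words, 1) + word_bigrams
reduced = reduce_substrings(all_counts)
```

In this tiny example every unigram occurs exactly as often as some bigram that contains it, so only the bigrams survive the reduction.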

CipSeg: A Chinese Word Segmenter

This is the Unix port of CipSeg, the Chinese Word Segmenter developed by the Natural Language Processing Lab (http://www.nlplab.cn). The segmenter is free of charge for research purposes. Binary versions are available for GNU/Linux and FreeBSD.
