Title: Chinese word segmentation
Place: Chinese Academy of Sciences, 2009
  • Completed.
Keywords: Natural Language Processing, Information extraction

This project was part of the "Culture Grid - Dunhuang" project in the Knowledge Grid Group. Dunhuang is an important tourist attraction and the subject of many ongoing archaeological projects. The goal of "Culture Grid" is to manage the continuing progress of archaeological research and to improve collaboration among the sociologists involved. Since most of the manuscripts are written in classical Chinese, my work was to implement information extraction for classical Chinese.


In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word delimiter. Chinese, however, has no equivalent character. Chinese word segmentation is challenging because it is often difficult to define what constitutes a word in Chinese.

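For example, how should this sentence be segmented:
"请将军服装入袋中" ?
"请/将军/服装/入/袋中" (please / general / clothing / enter / in the bag) or "请/将/军服/装入/袋中" (please / [object marker] / military uniform / put into / in the bag)?
In fact, even Google could not segment it correctly. Check Google's result here.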
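
The ambiguity can be made concrete by listing every segmentation that a dictionary allows. The following is only a minimal sketch: the lexicon is a hypothetical toy word list chosen for this one sentence, and the function name is illustrative.

```python
def all_segmentations(text, lexicon):
    """Return every way to split text into words that all appear in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon (an assumption, not the project's dictionary).
LEXICON = {"请", "将", "将军", "军服", "服装", "装入", "入", "袋中"}

for words in all_segmentations("请将军服装入袋中", LEXICON):
    print("/".join(words))
# → 请/将/军服/装入/袋中
# → 请/将军/服装/入/袋中
```

Both candidates are consistent with the dictionary, so a segmenter needs extra evidence (word statistics or context) to choose between them.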


Because it was not feasible to build a statistical corpus for classical Chinese, I based my solution on a multi-layer HMM and used statistical methods as a supplement.
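
As background on the HMM part, decoding an HMM is usually done with the Viterbi algorithm. The sketch below is a generic single-layer decoder, not the project's multi-layer model: the B/E/S tag set (word begin, word end, single-character word) and all probabilities are made-up assumptions for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    if not observations:
        return []

    def logp(p):
        # Floor zero probabilities so that log() is always defined.
        return math.log(max(p, 1e-12))

    # Each column maps state -> (best log-probability, previous state).
    columns = [{
        s: (logp(start_p.get(s, 0.0)) + logp(emit_p[s].get(observations[0], 0.0)), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev_col = columns[-1]
        col = {}
        for s in states:
            score, prev = max(
                (prev_col[p][0] + logp(trans_p[p].get(s, 0.0)), p) for p in states
            )
            col[s] = (score + logp(emit_p[s].get(obs, 0.0)), prev)
        columns.append(col)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for col in reversed(columns[1:]):
        state = col[state][1]
        path.append(state)
    return path[::-1]


# Toy model; every number here is invented for the example.
STATES = ["B", "E", "S"]
START = {"B": 0.5, "S": 0.5}
TRANS = {"B": {"E": 1.0}, "E": {"B": 0.5, "S": 0.5}, "S": {"B": 0.5, "S": 0.5}}
EMIT = {
    "B": {"滦": 0.9, "在": 0.1},
    "E": {"南": 0.9, "理": 0.1},
    "S": {"在": 0.8, "滦": 0.1, "南": 0.1},
}

tags = viterbi(list("在滦南"), STATES, START, TRANS, EMIT)
print(tags)  # → ['S', 'B', 'E'], i.e. 在 / 滦南
```

A multi-layer design would presumably run decoders like this one at several levels (for example, one for names and one for words), but how the layers were arranged in this project is not described here.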

Here is an example of the processing steps:
1. Pre-segmentation =>
始##始 /王/晓/平/在/12/月/份/滦/南/大/会/上/说/的/确/实/在/理/末##末.
Numbers, English words, and special symbols in the sentence are extracted as single units; every other character becomes its own unit.
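
A pre-segmentation step of this kind might look like the sketch below. The regular expression and function name are assumptions; only the sentinels 始##始 and 末##末 come from the example above.

```python
import re

BEGIN, END = "始##始", "末##末"

# A run of ASCII digits, a run of Latin letters, or any other single
# non-whitespace character (Chinese characters, punctuation, symbols).
ATOM = re.compile(r"[0-9]+|[A-Za-z]+|\S")


def pre_segment(sentence):
    """Split a sentence into atomic units and add the sentence sentinels."""
    return [BEGIN] + ATOM.findall(sentence) + [END]


print(" /".join(pre_segment("王晓平在12月份滦南大会上说的确实在理")))
# → 始##始 /王/晓/平/在/12/月/份/滦/南/大/会/上/说/的/确/实/在/理/末##末
```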

2. Domain-specific word detection =>
始##始 /王/晓/平/在/12/月/份/滦南/大/会/上/说/的/确/实/在/理/末##末.
Some special words are common in classical Chinese, such as "乙丑年" (a year name in the sexagenary cycle) and "滦南" (a place name).

3. Forward maximum matching segmentation =>
始##始 /王/晓/平/在/12/月份/滦南/大/会/上/说/的/确实/在理/末##末.
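
Forward maximum matching scans left to right and always takes the longest dictionary word that starts at the current position. A minimal sketch, assuming a hypothetical toy lexicon and a maximum word length of 4 (both are illustrative, not the project's settings):

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or word in lexicon:
                tokens.append(word)
                i += size
                break
    return tokens


LEXICON = {"月份", "确实", "在理"}  # hypothetical toy lexicon
print("/".join(forward_max_match("说的确实在理", LEXICON)))
# → 说/的/确实/在理
```

Because the method is greedy, the result depends heavily on the lexicon: adding "的确" to the toy lexicon would change the output to 说/的确/实/在理, which is why a statistical pass is useful afterwards.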

4. NShortPath segmentation =>
始##始 /王/晓/平/在/12/月份/滦南/大会/上/说/的/确实/在理/末##末.
A statistical model built for modern Chinese is applied to the classical Chinese text.
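
The idea behind an N-shortest-path segmenter is to treat every candidate word as an edge in a graph over character positions, give each edge a cost from word frequencies, and keep the N cheapest paths. The following is a simplified sketch of that idea, not the project's implementation: the unigram cost, the smoothing value for unseen characters, and the toy frequencies are all assumptions.

```python
import heapq
import math


def n_shortest_paths(text, word_freq, n=2, max_len=4):
    """Return up to n (cost, words) segmentations of text, cheapest first."""
    total = sum(word_freq.values())

    def cost(word):
        # Negative log relative frequency; unseen characters get a small count.
        return -math.log(word_freq.get(word, 0.5) / total)

    # best[i] holds up to n (cost, words) pairs for the prefix text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Multi-character edges must be known words; single characters always count.
            if len(word) > 1 and word not in word_freq:
                continue
            for prefix_cost, words in best[start]:
                candidates.append((prefix_cost + cost(word), words + [word]))
        best[end] = heapq.nsmallest(n, candidates, key=lambda item: item[0])
    return best[len(text)]


# Toy frequencies, uniform on purpose so the path with the fewest words wins.
FREQ = {w: 10 for w in ["确", "实", "在", "理", "确实", "实在", "在理"]}
paths = n_shortest_paths("确实在理", FREQ, n=2)
print("/".join(paths[0][1]))  # → 确实/在理
```

Keeping several paths instead of one lets a later step, such as name detection, rescore the alternatives.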

5. Time and name entity detection =>
始##始 /王晓平/在/12月份/滦南/大会/上/说/的/确实/在理/末##末.
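
The effect of this last step can be imitated with two simple merge rules. This is a crude rule-based stand-in, not the method used in the project: the surname list, the time-unit list, and the rule "surname plus up to two single characters" are all hypothetical.

```python
TIME_UNITS = {"年", "月", "月份", "日", "时", "分"}  # hypothetical list
SURNAMES = {"王", "李", "张", "刘", "陈"}            # hypothetical list


def merge_entities(tokens):
    """Merge number + time unit, and surname + up to two single characters."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok.isdigit() and nxt in TIME_UNITS:
            # Time expression such as 12 + 月份.
            out.append(tok + nxt)
            i += 2
        elif tok in SURNAMES:
            # Take at most two following single-character tokens as the given name.
            j = i + 1
            while j < len(tokens) and j - i <= 2 and len(tokens[j]) == 1:
                j += 1
            out.append("".join(tokens[i:j]))
            i = j
        else:
            out.append(tok)
            i += 1
    return out


tokens = ["王", "晓", "平", "在", "12", "月份", "滦南", "大会", "上", "说", "的", "确实", "在理"]
print("/".join(merge_entities(tokens)))
# → 王晓平/在/12月份/滦南/大会/上/说/的/确实/在理
```

A fixed rule like this over-merges easily (any single character after a surname is absorbed), which is one reason a probabilistic tagger is a better fit for name detection.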

The final result is: 始##始 /王晓平/在/12月份/滦南/大会/上/说/的/确实/在理/末##末.