[LUCENE-1629] contrib intelligent Analyzer for Chinese - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.1
Fix Version/s: 2.9
Component/s: modules/analysis
Labels: None
Environment: Java 1.5 or higher, Lucene 2.4.1
Lucene Fields: New, Patch Available

Description

I wrote an Analyzer for Apache Lucene that analyzes sentences in Chinese. It is called "imdict-chinese-analyzer", and the project is hosted on Google Code: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I), "是" (am), "中国人" (Chinese), not as "我", "是中", "国人". The analyzer must therefore segment each sentence properly; otherwise the index built by Lucene will be full of misinterpreted terms, and the accuracy of the search engine will suffer seriously.

The Apache repository already contains two analyzer packages that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, but they treat each character, or every two adjoining characters, as a single word. This obviously does not match how Chinese is actually written, and the strategy also increases the index size and hurts performance badly.
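To make the difference concrete, here is a minimal sketch of the two strategies described above: one token per character, and one token per pair of adjoining characters. It does not call the Lucene classes; NGramSketch and its methods are hypothetical names used only for illustration, and the code assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the ChineseAnalyzer or CJKAnalyzer code.
class NGramSketch {
    // One token per character (assumes each character is a single Java char).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // One token per pair of adjoining characters (overlapping bigrams).
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人";
        System.out.println(unigrams(sentence)); // [我, 是, 中, 国, 人]
        System.out.println(bigrams(sentence));  // [我是, 是中, 中国, 国人]
    }
}
```

Neither output contains the word "中国人", and the bigram output contains terms such as "是中" that are not words at all.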

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
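As a rough illustration of probability-driven segmentation, the sketch below picks the most probable split of a sentence with a Viterbi-style dynamic program over a toy word list. It is a deliberate simplification (a unigram word model), not the HMM implementation in imdict-chinese-analyzer. The class name SegmenterSketch, the dictionary entries, and all probability values are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maximum-probability segmentation over a toy dictionary.
class SegmenterSketch {
    // Toy dictionary: word -> probability. All values are made up.
    static final Map<String, Double> DICT = new HashMap<>();
    static {
        DICT.put("我", 0.05);
        DICT.put("是", 0.05);
        DICT.put("中", 0.01);
        DICT.put("国", 0.01);
        DICT.put("人", 0.02);
        DICT.put("中国", 0.02);
        DICT.put("国人", 0.001);
        DICT.put("中国人", 0.01);
    }
    // Assumed penalty for a single character missing from the dictionary.
    static final double UNKNOWN_LOG_PROB = Math.log(1e-8);
    static final int MAX_WORD_LEN = 4;

    static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best log-probability of text[0, end)
        int[] prev = new int[n + 1];       // start index of the last word on that best path
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
                Double p = DICT.get(text.substring(start, end));
                double logProb;
                if (p != null) {
                    logProb = Math.log(p);
                } else if (end - start == 1) {
                    logProb = UNKNOWN_LOG_PROB; // unknown single character is still allowed
                } else {
                    continue; // unknown multi-character strings are not candidate words
                }
                double score = best[start] + logProb;
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the sentence to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // [我, 是, 中国人]
    }
}
```

With these invented probabilities, the split "我", "是", "中国人" scores higher than any split through "中国" or "国人", which matches the segmentation the description asks for.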

As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.

Attachments

- analysis-data.zip, 07/May/09 06:21, 2.02 MB, Xiaoping Gao
- bigramdict.mem, 11/May/09 05:48, 4.60 MB, Xiaoping Gao
- build-resources.patch, 14/May/09 07:05, 1 kB, Uwe Schindler
- build-resources.patch, 11/May/09 21:39, 7 kB, Uwe Schindler
- build-resources-with-folder.patch, 14/May/09 09:07, 8 kB, Uwe Schindler
- coredict.mem, 11/May/09 05:48, 1.51 MB, Xiaoping Gao
- LUCENE-1629-encoding-fix.patch, 14/May/09 14:13, 0.8 kB, Uwe Schindler
- LUCENE-1629-java1.4.patch, 11/May/09 05:11, 139 kB, Xiaoping Gao

People

Assignee: Michael McCandless

Reporter: Xiaoping Gao

Votes: 0

Watchers: 6

Dates

Created: 04/May/09 04:51

Updated: 28/Aug/22 12:00

Resolved: 14/May/09 10:16