Lucene - Core / LUCENE-1629

contrib intelligent Analyzer for Chinese

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Java 1.5 or higher, Lucene 2.4.1

    • Lucene Fields:
      New, Patch Available

      Description

      I wrote an Analyzer for Apache Lucene that analyzes sentences in Chinese. It is called "imdict-chinese-analyzer", and the project is hosted on Google Code: http://code.google.com/p/imdict-chinese-analyzer/

      In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not as "我" "是中" "国人". The analyzer must therefore segment each sentence correctly; otherwise the index that Lucene builds will be full of wrong terms, and the accuracy of the search engine will suffer seriously.

      There are already two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, but they treat each character (ChineseAnalyzer) or every two adjoining characters (CJKAnalyzer) as a single word. This obviously does not match how Chinese is actually written, and the strategy also increases the index size and hurts performance badly.
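      To make the two existing strategies concrete, here is a minimal plain-Java sketch of single-character and overlapping two-character tokenization. It does not use the Lucene API; the class and method names are hypothetical and only illustrate the idea, and it assumes every character fits in one Java char (true for the example sentence).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene or imdict-chinese-analyzer.
public class NaiveCjkTokenizers {

    // One token per character (the single-character strategy).
    // Assumes each character is a single Java char (no surrogate pairs).
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // One token per pair of adjoining characters, with overlap
    // (the two-character strategy).
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人";
        System.out.println(unigrams(sentence)); // [我, 是, 中, 国, 人]
        System.out.println(bigrams(sentence));  // [我是, 是中, 中国, 国人]
    }
}
```

      Neither output contains the word "中国人", and the bigram output contains "是中", which is not a word at all. The bigram strategy also emits four overlapping two-character terms for a five-character sentence, which shows where the extra index size comes from.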

      The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. According to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", the tokenization accuracy of this model is above 90%, while that of other analyzers is about 60%.
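      As a rough illustration of probability-driven segmentation, the sketch below picks the most probable split of a sentence with dynamic programming over a tiny word list. This is a deliberately simplified word-probability model, not the HMM used by imdict-chinese-analyzer; the class name, the dictionary entries, and every probability in it are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy segmenter; not the imdict-chinese-analyzer implementation.
public class ToySegmenter {

    // Made-up word probabilities, for illustration only.
    static final Map<String, Double> DICT = new HashMap<>();
    static {
        DICT.put("我", 0.05);
        DICT.put("是", 0.05);
        DICT.put("中", 0.01);
        DICT.put("国", 0.01);
        DICT.put("人", 0.02);
        DICT.put("中国", 0.02);
        DICT.put("国人", 0.001);
        DICT.put("中国人", 0.01);
    }

    static final int MAX_WORD_LEN = 3;
    // Penalty for a single character that is not in the dictionary.
    static final double UNKNOWN_LOG_PROB = Math.log(1e-8);

    // Returns the split whose words have the highest product of probabilities.
    public static List<String> segment(String text) {
        int n = text.length();
        double[] best = new double[n + 1]; // best log-probability of text[0, end)
        int[] prev = new int[n + 1];       // start index of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - MAX_WORD_LEN); start < end; start++) {
                String word = text.substring(start, end);
                Double p = DICT.get(word);
                double logP;
                if (p != null) {
                    logP = Math.log(p);
                } else if (word.length() == 1) {
                    logP = UNKNOWN_LOG_PROB; // unknown characters stand alone
                } else {
                    continue; // unknown multi-character strings are not words
                }
                double score = best[start] + logP;
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }

        // Walk the back-pointers from the end of the sentence to the start.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // [我, 是, 中国人]
    }
}
```

      With these invented numbers, "中国人" as one word scores higher than "中国" + "人" or "中" + "国人", so the sentence is split the way a reader would expect. A real analyzer would load its probabilities from dictionary data rather than hard-code them.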

      As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.

        Attachments

        1. analysis-data.zip
          2.02 MB
          Xiaoping Gao
        2. LUCENE-1629-java1.4.patch
          139 kB
          Xiaoping Gao
        3. coredict.mem
          1.51 MB
          Xiaoping Gao
        4. bigramdict.mem
          4.60 MB
          Xiaoping Gao
        5. build-resources.patch
          7 kB
          Uwe Schindler
        6. build-resources.patch
          1 kB
          Uwe Schindler
        7. build-resources-with-folder.patch
          8 kB
          Uwe Schindler
        8. LUCENE-1629-encoding-fix.patch
          0.8 kB
          Uwe Schindler

            People

            • Assignee: Michael McCandless (mikemccand)
            • Reporter: Xiaoping Gao (pinker)
            • Votes: 0
            • Watchers: 6
