Lucene - Core / LUCENE-6111

Add Chinese Word Segmentation Analyzer with Ansj implementation


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.6
    • Fix Version/s: 4.6
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields: New, Patch Available

Description

When I use mahout-0.9, which depends on lucene-4.6, to run the KMeans clustering algorithm, I find that the default analyzer, 'org.apache.lucene.analysis.standard.StandardAnalyzer', handles Chinese text poorly: it only splits Chinese text into single characters (one token per character). The Ansj Chinese word segmentation tool is widely used for tokenizing Chinese documents, and I am willing to add an Ansj-based analyzer to Lucene.
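
For illustration only (not part of the attached patch), the minimal sketch below contrasts the single-character output described above with the word-level segmentation Ansj produces. It assumes the Lucene 4.6 analysis API (StandardAnalyzer with Version.LUCENE_46) and Ansj's ToAnalysis entry point; the class name SegmentationDemo, the field name "content", and the sample sentence are made up for the demo, and the exact Ansj output depends on its version and dictionary.

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SegmentationDemo {

  // Print every token the given Analyzer produces for the text.
  static void printTokens(Analyzer analyzer, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
      }
      ts.end();
      System.out.println();
    }
  }

  public static void main(String[] args) throws IOException {
    String text = "我爱自然语言处理"; // "I love natural language processing"

    // StandardAnalyzer in Lucene 4.6 emits one token per CJK character:
    // [我] [爱] [自] [然] [语] [言] [处] [理]
    printTokens(new StandardAnalyzer(Version.LUCENE_46), text);

    // Ansj performs dictionary-based segmentation and emits whole words,
    // e.g. [我] [爱] [自然语言] [处理] (exact output depends on the dictionary).
    for (Term term : ToAnalysis.parse(text)) {
      System.out.print("[" + term.getName() + "] ");
    }
    System.out.println();
  }
}
{code}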

Attachments

Activity

People

• Assignee: Unassigned
• Reporter: deyinchen
• Votes: 0
• Watchers: 2

Dates

• Created:
• Updated:

Time Tracking

• Estimated: 24h
• Remaining: 24h
• Logged: Not Specified