Lucene - Core / LUCENE-3726

Default KuromojiAnalyzer to use search mode


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
    • Lucene Fields:


      Kuromoji supports an option to segment text in a way more suitable for search,
      by preventing long compound nouns from being emitted as single indexing terms.

      In general, 'how you segment' can be important depending on the application
      (see for some studies on this in Chinese)

      The current algorithm adds a cost penalty, based on some parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc.),
      to long runs of kanji.
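
      To make the idea concrete, here is a rough sketch of a length-based penalty for all-kanji candidates. It is not the Kuromoji implementation: the class name, the helper methods, the linear formula, and both constant values are assumptions chosen for illustration; only the parameter names come from the description above.

```java
final class SearchModePenaltySketch {
    // Placeholder values for illustration only, NOT taken from the Kuromoji source.
    static final int SEARCH_MODE_LENGTH = 2;     // runs up to this length are not penalized
    static final int SEARCH_MODE_PENALTY = 3000; // assumed extra cost per character beyond that

    /** True if the term is non-empty and every code point is a Han (kanji) character. */
    static boolean isAllKanji(String term) {
        if (term.isEmpty()) {
            return false;
        }
        return term.codePoints()
                   .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** Extra cost added to a candidate term; longer kanji runs become less attractive. */
    static int penalty(String term) {
        int length = term.codePointCount(0, term.length());
        if (length > SEARCH_MODE_LENGTH && isAllKanji(term)) {
            return (length - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
        }
        return 0;
    }

    public static void main(String[] args) {
        // Six-kanji compound (kansai kokusai kuukou), written with Unicode escapes.
        System.out.println(penalty("\u95a2\u897f\u56fd\u969b\u7a7a\u6e2f"));
        // Two-kanji word (kuukou): at or below the length threshold.
        System.out.println(penalty("\u7a7a\u6e2f"));
        // Four katakana characters: not kanji, so this sketch adds nothing.
        System.out.println(penalty("\u30ab\u30bf\u30ab\u30ca"));
    }
}
```

      With a penalty like this added to the path cost, the segmenter tends to prefer several shorter terms over one long compound, which is the behavior search mode is after.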

      Some questions (these can be separate future issues if any useful ideas come out):

      • should these parameters continue to be static-final, or configurable?
      • should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
      • is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?
        with a tokenfilter, one idea would be to also preserve the original indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
        from my understanding this tends to help with noun compounds in other languages, because IDF of the original term boosts 'exact' compound matches.
        but does a tokenfilter provide the segmenter enough 'context' to do this properly?
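
      The overlapping-term idea from the tokenfilter question can be sketched in plain Java, without any Lucene classes. Everything here (the Token class, the decompound method, and the assumption that the split points are already known) is hypothetical; a real filter would work on Lucene's attribute API and would need the segmenter to supply the parts. The emit order follows the example in the text: parts first, then the original term with posInc=0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompoundOverlapSketch {
    /** Minimal stand-in for an indexed term plus its position increment. */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + posInc + ")";
        }
    }

    /**
     * Emits each part at its own position, then the original compound with
     * posInc=0 so that it overlaps the last part instead of taking a new position.
     * A term that was not split is passed through unchanged.
     */
    static List<Token> decompound(String original, List<String> parts) {
        List<Token> out = new ArrayList<>();
        if (parts.size() < 2) {
            out.add(new Token(original, 1));
            return out;
        }
        for (String part : parts) {
            out.add(new Token(part, 1));
        }
        out.add(new Token(original, 0));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(decompound("ABCD", Arrays.asList("AB", "CD")));
        // prints [AB(posInc=1), CD(posInc=1), ABCD(posInc=0)]
    }
}
```

      Because the original term stays in the index, a query for the exact compound can still match it directly, which is where the IDF boost mentioned above would come from.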

      Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.

      1. LUCENE-3726.patch
        4 kB
        Christian Moen
      2. LUCENE-3726.patch
        3 kB
        Christian Moen
      3. LUCENE-3726.patch
        2 kB
        Christian Moen
      4. kuromojieval.tar.gz
        2.01 MB
        Christian Moen


        Change history (Field: Original Value -> New Value)

        Uwe Schindler made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]
        Robert Muir made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Assignee: (empty) -> Robert Muir [ rcmuir ]
          Fix Version/s: (empty) -> 3.6 [ 12319070 ]
          Fix Version/s: (empty) -> 4.0 [ 12314025 ]
          Resolution: (empty) -> Fixed [ 1 ]
        Christian Moen made changes -
          Attachment: (empty) -> LUCENE-3726.patch [ 12513280 ]
        Christian Moen made changes -
          Attachment: (empty) -> LUCENE-3726.patch [ 12513279 ]
        Christian Moen made changes -
          Attachment: (empty) -> LUCENE-3726.patch [ 12513278 ]
        Christian Moen made changes -
          Attachment: (empty) -> kuromojieval.tar.gz [ 12512738 ]
        Robert Muir created issue -


          • Assignee:
            Robert Muir
          • Votes:
            0
          • Watchers:
            3


            • Created: