Lucene - Core
LUCENE-3726

Default KuromojiAnalyzer to use search mode


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None


      Kuromoji supports an option to segment text in a way that is more suitable for search,
      by preventing long compound nouns from being emitted as indexing terms: for example,
      a compound such as 関西国際空港 (Kansai International Airport) is split into
      関西 国際 空港 rather than indexed as a single term.

      In general, 'how you segment' can be important depending on the application
      (see, for example, the studies on this for Chinese).

      The current algorithm penalizes the path cost of long runs of kanji, based on
      some parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc.).
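
      As a rough illustration of how such a penalty biases the segmentation, here is a
      minimal sketch; the class, constant names, and values below are hypothetical
      stand-ins for Kuromoji's actual code:

        // Minimal sketch of a length-based search-mode penalty. Constant
        // names/values are assumptions, in the spirit of SEARCH_MODE_PENALTY
        // and SEARCH_MODE_LENGTH above.
        final class SearchModePenalty {
          private static final int KANJI_LENGTH_THRESHOLD = 2; // assumed
          private static final int KANJI_PENALTY = 3000;       // assumed

          /** Extra cost added to a candidate token's path cost in the lattice. */
          static int penalty(String token) {
            if (isAllKanji(token) && token.length() > KANJI_LENGTH_THRESHOLD) {
              // Penalizing long kanji runs makes the Viterbi search prefer
              // paths that split compounds into shorter indexing terms.
              return (token.length() - KANJI_LENGTH_THRESHOLD) * KANJI_PENALTY;
            }
            return 0;
          }

          private static boolean isAllKanji(String s) {
            for (int i = 0; i < s.length(); i++) {
              if (Character.UnicodeBlock.of(s.charAt(i))
                  != Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                return false;
              }
            }
            return true;
          }
        }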

      Some questions (these can be separate future issues if any useful ideas come out):

      • should these parameters remain static final, or become configurable?
      • should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
      • is the Tokenizer the best place to do this, or should we do it in a TokenFilter? Or both?
        With a TokenFilter, one idea would be to also preserve the original indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0); see the sketch after this list.
        From my understanding this tends to help with noun compounds in other languages, because the IDF of the original term boosts 'exact' compound matches.
        But does a TokenFilter give the segmenter enough 'context' to do this properly?
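
      A minimal sketch of that overlapping idea, using a trivial split-in-half rule as a
      hypothetical stand-in for a real segmenter (a production filter would also need to
      fix up offsets):

        import java.io.IOException;
        import java.util.LinkedList;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
        import org.apache.lucene.util.AttributeSource;

        public final class OverlappingDecompoundFilter extends TokenFilter {
          private static final int MIN_COMPOUND_LENGTH = 4; // hypothetical threshold

          private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
          private final PositionIncrementAttribute posIncAtt =
              addAttribute(PositionIncrementAttribute.class);

          private final LinkedList<String> pendingTerms = new LinkedList<String>();
          private final LinkedList<Integer> pendingIncs = new LinkedList<Integer>();
          private AttributeSource.State savedState;

          public OverlappingDecompoundFilter(TokenStream input) {
            super(input);
          }

          @Override
          public boolean incrementToken() throws IOException {
            if (!pendingTerms.isEmpty()) {
              // Emit a buffered token. The original compound comes last with
              // posInc=0 so it overlaps the last part: ABCD -> AB, CD, ABCD(0).
              restoreState(savedState);
              termAtt.setEmpty().append(pendingTerms.removeFirst());
              posIncAtt.setPositionIncrement(pendingIncs.removeFirst().intValue());
              return true;
            }
            if (!input.incrementToken()) {
              return false;
            }
            final String term = termAtt.toString();
            if (term.length() >= MIN_COMPOUND_LENGTH) {
              final int mid = term.length() / 2;
              savedState = captureState();
              pendingTerms.add(term.substring(mid)); // second part, next position
              pendingIncs.add(Integer.valueOf(1));
              pendingTerms.add(term);                // original term, overlapping
              pendingIncs.add(Integer.valueOf(0));
              termAtt.setEmpty().append(term.substring(0, mid)); // first part now
            }
            return true;
          }

          @Override
          public void reset() throws IOException {
            super.reset();
            pendingTerms.clear();
            pendingIncs.clear();
            savedState = null;
          }
        }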

      Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.
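
      For concreteness, the proposed default flip might look roughly like this (a sketch;
      the Mode enum mirrors the segmenter modes discussed here, but treat the names as
      illustrative rather than the exact API):

        // Sketch only: make search-mode segmentation the out-of-the-box default,
        // so KuromojiAnalyzer decompounds long compounds unless a user
        // explicitly opts into NORMAL segmentation.
        final class SegmenterDefaults {
          enum Mode { NORMAL, SEARCH }

          static final Mode DEFAULT_MODE = Mode.SEARCH; // was NORMAL
        }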

      1. LUCENE-3726.patch
        4 kB
        Christian Moen
      2. LUCENE-3726.patch
        3 kB
        Christian Moen
      3. LUCENE-3726.patch
        2 kB
        Christian Moen
      4. kuromojieval.tar.gz
        2.01 MB
        Christian Moen



          • Assignee:
            Robert Muir
          • Votes:
            0
          • Watchers:
            3

