Lucene - Core / LUCENE-6837

Add N-best output capability to JapaneseTokenizer

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.3
    • Fix Version/s: 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Japanese morphological analyzers often generate mis-segmented tokens. N-best output reduces the impact of mis-segmentation on search results. It is also more meaningful than character N-grams, and it increases the hit count.

      With N-best output, you can get decompounded tokens (e.g. "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), depending on the dictionary and N-best parameter settings.
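      To illustrate why N-best output yields overlapping tokens, here is a minimal, self-contained sketch. It is not the code from the attached patch and does not use the Lucene API: the class name `NBestSketch`, the toy dictionary, its costs, and the `margin` parameter are all assumptions made for illustration. It enumerates every segmentation by brute force (a real analyzer would search a lattice instead), keeps the segmentations whose total cost is within `margin` of the best one, and emits the union of their tokens.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class NBestSketch {
    // Hypothetical toy dictionary: surface form -> cost (lower is better).
    // These values are invented for the example, not taken from a real dictionary.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("数学", 2);
        DICT.put("部", 3);
        DICT.put("部長", 2);
        DICT.put("長谷川", 2);
        DICT.put("谷川", 4);
    }

    // One complete segmentation: token spans [start, end) plus the total cost.
    static final class Path {
        final List<int[]> spans;
        final int cost;
        Path(List<int[]> spans, int cost) { this.spans = spans; this.cost = cost; }
    }

    // Enumerate every way to cover text[start..] with dictionary words.
    static void enumerate(String text, int start, List<int[]> spans, int cost, List<Path> out) {
        if (start == text.length()) {
            out.add(new Path(new ArrayList<>(spans), cost));
            return;
        }
        for (int end = start + 1; end <= text.length(); end++) {
            Integer wordCost = DICT.get(text.substring(start, end));
            if (wordCost == null) continue;
            spans.add(new int[] {start, end});
            enumerate(text, end, spans, cost + wordCost, out);
            spans.remove(spans.size() - 1); // backtrack
        }
    }

    // Union of tokens from all segmentations within `margin` of the best cost,
    // ordered by start offset, then end offset.
    static List<String> nBestTokens(String text, int margin) {
        List<Path> paths = new ArrayList<>();
        enumerate(text, 0, new ArrayList<>(), 0, paths);
        int best = Integer.MAX_VALUE;
        for (Path p : paths) best = Math.min(best, p.cost);
        TreeSet<int[]> spans = new TreeSet<>(
            Comparator.<int[]>comparingInt(s -> s[0]).thenComparingInt(s -> s[1]));
        for (Path p : paths) {
            if (p.cost - best <= margin) spans.addAll(p.spans);
        }
        List<String> tokens = new ArrayList<>();
        for (int[] s : spans) tokens.add(text.substring(s[0], s[1]));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(nBestTokens("数学部長谷川", 0)); // 1-best only
        System.out.println(nBestTokens("数学部長谷川", 1)); // best plus near-best paths
    }
}
```

      With these toy costs, a margin of 0 keeps only the single best segmentation, while a margin of 1 also admits the second segmentation, so the output contains tokens that overlap in the original text. Which tokens actually appear in practice depends on the real dictionary and parameter settings, as noted above.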

        Attachments

        1. LUCENE-6837.patch
          51 kB
          Christian Moen
        2. LUCENE-6837.patch
          51 kB
          KONNO, Hiroharu
        3. LUCENE-6837.patch
          51 kB
          Christian Moen
        4. LUCENE-6837.patch
          42 kB
          Christian Moen
        5. LUCENE-6837.patch
          32 kB
          KONNO, Hiroharu
        6. LUCENE-6837 for 5.4.zip
          7.84 MB
          Ippei UKAI

            People

            • Assignee:
              cm Christian Moen
              Reporter:
              hkonno KONNO, Hiroharu
            • Votes:
              3
              Watchers:
              14
