Lucene - Core / LUCENE-6837

Add N-best output capability to JapaneseTokenizer

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.3
    • Fix Version/s: 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Japanese morphological analyzers often generate mis-segmented tokens. N-best output reduces the impact of mis-segmentation on search results. N-best output is more meaningful than character N-grams, and it also increases the hit count.

      With N-best output, you can get decompounded tokens (e.g. "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), depending on the dictionary and N-best parameter settings.
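
      The idea can be illustrated with a small, self-contained sketch. It is not the code from the attached patch and does not use Lucene; it is a toy model in which the dictionary, the word costs, and the names `TOY_COSTS`, `segmentations`, and `n_best_tokens` are all hypothetical. It enumerates every dictionary segmentation of the input, keeps those whose total cost is within a margin of the best one, and emits the union of their tokens. A real analyzer would also use connection costs and a lattice search rather than brute-force enumeration.

```python
# Hypothetical toy dictionary: surface form -> word cost (lower is better).
# These values are invented for illustration and are not taken from any real dictionary.
TOY_COSTS = {"数学": 1, "部": 2, "部長": 1, "長谷川": 1, "谷川": 1}


def segmentations(text, costs):
    """Yield (total_cost, [(start, end), ...]) for every way to cover
    `text` completely with dictionary words."""
    def walk(pos):
        if pos == len(text):
            yield 0, []
            return
        for end in range(pos + 1, len(text) + 1):
            word = text[pos:end]
            if word in costs:
                for rest_cost, rest in walk(end):
                    yield costs[word] + rest_cost, [(pos, end)] + rest
    yield from walk(0)


def n_best_tokens(text, costs, n_best_cost):
    """Return the tokens of all segmentations whose cost is at most
    `n_best_cost` above the cheapest one, ordered by position."""
    paths = list(segmentations(text, costs))
    if not paths:
        return []
    best = min(cost for cost, _ in paths)
    spans = set()
    for cost, path in paths:
        if cost - best <= n_best_cost:
            spans.update(path)
    return [text[start:end] for start, end in sorted(spans)]


# Margin 0 keeps only the single best segmentation.
print(n_best_tokens("数学部長谷川", TOY_COSTS, 0))
# A margin of 1 also admits the second-best path, so overlapping tokens appear.
print(n_best_tokens("数学部長谷川", TOY_COSTS, 1))
```

      With a margin of 0 only the cheapest segmentation survives; widening the margin admits the alternative path, which is how tokens such as "部長" and "長谷川" can both be emitted for the same characters.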

      Attachments

        1. LUCENE-6837.patch
          51 kB
          Christian Moen
        2. LUCENE-6837.patch
          51 kB
          KONNO, Hiroharu
        3. LUCENE-6837.patch
          51 kB
          Christian Moen
        4. LUCENE-6837.patch
          42 kB
          Christian Moen
        5. LUCENE-6837.patch
          32 kB
          KONNO, Hiroharu
        6. LUCENE-6837 for 5.4.zip
          7.84 MB
          Ippei UKAI

          People

            Assignee: Christian Moen
            Reporter: KONNO, Hiroharu
            Votes: 3
            Watchers: 13