Lucene - Core
LUCENE-3730

Improved Kuromoji search mode segmentation/decompounding

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      Kuromoji has a segmentation mode for search that uses a heuristic to promote additional segmentation of long candidate tokens to get a decompounding effect. This heuristic has been improved. Patch is coming up.

        Activity

        Christian Moen added a comment -

        Attached is a patch for trunk that improves the heuristic. Search segmentation tests/examples are in search-segmentation-tests.txt and are validated by TestSearchMode.

        Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.

        I've also moved the ASF license header in TestExtendedMode.java to the right place.

        Christian Moen added a comment -

        If you want to try the new search mode, there's a simple Kuromoji web interface available at http://atilika.org/kuromoji that may be useful. After entering some text and pressing enter, click "normal mode" to switch to "search mode" and compare the segmentation modes for the given input.

        Robert Muir added a comment -

        Patch looks good to me... so the basic idea is that we apply a different penalty based on
        whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?

        Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.

        I think this is ok for now.
        Long term (if there end up being different values for other dictionaries), we can make these conditional on dictionary type:
        either at build time (recording these values in the dictionary) or, better, by recording the dictionary type itself and making
        these conditional at run time based on dictionary type.

        By recording the type, we would also be able to use e.g. assumeTrue(dictionaryType == IPADIC) in unit tests and things like that,
        and who knows what else, but let's not worry about it here.
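
        As a rough illustration of the run-time option described above, the following sketch records a dictionary type and looks up tuning values from it. Everything here is hypothetical: the class, the enum, its constants, the field names, and the numeric values are assumptions made for illustration and are not taken from the patch or from Kuromoji's API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these names exist in Kuromoji.
final class DictionaryTuningSketch {

    // Assumed set of dictionary types; only IPADIC is mentioned in the issue.
    enum DictionaryType { IPADIC, OTHER }

    // Assumed container for the search-mode tuning parameters.
    static final class SearchModeTuning {
        final int kanjiPenalty;
        final int otherPenalty;

        SearchModeTuning(int kanjiPenalty, int otherPenalty) {
            this.kanjiPenalty = kanjiPenalty;
            this.otherPenalty = otherPenalty;
        }
    }

    private static final Map<DictionaryType, SearchModeTuning> TUNINGS =
        new EnumMap<>(DictionaryType.class);

    static {
        // Placeholder values; only IPADIC has been tuned according to the issue.
        TUNINGS.put(DictionaryType.IPADIC, new SearchModeTuning(3000, 1700));
    }

    // Returns the tuning for a dictionary type, or empty if none has been tuned yet.
    static Optional<SearchModeTuning> tuningFor(DictionaryType type) {
        return Optional.ofNullable(TUNINGS.get(type));
    }
}
```

        With the type recorded like this, a test could skip itself for untuned dictionaries, for example by passing `tuningFor(type).isPresent()` to JUnit's `assumeTrue`.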

        Christian Moen added a comment -

        Patch looks good to me... so the basic idea is that we apply a different penalty based on
        whether the text is kanji or not, rather than just a single penalty of 10000 (plus some parameter tuning)?

        Thanks a lot, Robert. That's correct.

        I agree completely regarding other dictionary support.
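
        The idea confirmed above, a penalty that depends on whether the candidate token is kanji, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the patch: the class and method names are hypothetical, the length thresholds and penalty values are placeholders, and treating a token as kanji only when every character is in the Han script is a simplification chosen for the example.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch of a script-dependent length penalty for search mode.
final class SearchModePenaltySketch {

    // Placeholder parameters; the patch's actual tuned values are not stated in this issue.
    static final int KANJI_LENGTH_THRESHOLD = 2;
    static final int OTHER_LENGTH_THRESHOLD = 7;
    static final int KANJI_PENALTY = 3000;
    static final int OTHER_PENALTY = 1700;

    // Assumption: "kanji" means every code point belongs to the Han script.
    static boolean isAllKanji(String surface) {
        if (surface.isEmpty()) {
            return false;
        }
        return surface.codePoints()
            .allMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN);
    }

    // Extra cost added to a candidate token; longer tokens cost more,
    // which nudges the segmenter toward splitting long compounds.
    static int penalty(String surface) {
        int length = surface.codePointCount(0, surface.length());
        if (isAllKanji(surface)) {
            return length > KANJI_LENGTH_THRESHOLD
                ? (length - KANJI_LENGTH_THRESHOLD) * KANJI_PENALTY
                : 0;
        }
        return length > OTHER_LENGTH_THRESHOLD
            ? (length - OTHER_LENGTH_THRESHOLD) * OTHER_PENALTY
            : 0;
    }
}
```

        Under these placeholder values, a six-character all-kanji candidate is penalized while a six-character katakana candidate is not, which is the kind of asymmetry a single flat penalty of 10000 cannot express.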

        Robert Muir added a comment -

        Thanks Christian!


          People

          • Assignee: Robert Muir
          • Reporter: Christian Moen
          • Votes: 0
          • Watchers: 2
