Lucene - Core / LUCENE-8966

KoreanTokenizer should split unknown words on digits

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (9.0), 8.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Since https://issues.apache.org/jira/browse/LUCENE-8548, the Korean tokenizer groups the characters of an unknown word if they belong to the same script or an inherited one. This is fine for inputs like Мoscow (with a Cyrillic М and the rest in Latin), but the rule does not work well on digits, because digits are considered common to all other scripts. For instance, the input "44사이즈" is kept as a single token even though "사이즈" is in the dictionary. We should restore the original behavior and split any unknown word where a digit is followed by a character of another type.

      This issue was first discovered in https://github.com/elastic/elasticsearch/issues/46365
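
To make the problem concrete, here is a minimal sketch in plain Java of the grouping rule described above and of the proposed digit split. It is not the code from LUCENE-8966.patch and does not use the Lucene API: the class `UnknownWordSplitSketch`, its methods, and the `breakAfterDigit` flag are hypothetical names introduced for illustration, and the pairwise script comparison is a simplifying assumption about how the tokenizer tracks scripts. The only real API it relies on is `java.lang.Character.UnicodeScript`, which reports digits as `COMMON`.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real logic lives in Lucene's KoreanTokenizer.
final class UnknownWordSplitSketch {

  // COMMON (digits, punctuation) and INHERITED (combining marks) never
  // force a script boundary on their own in this sketch.
  static boolean isCommonOrInherited(UnicodeScript script) {
    return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
  }

  // Assumed grouping rule: adjacent code points stay together when their
  // scripts match or when either one is COMMON/INHERITED.
  static boolean sameGroup(int prev, int cur) {
    UnicodeScript a = UnicodeScript.of(prev);
    UnicodeScript b = UnicodeScript.of(cur);
    return a == b || isCommonOrInherited(a) || isCommonOrInherited(b);
  }

  // Splits text into groups. With breakAfterDigit set, a boundary is also
  // placed wherever a digit is followed by a non-digit.
  static List<String> split(String text, boolean breakAfterDigit) {
    List<String> parts = new ArrayList<>();
    if (text.isEmpty()) {
      return parts;
    }
    int start = 0;
    int prev = text.codePointAt(0);
    int i = Character.charCount(prev);
    while (i < text.length()) {
      int cur = text.codePointAt(i);
      boolean boundary = !sameGroup(prev, cur);
      if (breakAfterDigit && Character.isDigit(prev) && !Character.isDigit(cur)) {
        boundary = true;
      }
      if (boundary) {
        parts.add(text.substring(start, i));
        start = i;
      }
      prev = cur;
      i += Character.charCount(cur);
    }
    parts.add(text.substring(start));
    return parts;
  }

  public static void main(String[] args) {
    // Digits are COMMON, so they merge with the Hangul that follows.
    System.out.println(split("44사이즈", false)); // [44사이즈]
    // Breaking after digits exposes "사이즈" as its own piece.
    System.out.println(split("44사이즈", true));  // [44, 사이즈]
  }
}
```

With the extra break, "사이즈" becomes a separate candidate that a dictionary lookup could match, which is the outcome this issue asks for.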

        Attachments

        1. LUCENE-8966.patch
          3 kB
          Jim Ferenczi
        2. LUCENE-8966.patch
          3 kB
          Jim Ferenczi

            People

            • Assignee:
              Unassigned
            • Reporter:
              Jim Ferenczi (jim.ferenczi)