Lucene - Core
LUCENE-7393

Incorrect ICU tokenization on a Southeast Asian language

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5
    • Fix Version/s: 6.2, 7.0
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Ubuntu
    • Lucene Fields: New

      Description

      Lucene 4.10.3 correctly tokenizes this syllable into one token. However, in Lucene 5.5.0 it ends up as two tokens, which is incorrect. Could you let me know whether the segmentation rules are implemented by native speakers of the particular language? In this example it is the Myanmar language; I understand that Lao, Khmer, and Myanmar all fall into the ICU category. Thanks a lot.
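
      For reference, the same behavior can be reproduced directly against
      Lucene's ICUTokenizer, bypassing Elasticsearch's _analyze layer. Below
      is a minimal repro sketch, assuming only that the lucene-analyzers-icu
      module is on the classpath (the class name is mine):

      import java.io.StringReader;

      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

      public class IcuTokenizeRepro {
        public static void main(String[] args) throws Exception {
          // The Myanmar syllable from this report.
          String text = "နည်";
          try (Tokenizer tok = new ICUTokenizer()) {
            tok.setReader(new StringReader(text));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            OffsetAttribute off = tok.addAttribute(OffsetAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
              // One line per emitted token: text plus [start, end) offsets.
              System.out.printf("%s [%d,%d]%n", term, off.startOffset(), off.endOffset());
            }
            tok.end();
          }
        }
      }

      Against 4.10.3 this should print a single token, and against 5.5.0 two,
      mirroring the two _analyze responses below.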

      Example 4.10.3

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
         "tokens": [
            {
               "token": "နည်",
               "start_offset": 1,
               "end_offset": 4,
               "type": "<ALPHANUM>",
               "position": 1
            }
         ]
      }
      

      Example 5.5.0

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
        "tokens": [
          {
            "token": "န",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
          },
          {
            "token": "ည်",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 1
          }
        ]
      }
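
      As far as I can tell from the fix versions (6.2, 7.0), the resolution
      made Myanmar segmentation configurable rather than reverting it. Below
      is a minimal sketch of opting back into syllable-level Myanmar tokens,
      assuming the two-argument DefaultICUTokenizerConfig(cjkAsWords,
      myanmarAsWords) constructor exposed by the fix versions (the same flag
      should be reachable as myanmarAsWords="false" on ICUTokenizerFactory):

      import java.io.StringReader;

      import org.apache.lucene.analysis.Tokenizer;
      import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      public class MyanmarSyllableTokens {
        public static void main(String[] args) throws Exception {
          // cjkAsWords = true keeps the default CJK handling;
          // myanmarAsWords = false requests syllables instead of
          // dictionary-segmented words for Myanmar text.
          try (Tokenizer tok =
              new ICUTokenizer(new DefaultICUTokenizerConfig(true, false))) {
            tok.setReader(new StringReader("နည်"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
              System.out.println(term); // expected: the single token နည်
            }
            tok.end();
          }
        }
      }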
      

            People

            • Assignee: Unassigned
            • Reporter: aungmaw AM
            • Votes: 0
            • Watchers: 3
