Lucene - Core / LUCENE-7393

Incorrect ICUTokenization on South East Asian Language

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5
    • Fix Version/s: 6.2, 7.0
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Ubuntu
    • Lucene Fields: New

    Description

      Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in Lucene 5.5.0 it ends up as two tokens, which is incorrect. Could you let me know whether the segmentation rules were implemented by native speakers of each language? In this particular example it is the Myanmar language; I understand that Lao, Khmer, and Myanmar fall into the ICU category. Thanks a lot. (A Lucene-level reproduction is sketched after the two examples below.)

      Example 4.10.3

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
         "tokens": [
            {
               "token": "နည်",
               "start_offset": 1,
               "end_offset": 4,
               "type": "<ALPHANUM>",
               "position": 1
            }
         ]
      }
      

      Example 5.5.0

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
        "tokens": [
          {
            "token": "န",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
          },
          {
            "token": "ည်",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 1
          }
        ]
      }
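
      The split can also be reproduced directly against Lucene's ICUTokenizer (from the lucene-analyzers-icu module), without Elasticsearch in between. This is a minimal sketch, assuming that module is on the classpath; the class name IcuMyanmarRepro and the println reporting are illustrative, not from the report:

      import java.io.StringReader;

      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

      public class IcuMyanmarRepro {
        public static void main(String[] args) throws Exception {
          // Default configuration: UAX#29 word breaks, plus ICU's
          // script-specific segmentation for South East Asian text.
          ICUTokenizer tokenizer = new ICUTokenizer();
          tokenizer.setReader(new StringReader("နည်"));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
          OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
          tokenizer.reset();
          while (tokenizer.incrementToken()) {
            // One line per emitted token; against 5.5.0 this prints two
            // tokens ("န" and "ည်") for the single syllable, matching the
            // 5.5.0 output above.
            System.out.println(term + " [" + offset.startOffset()
                + "," + offset.endOffset() + ")");
          }
          tokenizer.end();
          tokenizer.close();
        }
      }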
      

      Attachments

        1. LUCENE-7393.patch (19 kB, Robert Muir)

          People

            Assignee: Unassigned
            Reporter: aungmaw
            Votes: 0
            Watchers: 3
