Lucene - Core / LUCENE-7393

Incorrect ICUTokenization on South East Asian Language

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5
    • Fix Version/s: 6.2, 7.0
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Ubuntu
    • Lucene Fields: New

    Description

      Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in Lucene 5.5.0 it ends up as two tokens, which is incorrect. Could you let me know whether the segmentation rules were implemented by native speakers of each language? In this particular example it is the Myanmar language; I understand that Lao, Khmer, and Myanmar fall into the ICU category. Thanks a lot. (A Lucene-level reproduction is sketched after the two examples below.)

      Example 4.10.3

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
         "tokens": [
            {
               "token": "နည်",
               "start_offset": 1,
               "end_offset": 4,
               "type": "<ALPHANUM>",
               "position": 1
            }
         ]
      }
      

      Example 5.5.0

      GET _analyze?tokenizer=icu_tokenizer&text="နည်"
      {
        "tokens": [
          {
            "token": "န",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
          },
          {
            "token": "ည်",
            "start_offset": 1,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 1
          }
        ]
      }
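
      The split can also be reproduced directly against Lucene's ICUTokenizer (from the lucene-analyzers-icu module), without Elasticsearch in between. This is a minimal sketch, assuming that module is on the classpath; the class name IcuMyanmarRepro and the println reporting are illustrative, not from the report:

      import java.io.StringReader;

      import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

      public class IcuMyanmarRepro {
        public static void main(String[] args) throws Exception {
          // Default configuration: UAX#29 word breaks, plus ICU's
          // script-specific segmentation for South East Asian text.
          ICUTokenizer tokenizer = new ICUTokenizer();
          tokenizer.setReader(new StringReader("နည်"));
          CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
          OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
          tokenizer.reset();
          while (tokenizer.incrementToken()) {
            // One line per emitted token; against 5.5.0 this prints two
            // tokens ("န" and "ည်") for the single syllable, matching the
            // 5.5.0 output above.
            System.out.println(term + " [" + offset.startOffset()
                + "," + offset.endOffset() + ")");
          }
          tokenizer.end();
          tokenizer.close();
        }
      }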
      

      Attachments

        1. LUCENE-7393.patch (19 kB, Robert Muir)

          People

            Assignee: Unassigned
            Reporter: aungmaw
            Votes: 0
            Watchers: 3
