Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 5.5
- Fix Version/s: None
- Environment: Ubuntu
- Lucene Fields: New
Description
Lucene 4.10.3 correctly tokenizes this syllable into one token. However, in Lucene 5.5.0 it ends up as two tokens, which is incorrect. Could you let me know whether the segmentation rules are implemented by native speakers of a particular language? In this particular example, it is the Myanmar language. I understand that Lao, Khmer, and Myanmar fall into the ICU category. Thanks a lot.
Example 4.10.3
GET _analyze?tokenizer=icu_tokenizer&text="နည်"

{
  "tokens": [
    {
      "token": "နည်",
      "start_offset": 1,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Example 5.5.0
GET _analyze?tokenizer=icu_tokenizer&text="နည်"

{
  "tokens": [
    {
      "token": "န",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ည်",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
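For anyone who wants to reproduce this without Elasticsearch in the loop, here is a minimal sketch that runs Lucene's ICUTokenizer directly on the same syllable and prints each emitted token with its offsets. It is not part of the original report: it assumes the lucene-analyzers-icu module on the classpath and the Lucene 5.x Tokenizer API (no-arg constructor plus setReader); the class name MyanmarTokenizeCheck is made up for illustration.

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class MyanmarTokenizeCheck {
    public static void main(String[] args) throws Exception {
        // The Myanmar syllable from the report: နည် (U+1014 U+100A U+103A)
        String text = "\u1014\u100A\u103A";
        try (ICUTokenizer tokenizer = new ICUTokenizer()) {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.reset();
            // One line per token: on 4.10.3 the report expects a single token,
            // on 5.5.0 it shows two.
            while (tokenizer.incrementToken()) {
                System.out.printf("token=%s start=%d end=%d%n",
                        term.toString(), offsets.startOffset(), offsets.endOffset());
            }
            tokenizer.end();
        }
    }
}

Running the same class against the two lucene-analyzers-icu versions should make it easy to see whether the split comes from Lucene's ICU segmentation itself rather than from the Elasticsearch layer.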