Lucene - Core
LUCENE-8526

StandardTokenizer doesn't separate Hangul characters from other non-CJK chars


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Bug
    • Lucene Fields: New

    Description

      This was first reported at https://github.com/elastic/elasticsearch/issues/34285.
      I don't know if this is the expected behavior, but the StandardTokenizer does not split words
      that are composed of a mix of non-CJK characters and Hangul syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer and marked as an alphanumeric group. This breaks the CJKBigram token filter, which will not build bigrams on such groups. Other CJK characters are correctly split when they are mixed with another alphabet, so I'd expect the same for Hangul.
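
      A minimal sketch that shows the behavior (assuming a recent Lucene release on the classpath; the class name and sample strings are only illustrative):

      import java.io.StringReader;

      import org.apache.lucene.analysis.standard.StandardTokenizer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
      import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

      public class HangulMixedTokenDemo {
        public static void main(String[] args) throws Exception {
          // Tokenize text that mixes Hangul syllables with ASCII digits/letters,
          // plus an ideographic example for comparison.
          try (StandardTokenizer tokenizer = new StandardTokenizer()) {
            tokenizer.setReader(new StringReader("한국2018 한국abc 中国2018"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
              // Prints each emitted token with its type; the mixed Hangul+ASCII
              // strings come out as single tokens, while the ideographic
              // characters are split off from the trailing digits.
              System.out.println(term + " -> " + type.type());
            }
            tokenizer.end();
          }
        }
      }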


    People

      Assignee: Unassigned
      Reporter: Jim Ferenczi
      Votes: 0
      Watchers: 2
