[LUCENE-444] StandardTokenizer loses Korean characters - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.9
Component/s: modules/analysis
Labels: None

Description

When StandardAnalyzer, especially StandardTokenizer, is used on a Korean text stream, StandardTokenizer ignores the Korean characters. This happens because the CJK token definition in the JavaCC grammar file StandardTokenizer.jj does not include a range covering the Korean syllables described in the Unicode character map.
This patch adds one line to StandardTokenizer.jj for 0xAC00~0xD7AF, the Korean syllables range.
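
As an illustration of the range in question, here is a minimal sketch in plain Java that checks whether characters fall inside 0xAC00~0xD7AF. It is not the patch itself and does not use any Lucene API; the class name `HangulRangeCheck` and its methods are hypothetical and exist only for this example.

```java
public class HangulRangeCheck {
    // Range quoted in the issue description (Unicode Hangul Syllables block).
    static final char HANGUL_START = '\uAC00';
    static final char HANGUL_END = '\uD7AF';

    // Hypothetical helper: true if c lies in the Korean syllables range.
    public static boolean inHangulSyllableRange(char c) {
        return c >= HANGUL_START && c <= HANGUL_END;
    }

    // Hypothetical helper: true if every char of text lies in that range.
    public static boolean allInRange(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!inHangulSyllableRange(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three Korean syllables: U+D55C, U+AD6D, U+C5B4.
        String korean = "\uD55C\uAD6D\uC5B4";
        System.out.println(allInRange(korean));                // prints "true"
        // U+4E2D is a CJK ideograph, outside the Korean syllables range.
        System.out.println(inHangulSyllableRange('\u4E2D'));   // prints "false"
    }
}
```

A tokenizer grammar whose CJK character class omits this range would have no rule matching such characters, which is consistent with the behavior reported above.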

Attachments

StandardTokenizer_Korean.patch
04/Oct/05 23:28
0.3 kB
Cheolgoo Kang

People

Assignee: Unassigned
Reporter: Cheolgoo Kang
Votes: 0
Watchers: 0

Dates

Created: 04/Oct/05 23:25
Updated: 28/Nov/24 16:14
Resolved: 05/Oct/05 12:54