[LUCENE-478] CJK char list - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: None
Component/s: modules/analysis
Labels: None

Description

The character list in the CJK section of StandardTokenizer.jj seems to be incomplete. The following is a more complete list:

< CJK: // non-alphabets
[
"\u1100"-"\u11ff",
"\u3040"-"\u30ff",
"\u3130"-"\u318f",
"\u31f0"-"\u31ff",
"\u3300"-"\u337f",
"\u3400"-"\u4dbf",
"\u4e00"-"\u9fff",
"\uac00"-"\ud7a3",
"\uf900"-"\ufaff",
"\uff65"-"\uffdc"
]
>
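
To make the proposed coverage easier to check, here is a minimal Java sketch that mirrors the ranges above as a lookup table. The class name `CjkRanges` and the method `isCjk` are hypothetical and are not part of Lucene; the Unicode block names in the comments are annotations added here, not part of the original report. It only tests whether a single `char` falls inside one of the listed ranges and says nothing about how the generated tokenizer behaves.

```java
public final class CjkRanges {
    // Inclusive [start, end] pairs copied from the proposed CJK token definition.
    // Block names in the comments are annotations, not part of the grammar.
    private static final int[][] RANGES = {
        {0x1100, 0x11ff}, // Hangul Jamo
        {0x3040, 0x30ff}, // Hiragana and Katakana
        {0x3130, 0x318f}, // Hangul Compatibility Jamo
        {0x31f0, 0x31ff}, // Katakana Phonetic Extensions
        {0x3300, 0x337f}, // first half of the CJK Compatibility block
        {0x3400, 0x4dbf}, // CJK Unified Ideographs Extension A
        {0x4e00, 0x9fff}, // CJK Unified Ideographs
        {0xac00, 0xd7a3}, // Hangul Syllables
        {0xf900, 0xfaff}, // CJK Compatibility Ideographs
        {0xff65, 0xffdc}, // halfwidth Katakana and Hangul forms
    };

    private CjkRanges() {}

    // Hypothetical helper: true if c lies in any of the proposed ranges.
    static boolean isCjk(char c) {
        for (int[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCjk((char) 0x4e2d)); // CJK ideograph: true
        System.out.println(isCjk((char) 0x3042)); // Hiragana A: true
        System.out.println(isCjk((char) 0xac00)); // first Hangul syllable: true
        System.out.println(isCjk('a'));           // Latin letter: false
    }
}
```

A table like this is convenient for spot-checking boundary characters (for example U+D7A3 versus U+D7A4) before regenerating the tokenizer from the grammar.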

Attachments

StandardTokenizer.jj.diff, 07/Jan/06 01:32, 1.0 kB, Steven Rowe
StandardTokenizer.jj.diff, 05/Jan/06 10:12, 1.0 kB, Steven Rowe

People

Assignee: Otis Gospodnetic

Reporter: John Wang

Votes: 2

Watchers: 0

Dates

Created: 08/Dec/05 09:54

Updated: 28/Aug/22 11:24

Resolved: 13/Aug/06 07:24