[LUCENE-1689] supplementary character handling - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

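This is for Java 5. Java 5 is based on Unicode 4, which means variable-width encoding: a supplementary character (one outside the BMP) is represented by two chars, a surrogate pair.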

Supplementary character support should be fixed for code that works with char/char[].
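As a rough illustration of why char-based code goes wrong, the following sketch builds one supplementary character and inspects it both per char and per code point. The class name `SupplementaryDemo` is made up for this example and is not part of Lucene; only standard `java.lang` calls available since Java 5 are used.

```java
// Illustrative sketch only; SupplementaryDemo is a hypothetical name, not Lucene code.
final class SupplementaryDemo {
    // U+10400 DESERET CAPITAL LETTER LONG I, a letter outside the BMP.
    static final String DESERET_LONG_I = new String(Character.toChars(0x10400));

    /** Number of UTF-16 code units (Java chars) in s. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = DESERET_LONG_I;
        System.out.println(charCount(s));                           // 2
        System.out.println(codePointCount(s));                      // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        // Each half of the pair is not a letter on its own...
        System.out.println(Character.isLetter(s.charAt(0)));        // false
        // ...but the full code point is.
        System.out.println(Character.isLetter(s.codePointAt(0)));   // true
    }
}
```

A predicate that tests each char separately would therefore reject both halves of the pair, which is how a char-based tokenizer can end up removing supplementary characters.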

For example:
StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't remove supplementary characters, or be modified to look for surrogates and behave correctly.
LowerCaseFilter should be modified to lowercase supplementary characters correctly.
CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
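One possible shape for the lowercasing change is sketched below: walk the buffer by code point rather than by char. This is a minimal sketch, not the attached patch; the class `CodePointLowerCase` and its method are hypothetical names. It assumes that lowercasing a code point never changes how many chars it occupies, so the result can be written back in place.

```java
// Illustrative sketch only; CodePointLowerCase is a hypothetical name, not Lucene code.
final class CodePointLowerCase {
    /**
     * Lowercases buffer[0..length) in place, one code point at a time.
     * Assumption: the lowercased code point has the same UTF-16 width as the original.
     */
    static void toLowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // Decodes a surrogate pair if one starts at i, otherwise returns the single char.
            int codePoint = Character.codePointAt(buffer, i, length);
            // toChars writes the result at offset i and returns how many chars it wrote (1 or 2).
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }

    public static void main(String[] args) {
        // "A", then U+10400 (Deseret capital), then "b".
        char[] buffer = ("A" + new String(Character.toChars(0x10400)) + "b").toCharArray();
        toLowerCase(buffer, buffer.length);
        // The supplementary character is now U+10428 (Deseret small).
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 1)));
    }
}
```

An unpaired surrogate is returned unchanged by `Character.toLowerCase(int)`, so malformed input passes through untouched rather than throwing.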

In all of these cases, code should remain optimized for the BMP case; supplementary characters should be the exception, but still work.
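A BMP-first loop might look roughly like the sketch below: a cheap range check keeps the common single-char path free of decoding, and only chars in the surrogate range take the slower code point path. The class `BmpFastPathScanner`, the `endOfToken` helper, and the letter-only `isTokenChar(int)` predicate are all hypothetical and chosen for illustration.

```java
// Illustrative sketch only; all names here are hypothetical, not Lucene code.
final class BmpFastPathScanner {
    /** An int-based predicate in the spirit of the proposed isTokenChar() change. */
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /** Returns the end offset (exclusive) of the run of token characters starting at start. */
    static int endOfToken(char[] buffer, int start, int length) {
        int i = start;
        while (i < length) {
            char c = buffer[i];
            if (c < Character.MIN_SURROGATE || c > Character.MAX_SURROGATE) {
                // Common case: a BMP character, tested directly without decoding.
                if (!isTokenChar(c)) {
                    break;
                }
                i++;
            } else {
                // Rare case: decode a possible surrogate pair into one code point.
                int codePoint = Character.codePointAt(buffer, i, length);
                if (!isTokenChar(codePoint)) {
                    break; // also reached for an unpaired surrogate
                }
                i += Character.charCount(codePoint);
            }
        }
        return i;
    }

    public static void main(String[] args) {
        char[] buffer = ("ab" + new String(Character.toChars(0x10400)) + " c").toCharArray();
        System.out.println(endOfToken(buffer, 0, buffer.length)); // 4
    }
}
```

The range check uses `Character.MIN_SURROGATE` and `Character.MAX_SURROGATE`, which exist in Java 5, rather than `Character.isSurrogate(char)`, which was added later.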

Attachments


testCurrentBehavior.txt (15/Jun/09 01:12, 8 kB, Robert Muir)
LUCENE-1689.patch (31/Jul/09 22:21, 7 kB, Robert Muir)
LUCENE-1689.patch (09/Aug/09 04:12, 19 kB, Robert Muir)
LUCENE-1689.patch (09/Aug/09 12:42, 52 kB, Robert Muir)
LUCENE-1689_lowercase_example.txt (12/Jun/09 18:48, 1.0 kB, Robert Muir)

Issue Links

incorporates

LUCENE-2847 Support all of unicode in StandardTokenizer (Closed)
LUCENE-2068 fix reverseStringFilter for unicode 4.0 (Closed)
LUCENE-2183 Supplementary Character Handling in CharTokenizer (Closed)
LUCENE-2069 fix LowerCaseFilter for unicode 4.0 (Closed)
LUCENE-2070 document LengthFilter wrt Unicode 4.0 (Closed)

is related to

LUCENE-2094 Prepare CharArraySet for Unicode 4.0 (Closed)

People

Assignee: Unassigned

Reporter: Robert Muir

Votes: 0

Watchers: 2

Dates

Created: 12/Jun/09 18:35

Updated: 28/Aug/22 12:02

Resolved: 12/Jan/13 23:06