Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Lucene Fields: New
Description
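This is for Java 5. Java 5 is based on Unicode 4, which means a variable-width encoding: a supplementary character (one outside the BMP) is stored as two chars, a surrogate pair.
Supplementary character support should be fixed for code that works with char/char[].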
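To make the variable-width point concrete, here is a minimal sketch using only `java.lang` methods available since Java 5. The class name `SurrogateDemo` and its helper methods are hypothetical and exist only for illustration; they are not part of Lucene.

```java
// Hypothetical illustration class, not part of Lucene.
final class SurrogateDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the given range of the buffer contains any surrogate code unit,
    // i.e. per-char processing of this range may be incorrect.
    static boolean hasSurrogates(char[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (Character.isHighSurrogate(buf[i]) || Character.isLowSurrogate(buf[i])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 'a' followed by U+10400, which UTF-16 encodes as the pair D801 DC00.
        String s = "a\uD801\uDC00";
        System.out.println(codeUnits(s));  // prints 3
        System.out.println(codePoints(s)); // prints 2
    }
}
```

Code that treats each char as a complete character sees three "characters" here, and the last two are meaningless on their own.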
Examples of affected code:
- StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't remove supplementary characters, or be modified to look for surrogates and behave correctly.
- LowerCaseFilter should be modified to lowercase supplementary characters correctly.
- CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
In all of these cases, code should remain optimized for the BMP case; supplementary characters should be the exception, but still work.
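One way the lowercasing and int-based predicate ideas could look is sketched below. This is an illustration under stated assumptions, not the fix that was committed: the class `CodePointLowerCase` and its methods are hypothetical, and the in-place rewrite assumes that lowercasing a supplementary code point yields another supplementary code point (so the char count does not change).

```java
// Hypothetical illustration class, not Lucene's actual implementation.
final class CodePointLowerCase {

    // Lowercases buf[off .. off+len) in place, one code point at a time.
    static void toLowerCase(char[] buf, int off, int len) {
        final int end = off + len;
        int i = off;
        while (i < end) {
            char c = buf[i];
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < end
                    && Character.isLowSurrogate(buf[i + 1]);
            if (!pair) {
                // Fast path: BMP character. Unpaired surrogates also land here
                // and are left unchanged by Character.toLowerCase(char).
                buf[i] = Character.toLowerCase(c);
                i++;
            } else {
                // Slow path: decode the pair, lowercase the code point, write it back.
                // Assumption: the result is also supplementary, so it fills both chars.
                int cp = Character.toLowerCase(Character.toCodePoint(c, buf[i + 1]));
                i += Character.toChars(cp, buf, i);
            }
        }
    }

    // An int-based predicate in the spirit of isTokenChar(int):
    // it accepts a full code point rather than a single char.
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        // 'A', U+10400 (an uppercase supplementary letter), 'Z'
        char[] buf = "A\uD801\uDC00Z".toCharArray();
        toLowerCase(buf, 0, buf.length);
        // U+10400 lowercases to U+10428, encoded as D801 DC28.
        System.out.println(new String(buf).equals("a\uD801\uDC28z")); // prints true
    }
}
```

The surrogate check costs one comparison per char on the BMP path, which keeps the common case cheap while supplementary characters still go through the code point APIs.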
Issue Links
- incorporates
  - LUCENE-2847 Support all of unicode in StandardTokenizer (Closed)
  - LUCENE-2068 fix reverseStringFilter for unicode 4.0 (Closed)
  - LUCENE-2183 Supplementary Character Handling in CharTokenizer (Closed)
  - LUCENE-2069 fix LowerCaseFilter for unicode 4.0 (Closed)
  - LUCENE-2070 document LengthFilter wrt Unicode 4.0 (Closed)
- is related to
  - LUCENE-2094 Prepare CharArraySet for Unicode 4.0 (Closed)