Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- 2.4
- None
- Environment: Linux x86_64, Sun Java 1.6
Description
The standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
The word "moͤchte" is incorrectly tokenized into "mo" and "chte"; the combining character is lost.
The expected result is a single token, "moͤchte".
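
The behaviour can be illustrated with a minimal Python sketch. It is not Lucene's tokenizer grammar. The function `tokenize` and its `keep_marks` flag are hypothetical names introduced only for this example, and it assumes that the split happens because the combining mark is not treated as part of a word.

```python
import unicodedata

WORD = "mo\u0364chte"  # "moͤchte": U+0364 follows the "o"


def tokenize(text, keep_marks):
    """Split text into runs of word characters.

    Hypothetical helper, not a Lucene API. Letters (category L*) always
    count as word characters; combining marks (category M*) count only
    when keep_marks is True.
    """
    tokens, current = [], []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L") or (keep_marks and cat.startswith("M")):
            current.append(ch)
        elif current:
            # A non-word character ends the current token and is dropped.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# U+0364 is a nonspacing mark, not a letter.
print(unicodedata.category("\u0364"))   # Mn
print(tokenize(WORD, keep_marks=False))  # two tokens, mark lost
print(tokenize(WORD, keep_marks=True))   # one token, mark kept
```

With `keep_marks=False` the sketch reproduces the reported split into "mo" and "chte"; with `keep_marks=True` it yields the single expected token.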
Attachments
Issue Links
- is part of
  - LUCENE-1488 multilingual analyzer based on icu (Closed)
  - LUCENE-2167 Implement StandardTokenizer with the UAX#29 Standard (Closed)