[LUCENE-1545] Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4
Fix Version/s: 3.1, 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Environment: Linux x86_64, Sun Java 1.6

Description

Standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
The word "moͤchte" is incorrectly tokenized into "mo" and "chte"; the combining character is lost.
The expected result is a single token, "moͤchte".
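
The behavior described above can be illustrated without Lucene. The following is a minimal sketch using only the JDK, not the StandardTokenizer grammar itself and not the attached AnalyzerTest.java. The class `CombiningMarkDemo` and its helpers `tokenize` and `isMark` are hypothetical names introduced for this illustration. It assumes a simplified tokenizer that breaks on every code point that is not a letter or digit. U+0364 has the general category Mn (nonspacing mark), so such a tokenizer splits the word at the mark unless marks are explicitly attached to the preceding word character.

```java
import java.util.ArrayList;
import java.util.List;

class CombiningMarkDemo {

    // Hypothetical helper: true for Unicode mark categories (Mn, Mc, Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Hypothetical simplified tokenizer: a token is a run of letters/digits.
    // When keepMarks is true, a combining mark that follows a word character
    // is kept inside the current token instead of ending it.
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = Character.isLetterOrDigit(cp)
                    || (keepMarks && current.length() > 0 && isMark(cp));
            if (wordChar) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other code point ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "mo\u0364chte"; // "moͤchte" with U+0364 after the 'o'
        System.out.println(tokenize(word, false)); // mark treated as a break
        System.out.println(tokenize(word, true));  // mark attached to the token
    }
}
```

With `keepMarks` set to false the sketch returns the two tokens "mo" and "chte", matching the reported behavior; with `keepMarks` set to true it returns the single token "moͤchte", matching the expected result.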

Attachments

AnalyzerTest.java
20/Feb/09 15:24
0.5 kB
Andreas Hauser

Issue Links

is part of

LUCENE-1488 multilingual analyzer based on icu (Closed)

LUCENE-2167 Implement StandardTokenizer with the UAX#29 Standard (Closed)

People

Assignee: Steven Rowe

Reporter: Andreas Hauser

Votes: 0

Watchers: 1

Dates

Created: 20/Feb/09 15:21

Updated: 28/Aug/22 11:57

Resolved: 29/Sep/10 05:53