Lucene - Core / LUCENE-1545

Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Linux x86_64, Sun Java 1.6

      Description

      The standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
      The word "moͤchte" is incorrectly split into two tokens, "mo" and "chte", and the combining character is lost.
      The expected result is a single token, "moͤchte".
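
The expected behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not Lucene's StandardAnalyzer or the fix applied for this issue; the class `CombiningMarkTokenizerSketch` and its methods are hypothetical names introduced for illustration. It assumes a simplified rule: a token is a run of letters and digits, and a combining mark (Unicode general category Mn, Mc, or Me) that follows a token character stays attached to that token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Lucene code.
class CombiningMarkTokenizerSketch {

    /** Returns true for combining marks (general categories Mn, Mc, Me). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /**
     * Splits text into runs of letters and digits. A combining mark is kept
     * only when it extends a token that has already started, so U+0364
     * following "mo" does not end the token.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean partOfToken = Character.isLetterOrDigit(cp)
                || (current.length() > 0 && isCombiningMark(cp));
            if (partOfToken) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "mo" + U+0364 + "chte", the word from the report.
        String word = "mo\u0364chte";
        System.out.println(tokenize(word));
    }
}
```

Under this assumed rule the word from the report stays in one piece, whereas a tokenizer that treats U+0364 as a separator would yield "mo" and "chte" as reported.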

      1. AnalyzerTest.java
        0.5 kB
        Andreas Hauser

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Andreas Hauser
            • Votes:
              0
              Watchers:
              1
