  1. Lucene - Core
  2. LUCENE-1545

Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Linux x86_64, Sun Java 1.6

      Description

      Standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
      The word "moͤchte" is incorrectly split into two tokens, "mo" and "chte", and the combining character is lost.
      The expected result is a single token, "moͤchte".
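
The behaviour described above can be illustrated without Lucene. The following is a minimal sketch using only the Java standard library; it is not the StandardAnalyzer grammar and not the attached AnalyzerTest.java. The class `CombiningMarkDemo` and the helpers `isCombiningMark` and `tokenize` are hypothetical names introduced for illustration. It assumes a simplified tokenizer that treats runs of letters as words: because U+0364 is a non-spacing mark rather than a letter, a letters-only rule breaks the word at that position, while a rule that keeps combining marks attached to the preceding letter yields one token.

```java
import java.util.ArrayList;
import java.util.List;

class CombiningMarkDemo {
    // Hypothetical helper (not a Lucene API): true for Unicode combining marks (Mn, Mc, Me).
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Simplified tokenizer: emits runs of letters. When keepMarks is true,
    // a combining mark that follows a letter stays inside the current token.
    static List<String> tokenize(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = Character.isLetter(cp)
                || (keepMarks && current.length() > 0 && isCombiningMark(cp));
            if (wordChar) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "mo\u0364chte"; // "mo" + U+0364 + "chte"
        System.out.println(tokenize(word, false)); // [mo, chte]
        System.out.println(tokenize(word, true));  // one token containing U+0364
    }
}
```

With `keepMarks` set to false the sketch reproduces the reported split into "mo" and "chte"; with it set to true the word comes back as the single expected token.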

        Attachments

        1. AnalyzerTest.java
          0.5 kB
          Andreas Hauser

              People

              • Assignee:
                steve_rowe Steve Rowe
                Reporter:
                andyhauser Andreas Hauser
              • Votes:
                0
                Watchers:
                1
