[LUCENE-1545] Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Linux x86_64, Sun Java 1.6

      Description

      Standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
      The word "moͤchte" is incorrectly tokenized into "mo" and "chte"; the combining character is lost.
      The expected result is a single token, "moͤchte".
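      As an illustration (not part of the original report): U+0364 has Unicode category Mn ("nonspacing mark") rather than Letter, which is why a letters-only tokenizer grammar breaks the word at that position. A minimal Java sketch (class name hypothetical) that prints the relevant character properties:

      public class CombiningMarkDemo {
        public static void main(String[] args) {
          // "moͤchte" written with an explicit escape for U+0364 COMBINING LATIN SMALL LETTER E
          String word = "mo\u0364chte";
          for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // U+0364 reports isLetter() == false but is a nonspacing mark,
            // which is why a letters-only grammar splits "mo" from "chte".
            System.out.printf("U+%04X letter=%b combiningMark=%b%n",
                (int) c,
                Character.isLetter(c),
                Character.getType(c) == Character.NON_SPACING_MARK);
          }
        }
      }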

      Attachments:
      1. AnalyzerTest.java (0.5 kB, Andreas Hauser)

        Issue Links

          Activity

          Andreas Hauser created issue -
          Andreas Hauser added a comment -

          $ java -Dfile.encoding=UTF-8 -cp lib/lucene-core-2.4-20090219.021329-1.jar:. AnalyzerTest
          (mo,0,2,type=<ALPHANUM>)
          (chte,3,7,type=<ALPHANUM>)
          (m,8,9,type=<ALPHANUM>)
          (mo,10,12,type=<ALPHANUM>)
          (chte,13,17,type=<ALPHANUM>)
          $ locale
          LANG=de_DE.UTF-8
          LC_CTYPE="de_DE.UTF-8"
          LC_NUMERIC="de_DE.UTF-8"
          LC_TIME="de_DE.UTF-8"
          LC_COLLATE=de_DE.UTF-8
          LC_MONETARY="de_DE.UTF-8"
          LC_MESSAGES=de_DE.UTF-8
          LC_PAPER="de_DE.UTF-8"
          LC_NAME="de_DE.UTF-8"
          LC_ADDRESS="de_DE.UTF-8"
          LC_TELEPHONE="de_DE.UTF-8"
          LC_MEASUREMENT="de_DE.UTF-8"
          LC_IDENTIFICATION="de_DE.UTF-8"
          LC_ALL=

          Andreas Hauser made changes -
          Field Original Value New Value
          Attachment AnalyzerTest.java [ 12400612 ]
          Robert Muir added a comment -

          This is an example of why I started messing with LUCENE-1488.

          Mark Miller made changes -
          Link This issue is part of LUCENE-1488 [ LUCENE-1488 ]
          Mark Miller added a comment -

          Feel free to switch back, but for now I'm going to mark this as part of LUCENE-1488, as offhand, that looks like the best solution for this issue. As that issue is not marked 2.9 at the moment, I'm pushing this off to 3.0.

          Mark Miller made changes -
          Fix Version/s 3.0 [ 12312889 ]
          Fix Version/s 2.9 [ 12312682 ]
          Priority Major [ 3 ] Minor [ 4 ]
          Robert Muir added a comment -

          If you are looking for a more short-term solution (since I think LUCENE-1488 will take quite a bit more time), it would be possible to make StandardAnalyzer more 'unicode-friendly'.

          It's not possible to make it 'correct', and adding additional Unicode friendliness would make backwards compatibility a much more complex issue (different Unicode versions across JVM versions, etc.).

          But if you want, I'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this.
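          As an aside (not part of this comment, and not the grammar change being proposed): one low-tech workaround in the same spirit is a custom tokenizer that treats combining marks as token characters, so "mo\u0364chte" survives as one token. A minimal sketch against the pre-3.1 CharTokenizer API, where isTokenChar still takes a char; the class name is hypothetical:

          import java.io.Reader;
          import org.apache.lucene.analysis.CharTokenizer;

          // Hypothetical workaround sketch: a letter tokenizer that also accepts
          // nonspacing marks. Targets the pre-3.1 CharTokenizer API
          // (isTokenChar(char)); later releases changed this signature.
          public class LetterOrMarkTokenizer extends CharTokenizer {
            public LetterOrMarkTokenizer(Reader in) {
              super(in);
            }

            @Override
            protected boolean isTokenChar(char c) {
              return Character.isLetter(c)
                  || Character.getType(c) == Character.NON_SPACING_MARK;
            }
          }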

          Michael McCandless added a comment -

          Mark, when we push, we should push to 3.1 not 3.0 (I just added a 3.1 version to Jira for Lucene)... because 3.0 will come quickly after 2.9 and will "only" remove deprecations, etc.

          Michael McCandless made changes -
          Fix Version/s 3.1 [ 12314025 ]
          Fix Version/s 3.0 [ 12312889 ]
          Michael McCandless added a comment -

          but if you want, i'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this.

          Is it possible to conditionalize, at runtime, certain parts of a JFlex grammar? I.e., with matchVersion (LUCENE-1684) we could preserve back-compat on this issue, but I'm not sure how to cleanly push that matchVersion (provided at runtime to StandardAnalyzer's ctor) "down" into the grammar so that, e.g., we're not forced to make a new full copy of the grammar for each fix. (Though perhaps that's an OK solution, since it would make it easy to strongly guarantee back compat...)
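          As an illustration (not from the original thread): expressed with the classes that 3.1 eventually shipped, the matchVersion idea could look roughly like the following sketch; the factory class name is hypothetical, and this is a sketch of the approach rather than the actual StandardAnalyzer implementation. It assumes the 3.1 jar, where both ClassicTokenizer (old grammar) and the Unicode-aware StandardTokenizer exist.

          import java.io.Reader;
          import org.apache.lucene.analysis.Tokenizer;
          import org.apache.lucene.analysis.standard.ClassicTokenizer;
          import org.apache.lucene.analysis.standard.StandardTokenizer;
          import org.apache.lucene.util.Version;

          // Hypothetical sketch of matchVersion-based selection: callers asking for
          // pre-3.1 behaviour get the old grammar, newer callers get the
          // Unicode-aware grammar.
          public class MatchVersionTokenizerSketch {
            public static Tokenizer create(Version matchVersion, Reader reader) {
              if (matchVersion.onOrAfter(Version.LUCENE_31)) {
                return new StandardTokenizer(matchVersion, reader);
              }
              return new ClassicTokenizer(matchVersion, reader);
            }
          }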

          Robert Muir added a comment -

          Michael, I don't see a way in the manual to do it.

          It's not just the rules, but also the JRE used to compile the rules (and its underlying Unicode definitions), so you might need separate StandardTokenizerImpls to really control the thing...

          Robert Muir made changes -
          Component/s contrib/analyzers [ 12312333 ]
          Component/s Analysis [ 12310230 ]
          Robert Muir made changes -
          Link This issue is part of LUCENE-2167 [ LUCENE-2167 ]
          Steve Rowe added a comment -

          I updated AnalyzerTest.java:

          import java.io.FileOutputStream;
          import java.io.OutputStreamWriter;
          import java.io.StringReader;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.util.Version;
          
          public class AnalyzerTest {
            public static void test() throws Exception {
              StandardAnalyzer a = new StandardAnalyzer(Version.LUCENE_31);
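              // Test string contains the word twice: once with the combining character typed
              // literally ("moͤchte") and once via the Java escape ("mo\u0364chte").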
              TokenStream ts = a.tokenStream("", new StringReader("moͤchte m mo\u0364chte "));
              OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("output.txt"), "UTF-8");
              while (ts.incrementToken()) {
                writer.append(ts.toString()).append(System.getProperty("line.separator"));
              }
              writer.flush();
              writer.close();
            }
          
            public static void main(String[] argv) throws Exception {
              test();
            }
          }
          

          Here's what goes into output.txt when I compile AnalyzerTest.java with javac -encoding UTF-8 -cp "lucene/dev/branches/branch_3x/lucene/build/lucene-core-3.1-SNAPSHOT.jar" and then run AnalyzerTest:

          (moͤchte,startOffset=0,endOffset=7,positionIncrement=1,type=<ALPHANUM>)
          (m,startOffset=8,endOffset=9,positionIncrement=1,type=<ALPHANUM>)
          (moͤchte,startOffset=10,endOffset=17,positionIncrement=1,type=<ALPHANUM>)
          

          With LUCENE-2167 committed on the 3.X branch and on trunk, I think this issue is resolved. Please reopen if you see different behavior.

          Steve Rowe made changes -
          Assignee Steven Rowe [ steve_rowe ]
          Steve Rowe made changes -
          Fix Version/s 3.1 [ 12314822 ]
          Lucene Fields [New]
          Robert Muir added a comment -

          Fixed in LUCENE-2167.

          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Mark Thomas made changes -
          Workflow jira [ 12453058 ] Default workflow, editable Closed status [ 12563782 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563782 ] jira [ 12585322 ]
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Shai Erera made changes -
          Component/s modules/analysis [ 12310230 ]
          Component/s contrib/analyzers [ 12312333 ]

            People

            • Assignee: Steve Rowe
            • Reporter: Andreas Hauser
            • Votes: 0
            • Watchers: 1
