Lucene - Core
LUCENE-1545

Standard analyzer does not correctly tokenize combining character U+0364 COMBINING LATIN SMALL LETTER E

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Linux x86_64, Sun Java 1.6

      Description

      Standard analyzer does not correctly tokenize the combining character U+0364 COMBINING LATIN SMALL LETTER E.
      The word "moͤchte" is incorrectly tokenized into "mo" and "chte"; the combining character is lost.
      The expected result is a single token, "moͤchte".
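      To make the input concrete, here is a small illustration (plain JDK, not part of the original report) that prints each character of the affected word and how the JDK classifies it: U+0364 is a non-spacing combining mark (general category Mn), not a letter, which is why a tokenizer grammar that only accepts letters and digits splits the word around it.

        // Sketch only (not from this issue): show that U+0364 inside "moͤchte" is a combining mark,
        // i.e. neither a letter nor a digit, so a letters-and-digits-only grammar breaks the word there.
        public class ShowCombiningMark {
          public static void main(String[] args) {
            String word = "mo\u0364chte"; // the word from the description, with U+0364 written as an escape
            for (int i = 0; i < word.length(); i++) {
              char c = word.charAt(i);
              String kind = Character.isLetter(c) ? "letter"
                  : Character.getType(c) == Character.NON_SPACING_MARK ? "combining mark (Mn)"
                  : "other";
              System.out.printf("index %d: U+%04X %s%n", i, (int) c, kind);
            }
          }
        }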

      Attachments

      1. AnalyzerTest.java (0.5 kB, Andreas Hauser)

        Issue Links

          Activity

          gsingers Grant Ingersoll added a comment -

          Bulk close for 3.1

          rcmuir Robert Muir added a comment -

          Fixed in LUCENE-2167.

          steve_rowe Steve Rowe added a comment -

          I updated AnalyzerTest.java:

          import java.io.FileOutputStream;
          import java.io.OutputStreamWriter;
          import java.io.StringReader;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.util.Version;
          
          public class AnalyzerTest {
            public static void test() throws Exception {
              StandardAnalyzer a = new StandardAnalyzer(Version.LUCENE_31);
              // The word appears twice: once typed directly and once with the combining
              // character written as a Java Unicode escape, plus a lone "m" in between.
              TokenStream ts = a.tokenStream("", new StringReader("moͤchte m mo\u0364chte "));
              // Write the tokens to a file as UTF-8 so the combining character survives
              // regardless of the console encoding.
              OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("output.txt"), "UTF-8");
              while (ts.incrementToken()) {
                // One line per token; toString() shows the term plus its offsets and type.
                writer.append(ts.toString()).append(System.getProperty("line.separator"));
              }
              writer.flush();
              writer.close();
            }
          
            public static void main(String[] argv) throws Exception {
              test();
            }
          }
          

          Here's what goes into output.txt when I compile AnalyzerTest.java with javac -encoding UTF-8 -cp "lucene/dev/branches/branch_3x/lucene/build/lucene-core-3.1-SNAPSHOT.jar" and then run AnalyzerTest:

          (moͤchte,startOffset=0,endOffset=7,positionIncrement=1,type=<ALPHANUM>)
          (m,startOffset=8,endOffset=9,positionIncrement=1,type=<ALPHANUM>)
          (moͤchte,startOffset=10,endOffset=17,positionIncrement=1,type=<ALPHANUM>)
          

          With LUCENE-2167 committed on the 3.X branch and on trunk, I think this issue is resolved. Please reopen if you see different behavior.

          rcmuir Robert Muir added a comment -

          Michael, I don't see a way from the manual to do it.

          It's not just the rules but also the JRE used to compile the rules (and its underlying Unicode definitions), so you might need separate StandardTokenizerImpls to really control the thing...
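
          A sketch of the "separate StandardTokenizerImpls" idea (not the code that was eventually committed): rather than conditionalizing the JFlex grammar at runtime, keep one generated scanner per behavior and pick between them in the tokenizer's constructor based on the version the caller asks to match. Every name below is a stand-in; MatchVersion stands in for org.apache.lucene.util.Version, and the scanner classes stand in for the JFlex-generated implementations.

          // Sketch of version-gated scanner selection; all class names are illustrative stand-ins.
          public class VersionGatedScannerDemo {

            /** Stand-in for org.apache.lucene.util.Version. */
            enum MatchVersion {
              LUCENE_29, LUCENE_30, LUCENE_31;
              boolean onOrAfter(MatchVersion other) { return compareTo(other) >= 0; }
            }

            /** Stand-in for the interface both generated scanners would implement. */
            interface Scanner { String describe(); }

            /** Stand-in for a scanner generated from the old grammar (keeps the pre-fix tokenization). */
            static class ClassicScannerImpl implements Scanner {
              public String describe() { return "classic grammar: drops combining marks and splits the word"; }
            }

            /** Stand-in for a scanner generated from a fixed grammar. */
            static class FixedScannerImpl implements Scanner {
              public String describe() { return "fixed grammar: keeps combining marks inside tokens"; }
            }

            /** The tokenizer picks a generated scanner at runtime; the grammars themselves stay static. */
            static Scanner scannerFor(MatchVersion matchVersion) {
              return matchVersion.onOrAfter(MatchVersion.LUCENE_31)
                  ? new FixedScannerImpl()
                  : new ClassicScannerImpl();
            }

            public static void main(String[] args) {
              System.out.println(scannerFor(MatchVersion.LUCENE_29).describe()); // old behavior preserved
              System.out.println(scannerFor(MatchVersion.LUCENE_31).describe()); // new behavior
            }
          }

          Back-compat then costs one extra generated scanner class per behavior change instead of runtime branches inside the grammar, which is the trade-off described in this comment.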

          mikemccand Michael McCandless added a comment -

          "but if you want, I'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this."

          Is it possible to conditionalize, at runtime, certain parts of a JFlex grammar? I.e., with matchVersion (LUCENE-1684) we could preserve back-compat on this issue, but I'm not sure how to cleanly push that matchVersion (provided at runtime to StandardAnalyzer's ctor) "down" into the grammar so that, e.g., we're not forced to make a new full copy of the grammar for each fix. (Though perhaps that's an OK solution, since it'd make it easy to strongly guarantee back compat...)

          mikemccand Michael McCandless added a comment -

          Mark, when we push, we should push to 3.1, not 3.0 (I just added a 3.1 version to JIRA for Lucene)... because 3.0 will come quickly after 2.9 and will "only" remove deprecations, etc.

          rcmuir Robert Muir added a comment -

          If you are looking for a more short-term solution (since I think LUCENE-1488 will take quite a bit more time), it would be possible to make StandardAnalyzer more 'unicode-friendly'.

          It's not possible to make it 'correct', and adding additional Unicode friendliness would make backwards compat a much more complex issue (different Unicode versions across JVM versions, etc.).

          But if you want, I'm willing to come up with some minor grammar changes for StandardAnalyzer that could help things like this.

          markrmiller@gmail.com Mark Miller added a comment -

          Feel free to switch back, but for now I'm going to mark this as part of LUCENE-1488, as, offhand, that looks like the best solution for this issue. As that issue is not marked 2.9 at the moment, I'm pushing this off to 3.0.

          rcmuir Robert Muir added a comment -

          This is an example of why I started messing with LUCENE-1488.

          andyhauser Andreas Hauser added a comment -

          $ java -Dfile.encoding=UTF-8 -cp lib/lucene-core-2.4-20090219.021329-1.jar:. AnalyzerTest
          (mo,0,2,type=<ALPHANUM>)
          (chte,3,7,type=<ALPHANUM>)
          (m,8,9,type=<ALPHANUM>)
          (mo,10,12,type=<ALPHANUM>)
          (chte,13,17,type=<ALPHANUM>)
          $ locale
          LANG=de_DE.UTF-8
          LC_CTYPE="de_DE.UTF-8"
          LC_NUMERIC="de_DE.UTF-8"
          LC_TIME="de_DE.UTF-8"
          LC_COLLATE=de_DE.UTF-8
          LC_MONETARY="de_DE.UTF-8"
          LC_MESSAGES=de_DE.UTF-8
          LC_PAPER="de_DE.UTF-8"
          LC_NAME="de_DE.UTF-8"
          LC_ADDRESS="de_DE.UTF-8"
          LC_TELEPHONE="de_DE.UTF-8"
          LC_MEASUREMENT="de_DE.UTF-8"
          LC_IDENTIFICATION="de_DE.UTF-8"
          LC_ALL=


            People

            • Assignee: steve_rowe Steve Rowe
            • Reporter: andyhauser Andreas Hauser
            • Votes: 0
            • Watchers: 1
