Lucene - Core
LUCENE-5030

FuzzySuggester has to operate on FSTs of Unicode letters (code points), not UTF-8 bytes, to work correctly for both 1-byte (e.g. English) and multi-byte (non-Latin) letters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      There is a limitation in the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space.

      This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST.
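
      For illustration, a minimal sketch of that pipeline (a hypothetical helper assuming the 4.x automaton APIs; not the committed patch):

        // Hedged sketch: fuzz in Unicode code point space, then convert the
        // automaton to UTF-8 bytes before intersecting with the suggest FST.
        import org.apache.lucene.util.automaton.Automaton;
        import org.apache.lucene.util.automaton.BasicOperations;
        import org.apache.lucene.util.automaton.LevenshteinAutomata;
        import org.apache.lucene.util.automaton.UTF32ToUTF8;

        class FuzzyPipelineSketch {
          static Automaton toFuzzyUtf8Automaton(String key, int maxEdits) {
            // Decode the key into code points (not UTF-16 chars or UTF-8 bytes).
            int[] codePoints = new int[Character.codePointCount(key, 0, key.length())];
            for (int i = 0, j = 0; i < key.length(); j++) {
              codePoints[j] = key.codePointAt(i);
              i += Character.charCount(codePoints[j]);
            }
            // LevN expansion in code point space, as FuzzyQuery does.
            Automaton unicode =
                new LevenshteinAutomata(codePoints, Character.MAX_CODE_POINT, false)
                    .toAutomaton(maxEdits);
            // Convert to the UTF-8 byte equivalent (as CompiledAutomaton does),
            // so it can be intersected with the byte-based suggest FST.
            Automaton utf8 = new UTF32ToUTF8().convert(unicode);
            BasicOperations.determinize(utf8);
            return utf8;
          }
        }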

      See the discussion here: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-td4067018.html#none

      Attachments

      1. benchmark-INFO_SEP.txt (4 kB) - Artem Lukanin
      2. benchmark-old.txt (4 kB) - Artem Lukanin
      3. benchmark-wo_convertion.txt (4 kB) - Artem Lukanin
      4. LUCENE-5030.patch (28 kB) - Artem Lukanin
      5. LUCENE-5030.patch (29 kB) - Artem Lukanin
      6. LUCENE-5030.patch (30 kB) - Artem Lukanin
      7. LUCENE-5030.patch (29 kB) - Michael McCandless
      8. LUCENE-5030.patch (178 kB) - Artem Lukanin
      9. LUCENE-5030.patch (175 kB) - Artem Lukanin
      10. nonlatin_fuzzySuggester_combo.patch (187 kB) - Artem Lukanin
      11. nonlatin_fuzzySuggester_combo1.patch (37 kB) - Artem Lukanin
      12. nonlatin_fuzzySuggester_combo2.patch (184 kB) - Artem Lukanin
      13. nonlatin_fuzzySuggester.patch (168 kB) - Artem Lukanin
      14. nonlatin_fuzzySuggester.patch (167 kB) - Artem Lukanin
      15. nonlatin_fuzzySuggester.patch (68 kB) - Artem Lukanin
      16. nonlatin_fuzzySuggester1.patch (169 kB) - Artem Lukanin
      17. nonlatin_fuzzySuggester2.patch (135 kB) - Artem Lukanin
      18. nonlatin_fuzzySuggester3.patch (117 kB) - Artem Lukanin
      19. nonlatin_fuzzySuggester4.patch (118 kB) - Artem Lukanin
      20. run-suggest-benchmark.patch (2 kB) - Michael McCandless

        Activity

        Artem Lukanin added a comment -

        I've added a test which demonstrates the bug. I have fixed TokenStreamToAutomaton, but I have no idea how to update AnalyzingSuggester, which wants bytes instead of chars (ints, which cannot fit in a byte).

        Michael McCandless added a comment -

        Thanks Artem!

        I think we need to change POS_SEP and HOLE? 256/257 are no longer safe to use?

        Also, I think TokenStreamToAutomaton should operate with Unicode code points, not code units (Java's char)? Probably we should fork TS2A to a new class (TokenStreamToUnicodeAutomaton or something?).

        Then, in FuzzySuggester, after calling toLevenshteinAutomata and before calling FSTUtil.intersectPrefixPaths, we need to insert a call to UTF32ToUTF8().convert(automaton), just like in CompiledAutomaton.java. That call will translate the fuzzed-up unicode Automaton into the UTF8 equivalent, and then we should be able to pass that on to intersect.

        Also, I think you'll need to fix that 255 in FuzzySuggester.toLevenshteinAutomata ... in fact, just remove it, so we use the ctor that passes max unicode code point.

        Artem Lukanin added a comment -

        Now all the tests pass except testRandom when preserveSep is true.

        Michael, can you explain to me how this preserve-separator feature works?

        Michael McCandless added a comment -

        Hmm, POS_SEP and HOLE are still eating into Unicode's space (the last 2 code points)? Maybe we should just use Integer.MAX_VALUE and MAX_VALUE-1?

        In fact I think you'll need to make POS_SEP and HOLE consistent across both of the TS2A classes. I think you should just define them in TokenStreamToAutomaton.java, and then reference those constants from TokenStreamToUnicodeAutomaton.java?

        There's a lot of whitespace noise here ... can you remove those changes so we can more easily see the "real" changes? Thanks.

        The preserveSep option means when finding a match we must "respect" the boundaries between tokens. It could be you're seeing tests fail because POS_SEP was different between the two classes?

        Artem Lukanin added a comment -

        Sorry for the autoformatting; I will upload the patch without it.
        Since we use Integer.MAX_VALUE, we do not need EscapingTokenStreamToUnicodeAutomaton any more, because the text will not contain such a code point?

        Artem Lukanin added a comment -

        BTW, if I replace it with Integer.MAX_VALUE, UTF32ToUTF8().convert(unicodeAutomaton) will not work any more: it will not convert Integer.MAX_VALUE into a valid byte sequence (it converts it into 1fff bf bf bf), because Integer.MAX_VALUE is not a valid code point (the maximum is U+10FFFF). So I guess we still need EscapingTokenStreamToUnicodeAutomaton.

        Artem Lukanin added a comment -

        the patch without autoformatting

        Artem Lukanin added a comment -

        I see the patch still has autoformatting of some spaces. Sorry, I guess I cannot stop IntelliJ IDEA from doing it.

        Artem Lukanin added a comment -

        with untouched trailing spaces

        Michael McCandless added a comment -

        Oh, right, we can't just use MAX_VALUE: it must be a valid unicode char since we will send it through UTF32toUTF8.

        Also, it's easiest if that char survives to UTF8 as a single byte to keep replaceSep [relatively] simple.

        Maybe we "steal" two unicode chars? Maybe INFO_SEP (U+001F) and INFO_SEP2 (U+001E), and we document that these chars are not allowed on the input? (We could also try, maybe later as a separate issue, to escape them when they occur, like EscapingTokenStreamToUnicodeAutomaton now does if it sees 0xFF on the input).

        Artem Lukanin added a comment -

        You already have

        private static final int PAYLOAD_SEP = '\u001f';

        in AnalyzingSuggester.

        Artem Lukanin added a comment -

        I have fixed testRandom, which repeats the logic of FuzzySuggester.
        Now all the tests pass.
        Please review.

        Artem Lukanin added a comment -

        I see that some tests in AnalyzingSuggesterTest fail, so I have to look into what's wrong...

        Artem Lukanin added a comment -

        Now tests in FuzzySuggesterTest and AnalyzingSuggesterTest pass, except for AnalyzingSuggesterTest.testRandom (when preserveSep = true).

        If I enable VERBOSE, I see that the suggestions are correct. I guess there is a bug in the test, but I cannot find it.

        Can you please review?

        Robert Muir added a comment -

        I don't think changing SEP_LABEL from a single byte to 4 bytes is necessarily a good idea.

        I think benchmarks (size and speed) should be run on this change before we jump into it. I'm also concerned about the determinization and such being in the middle of an autosuggest request ... this seems like it would be way, way too slow.

        Artem Lukanin added a comment -

        Possibly we should change it to INFO_SEP2 (U+001E), as Michael suggested for TokenStreamToAutomaton?
        Do you like the 0x10ffff and 0x10fffe separators in TokenStreamToAutomaton? Won't they slow down the process?
        I guess Michael is the one who runs benchmarks regularly? I don't know how to do it...

        Michael McCandless added a comment -

        The easy performance tester to run is lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java ... we should test that first I think? I can also run one based on FreeDB ... the sources are in luceneutil (https://code.google.com/a/apache-extras.org/p/luceneutil/ ).

        If the perf hit is too much then one option would be to make it optional (whether we count edits in Unicode space or UTF-8 space), or maybe just another suggester class (FuzzyUnicodeSuggester?).

        I think we can use INFO_SEP: yes, this is used for PAYLOAD_SEP, but that only means the incoming surfaceForm cannot contain this char, I think? So ... I think we are free to use it in the analyzed form? Or did something go wrong when you tried?

        Whichever chars we use (steal), we should add checks that these chars do not occur in the input...

        Artem Lukanin added a comment -

        I ran this command:

        ant -Dtestcase=LookupBenchmarkTest clean test

        and got the same results for the patched and the original version:

        [junit4:junit4] Tests summary: 1 suite, 0 tests
             [echo] 5 slowest tests:
        [junit4:tophints]  22.95s | org.apache.lucene.search.spell.TestSpellChecker
        [junit4:tophints]  22.70s | org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest
        [junit4:tophints]  15.08s | org.apache.lucene.search.suggest.fst.TestSort
        [junit4:tophints]  11.84s | org.apache.lucene.search.suggest.fst.FSTCompletionTest
        [junit4:tophints]  11.24s | org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
        
        Artem Lukanin added a comment -

        I used INFO_SEP and INFO_SEP2 for separators and holes. All the tests pass (I have fixed AnalyzingSuggesterTest.testStolenBytes). The benchmark is improved:

        [junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
        [junit4:junit4] Completed in 0.04s, 0 tests
        [junit4:junit4] 
        [junit4:junit4] JVM J0:     1.64 ..     2.34 =     0.71s
        [junit4:junit4] Execution time total: 2.36 sec.
        [junit4:junit4] Tests summary: 1 suite, 0 tests
             [echo] 5 slowest tests:
        [junit4:tophints]  22.95s | org.apache.lucene.search.spell.TestSpellChecker
        [junit4:tophints]  15.08s | org.apache.lucene.search.suggest.fst.TestSort
        [junit4:tophints]  13.41s | org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest
        [junit4:tophints]  11.84s | org.apache.lucene.search.suggest.fst.FSTCompletionTest
        [junit4:tophints]  10.78s | org.apache.lucene.search.suggest.analyzing.FuzzySuggesterTest
        
        Michael McCandless added a comment -

        Hi Artem,

        Sorry, running the LookupBenchmarkTest is tricky ... you need to make temporary changes in 3 places. I'm attaching a patch that should let you run it by just doing "ant test -Dtestcase=LookupBenchmarkTest".

        Artem Lukanin added a comment -

        OK, in general the performance is twice as bad:
        before my patch: Total time: 9 minutes 20 seconds
        after my patch with INFO_SEP and INFO_SEP2: Total time: 18 minutes 31 seconds

        I guess the reason is UTF32ToUTF8().convert(unicodeAutomaton), so it would be better to add the correct transitions on the fly ... but possibly you can postpone this for another issue?

        Michael McCandless added a comment -

        Hmm can you post the full output of the benchmark? It measures different things in each test case.

        Given this, I guess we should make a separate fuzzy suggester class (that measures edit distance in Unicode code point space)? Or make it a boolean option on the current class ...

        Can you also post your last patch (cutting over to INFO_SEP/2 for the stolen chars)?

        Artem Lukanin added a comment -

        The last patch with INFO_SEP/2 was posted today 20/Jun/13 at 10:52

        Michael McCandless added a comment -

        Oh, woops, I missed it. Thanks.

        Artem Lukanin added a comment (edited) -

        I'm uploading 3 results of benchmarking:
        1) benchmark-old.txt - before the patch
        2) benchmark-INFO_SEP.txt - after the patch
        3) benchmark-wo_convertion.txt - after the patch, but with the conversion lines commented out in several places, like

              //Automaton utf8lookupAutomaton = new UTF32ToUTF8().convert(lookupAutomaton);
              //BasicOperations.determinize(utf8lookupAutomaton);

        because they are useless for words consisting only of Latin letters.

        As you can see, the conversion takes too much time.

        Artem Lukanin added a comment -

        OK, I will add a new option UNICODE_AWARE = 4, which will switch the conversion ON.
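
        For context, a sketch of how such a bit flag combines with the existing AnalyzingSuggester options (EXACT_FIRST = 1, PRESERVE_SEP = 2); the flag's final name and home change later in this thread:

          // Proposed option (sketch): the next free bit after the existing flags.
          public static final int UNICODE_AWARE = 4;

          // Callers would OR it in, and the suggester would test for it:
          // int options = EXACT_FIRST | PRESERVE_SEP | UNICODE_AWARE;
          // boolean unicodeAware = (options & UNICODE_AWARE) != 0;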

        Michael McCandless added a comment -

        I'm a little confused by the results. E.g., how come AnalyzingSuggester changes so much between old (57 kQPS), INFO_SEP (27 kQPS), and wo_conversion (41 kQPS) at the 2-4 prefix lengths? And why are the unrelated impls (TSS, FST, WFST, Jaspell) changing so much? Maybe this is just horrible Hotspot noise?

        Michael McCandless added a comment -

        Oh, duh, the conversion from Unicode -> UTF8, and the determinization, are in AnalyzingSuggester ... so it makes sense that it got slower.

        I agree we should add a UNICODE_AWARE option.

        Artem Lukanin added a comment -

        I have added the UNICODE_AWARE option in Lucene and Solr. Should I create a separate Solr issue for the updated Solr files, or can I upload a patch for all the updated files?

        Artem Lukanin added a comment -

        I have uploaded a lucene/solr combo patch with the new UNICODE_AWARE option.

        Michael McCandless added a comment -

        Thanks Artem!

        I don't understand why we needed to change AnalyzingSuggesterTest.testStolenBytes? That implies something is wrong w/ the escaping, I think? (I.e., results in that test should not change whether SEP is preserved or not.) So I'm confused what changed...

        Also, I think we don't need the check for SEP_LABEL in AnalyzingSuggester.lookup (that throws IllegalArgumentException)? (We "escape" this char.) But we should check for HOLE and throw IllegalArgumentException, since we don't escape it. And could you add a test confirming you get that exc if you try to add HOLE? Thanks.
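
        A sketch of such a test (scaffolding assumed from the 4.x suggest test suite; the actual test in the patch may differ):

          public void testIllegalLookupArgument() throws Exception {
            AnalyzingSuggester suggester = new AnalyzingSuggester(new MockAnalyzer(random()));
            suggester.build(new TermFreqArrayIterator(new TermFreq[] {
                new TermFreq("a", 50)
            }));
            try {
              suggester.lookup("a\u001E", false, 1); // key contains HOLE (U+001E)
              fail("should have thrown IllegalArgumentException");
            } catch (IllegalArgumentException expected) {
              // expected: HOLE is not escaped, so it must be rejected
            }
          }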

        Artem Lukanin added a comment -

        Sorry, I don't understand why testStolenBytes worked before. I have restored it and now it fails. Can you please suggest what I did wrong?

        As I understand it, if we do not preserve the separator, one token containing a separator and two separate tokens (which are actually one string with a separator) become equal after removing the separator in replaceSep, so we should get 2 results instead of 1 when we do a lookup. No?

        I've added a test for the IllegalArgumentException.

        Artem Lukanin added a comment (edited) -

        I have restored testStolenBytes completely and now all the tests pass (see nonlatin_fuzzySuggester_combo2.patch).

        But I'm not sure what you meant by the 0xff byte in

        token(new BytesRef(new byte[] {0x61, (byte) 0xff, 0x61}))

        - the letter ÿ, or SEP_LABEL?

        Now it is treated as the letter ÿ, but in the previous modification of the test I treated it as SEP_LABEL.

        Michael McCandless added a comment -

        Hmm, testStolenBytes should be using the 0x1f byte ... the intention of the test is to ensure that an incoming token that contains SEP_LABEL still works correctly (i.e., that the escaping we do is working).

        When I change the 0xff in the patch back to 0x1f I indeed see the (unexpected) failure without the PRESERVE_SEP option, which is curious because we do no escaping without PRESERVE_SEP.

        OK I see the issue: before, when POS_SEP was 256 and the input space was a byte, replaceSep always worked correctly because there was no way for any byte input to be confused with POS_SEP. But now that we are increasing the input space to all unicode chars, there is no "safe" value for POS_SEP.

        OK, given all this I think we should stop trying to not-steal the byte: I think we should simply declare that we steal both 0x1e and 0x1f. This means we can remove the escaping code, put back your previous code that I had asked you to remove (sorry) that threw IAE on 0x1f (and now also 0x1e), remove testStolenBytes, and then improve your new testIllegalLookupArgument to also verify that 0x1f gets the IllegalArgumentException?

        Also, we could maybe eliminate some code dup here, e.g. the two toFiniteStrings ... maybe by having TS2A and TS2UA share a base class / interface. Hmm, maybe we should just merge TS2UA back into TS2A, and add a unicodeAware option to it?

        Artem Lukanin added a comment -

        Done. Please review LUCENE-5030.patch.

        Artem Lukanin added a comment -

        BTW, for your

        // TODO: is there a Reader from a CharSequence?

        in AnalyzingSuggester.toLookupAutomaton: there is org.apache.commons.io.input.CharSequenceReader.
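
        (Hypothetical usage; note commons-io is not a Lucene core dependency:)

          import java.io.Reader;
          import org.apache.commons.io.input.CharSequenceReader;

          Reader reader = new CharSequenceReader(key); // key is any CharSequence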

        Michael McCandless added a comment -

        Thanks Artem, new patch looks good.

        Thanks for the tip about org.apache.commons.io.input.CharSequenceReader! I'll update the TODO with this information, but I don't think we should pull in a dep on commons for this.

        Michael McCandless added a comment -

        I plan to commit the last patch soon ... thanks Artem!

        Artem Lukanin added a comment -

        Cool!

        Robert Muir added a comment -
        +  /** Include this flag in the options parameter to {@link
        +   *  #AnalyzingSuggester(Analyzer,Analyzer,int,int,int)} if
        +   *  you want your suggester to operate non-ASCII letters. */
        +  public static final int UNICODE_AWARE = 4;
        

        Errr ... this implies either that there is something wrong with AnalyzingSuggester, or that this option actually does something for AnalyzingSuggester, when in fact it only changes the behavior of FuzzySuggester.

        Can we fix this?

        Artem Lukanin added a comment -

        The javadocs are fixed.

        Michael McCandless added a comment -

        Maybe we should rename UNICODE_AWARE to FUZZY_UNICODE_AWARE? (Because AnalyzingSuggester itself is already unicode aware... so this flag only impacts FuzzySuggester.)

        Michael McCandless added a comment -

        Hmm also "ant precommit" is failing ...

        Artem Lukanin added a comment -

        In ant precommit I get this error:

        extra-target.xml:68: javax.script.ScriptException: javax.script.ScriptException: org.tmatesoft.svn.core.SVNException: svn: E155007: 'lucene-solr' is not a working copy

        I use git, not SVN, so I'm a bit confused...

        What error do you get?

        Michael McCandless added a comment -

        OK no problem, I can fix it. The javadocs linter is angry that isUnicodeAware has no text (only an @return) ...

        Michael McCandless added a comment -

        New patch, fixing the linter error, renaming UNICODE_AWARE -> FUZZY_UNICODE_AWARE, and fixing one compilation warning ... I think it's ready.

        Artem Lukanin added a comment -

        I have renamed the variables in comments and tests for consistency.

        Michael McCandless added a comment -

        Sorry for the long delay here ...

        Just to verify: there is no point in passing FUZZY_UNICODE_AWARE to AnalyzingSuggester, right?

        In which case, I think AnalyzingLookupFactory should not be changed?

        But, furthermore, I think we can isolate the changes to FuzzySuggester? E.g., move the FUZZY_UNICODE_AWARE flag down to FuzzySuggester, fix its ctor to strip that option when calling super(), move isFuzzyUnicodeAware down as well, and then override toLookupAutomaton to do the utf8 conversion + det?

        This way it's not even possible to send the fuzzy flag to AnalyzingSuggester.
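
        A sketch of that isolation (assuming an overridable convertAutomaton hook in AnalyzingSuggester; the final patch's signatures may differ):

          public class FuzzySuggester extends AnalyzingSuggester {
            /** Measure fuzzy edits in Unicode code points instead of UTF-8 bytes. */
            public static final int FUZZY_UNICODE_AWARE = 4;

            private final boolean unicodeAware;

            public FuzzySuggester(Analyzer analyzer, int options) {
              // Strip the fuzzy-only flag before handing options to the parent:
              super(analyzer, analyzer, options & ~FUZZY_UNICODE_AWARE, 256, -1);
              this.unicodeAware = (options & FUZZY_UNICODE_AWARE) != 0;
            }

            @Override
            protected Automaton convertAutomaton(Automaton a) {
              if (!unicodeAware) {
                return a;
              }
              // Unicode -> UTF-8 conversion + determinization, only when requested:
              Automaton utf8 = new UTF32ToUTF8().convert(a);
              BasicOperations.determinize(utf8);
              return utf8;
            }
          }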

        Artem Lukanin added a comment -

        Then I have to override (and copy a lot of code in) getTokenStreamToAutomaton, lookup, and toFiniteStrings, and make a lot of private variables protected.
        I think this is not a good idea.

        Artem Lukanin added a comment -

        Moved the parameter from AnalyzingLookupFactory to FuzzyLookupFactory

        Artem Lukanin added a comment -

        Michael, I got your idea. I will refactor the code not to use FUZZY_UNICODE_AWARE in AnalyzingSuggester.

        Artem Lukanin added a comment -

        The code is refactored not to touch AnalyzingSuggester. Please review.

        Michael McCandless added a comment -

        Patch looks great! Thanks Artem. No more mixing in of fuzzy-ness into AnalyzingSuggester.

        It looks like we are doing the utf8 conversion + det twice per lookup, once in convertAutomaton and once in getFullPrefixPaths. But I think this is inevitable: the first conversion is on the "straight" automaton, for the exactFirst match, and the second one is on the lev automaton, for non-exactFirst.

        Really we should only do the first convertAutomaton if exactFirst is true ... but this is an optimization so we don't need to fix it now.

        ASF subversion and git services added a comment -

        Commit 1504490 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1504490 ]

        LUCENE-5030: FuzzySuggester can optionally measure edits in Unicode code points instead of UTF8 bytes

        ASF subversion and git services added a comment -

        Commit 1504492 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1504492 ]

        LUCENE-5030: FuzzySuggester can optionally measure edits in Unicode code points instead of UTF8 bytes

        Michael McCandless added a comment -

        OK I committed the last patch with a few small fixes:

        • Added @lucene.experimental to FuzzySuggester
        • Removed the added ctor (so we have just two ctors: the easy one,
          which uses all defaults, and the expert one, where you specify
          everything)
        • Removed System.out.printlns from the test

        Thanks Artem!

        Uwe Schindler added a comment -

        JUHUUUU! Thanks for heavy committing - it took a long time, but now it is good! Many thanks, Uwe

        Artem Lukanin added a comment -

        Great! Thanks for reviewing.

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee: Michael McCandless
          • Reporter: Artem Lukanin
          • Votes: 0
          • Watchers: 5
