Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Would be nice to have a suggester that can handle some fuzziness (like spell correction) so that it's able to suggest completions that are "near" what you typed.

      As a first go at this, I implemented 1T (ie up to 1 edit, including a transposition), except the first letter must be correct.

      But there is a penalty, ie, the "corrected" suggestion needs to have a much higher freq than the "exact match" suggestion before it can compete.

      Still tons of nocommits, and somehow we should merge this / make it work with analyzing suggester too (LUCENE-3842).

      1. LUCENE-3846.patch
        77 kB
        Michael McCandless
      2. LUCENE-3846.patch
        76 kB
        Michael McCandless
      3. LUCENE-3846.patch
        25 kB
        Michael McCandless
      4. LUCENE-3846.patch
        60 kB
        Simon Willnauer
      5. LUCENE-3846.patch
        60 kB
        Simon Willnauer
      6. LUCENE-3846_fuzzy_analyzing.patch
        11 kB
        Robert Muir
      7. LUCENE-3846.patch
        55 kB
        Simon Willnauer
      8. LUCENE-3846.patch
        55 kB
        Michael McCandless

        Activity

        Michael McCandless created issue -
        Hide
        Michael McCandless added a comment -

        Patch, very much work in progress... there are some hard things to work out still.

        Show
        Michael McCandless added a comment - Patch, very much work in progress... there are some hard things to work out still.
        Michael McCandless made changes -
        Field Original Value New Value
        Attachment LUCENE-3846.patch [ 12517003 ]
        Hide
        Eks Dev added a comment -

        awesome! FST/A went a long way.

        Just a few random toughs, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"..."

        Frequency influence is normally slightly more complicated than "only more popular", depending on search task user is facing. Only more popular helps if we assume user types it wrong and our suggestions dictionary is always right. But in cases where you have user who types it correctly, and collection contains errors you would cut all documents with "fuzzy".

        What I found works pretty good is considering this problem to be of nearest neighbor type. Namely,
        task is to find closest matches to the query. Some are more and some less popular. Take for example a case where user types "black dog" and our collection contains document "blaKC dog", having frequency of blakc much lower than black, "only more popular" would miss this document.

        What works out of the box pretty good is comparing frequency of query word and "candidate" to some reasonable cut-off and classifying them to "HF"/"LF" (high/low frequency) terms. It is based on the fact that typos are normally very seldom (if not, they should be treated as synonyms!). So if user types LF token, probably fuzzy candidate would be HF, and the other way around.

        But as said, it depends what the task is.

        Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in

        • "inserting/deleting silent h should cost less than other letters (thomas vs thomas)"
        • "Phonetics, swap "c" <-> "k" is less evil than default"
        • "inserting s at the end... bug vs bugs"

        Apart from that, I see absolutely nothing more one on earth can do better

        Sorry again for just shooting around with "wish lists" at you guys, my time-schedule really does not permit any serious work in form of patches.

        Show
        Eks Dev added a comment - awesome! FST/A went a long way. Just a few random toughs, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"..." Frequency influence is normally slightly more complicated than "only more popular", depending on search task user is facing. Only more popular helps if we assume user types it wrong and our suggestions dictionary is always right. But in cases where you have user who types it correctly, and collection contains errors you would cut all documents with "fuzzy". What I found works pretty good is considering this problem to be of nearest neighbor type. Namely, task is to find closest matches to the query. Some are more and some less popular. Take for example a case where user types "black dog" and our collection contains document "blaKC dog", having frequency of blakc much lower than black, "only more popular" would miss this document. What works out of the box pretty good is comparing frequency of query word and "candidate" to some reasonable cut-off and classifying them to "HF"/"LF" (high/low frequency) terms. It is based on the fact that typos are normally very seldom (if not, they should be treated as synonyms!). So if user types LF token, probably fuzzy candidate would be HF, and the other way around. But as said, it depends what the task is. Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in "inserting/deleting silent h should cost less than other letters (thomas vs thomas)" "Phonetics, swap "c" <-> "k" is less evil than default" "inserting s at the end... bug vs bugs" Apart from that, I see absolutely nothing more one on earth can do better Sorry again for just shooting around with "wish lists" at you guys, my time-schedule really does not permit any serious work in form of patches.
        Hide
        Eks Dev added a comment -

        awesome! FST/A went a long way.

        Just a few random toughs, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"..."

        Frequency influence is normally slightly more complicated than "only more popular", depending on search task user is facing. Only more popular helps if we assume user types it wrong and our suggestions dictionary is always right. But in cases where you have user who types it correctly, and collection contains errors you would cut all documents with "fuzzy".

        What I found works pretty good is considering this problem to be of nearest neighbor type. Namely,
        task is to find closest matches to the query. Some are more and some less popular. Take for example a case where user types "black dog" and our collection contains document "blaKC dog", having frequency of blakc much lower than black, "only more popular" would miss this document.

        What works out of the box pretty good is comparing frequency of query word and "candidate" to some reasonable cut-off and classifying them to "HF"/"LF" (high/low frequency) terms. It is based on the fact that typos are normally very seldom (if not, they should be treated as synonyms!). So if user types LF token, probably fuzzy candidate would be HF, and the other way around.

        But as said, it depends what the task is.

        Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in

        • "inserting/deleting silent h should cost less than other letters (thomas vs thomas)"
        • "Phonetics, swap "c" <-> "k" is less evil than default"
        • "inserting s at the end... bug vs bugs"

        Apart from that, I see absolutely nothing more one on earth can do better

        Sorry again for just shooting around with "wish lists" at you guys, my time-schedule really does not permit any serious work in form of patches.

        Show
        Eks Dev added a comment - awesome! FST/A went a long way. Just a few random toughs, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"..." Frequency influence is normally slightly more complicated than "only more popular", depending on search task user is facing. Only more popular helps if we assume user types it wrong and our suggestions dictionary is always right. But in cases where you have user who types it correctly, and collection contains errors you would cut all documents with "fuzzy". What I found works pretty good is considering this problem to be of nearest neighbor type. Namely, task is to find closest matches to the query. Some are more and some less popular. Take for example a case where user types "black dog" and our collection contains document "blaKC dog", having frequency of blakc much lower than black, "only more popular" would miss this document. What works out of the box pretty good is comparing frequency of query word and "candidate" to some reasonable cut-off and classifying them to "HF"/"LF" (high/low frequency) terms. It is based on the fact that typos are normally very seldom (if not, they should be treated as synonyms!). So if user types LF token, probably fuzzy candidate would be HF, and the other way around. But as said, it depends what the task is. Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in "inserting/deleting silent h should cost less than other letters (thomas vs thomas)" "Phonetics, swap "c" <-> "k" is less evil than default" "inserting s at the end... bug vs bugs" Apart from that, I see absolutely nothing more one on earth can do better Sorry again for just shooting around with "wish lists" at you guys, my time-schedule really does not permit any serious work in form of patches.
        Hide
        Robert Muir added a comment -

        Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in

        Its probably "good enough" for this suggester to allow someone to re-rank their top-N with a StringDistance http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/spell/StringDistance.java

        such things are language and domain-specific and i think just adding this pluggability will work pretty well, rather than trying to complicate the actual intersection algorithm (which will ultimately never satisfy everyone anyway).

        the default can be "internal metric" which means no re-ranking at all. This is how DirectSpellChecker works for example.

        Show
        Robert Muir added a comment - Next level for "fuzzy *" in Lucene is going into specifying separate costs for Inserts/deletes, swaps and transpositions at character(byte) level and optionally considering position of edit. This brings precision++ if used properly, like in Its probably "good enough" for this suggester to allow someone to re-rank their top-N with a StringDistance http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/spell/StringDistance.java such things are language and domain-specific and i think just adding this pluggability will work pretty well, rather than trying to complicate the actual intersection algorithm (which will ultimately never satisfy everyone anyway). the default can be "internal metric" which means no re-ranking at all. This is how DirectSpellChecker works for example.
        Hide
        Eks Dev added a comment -

        sure as hell, re-ranking covers most of the cases. If you are not saturating top-N depth, it works just perfect, but if you are saturating top-n, you have to increase depth / number of allowed edits, which in turn hurts performance...

        "rather than trying to complicate the actual intersection algorithm"

        The logic in intersection algorithm would not have to know anything about the language specifics, it would be defined in cost matrix. But suporting cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to configuration of costs in matrix

        Show
        Eks Dev added a comment - sure as hell, re-ranking covers most of the cases. If you are not saturating top-N depth, it works just perfect, but if you are saturating top-n, you have to increase depth / number of allowed edits, which in turn hurts performance... "rather than trying to complicate the actual intersection algorithm" The logic in intersection algorithm would not have to know anything about the language specifics, it would be defined in cost matrix. But suporting cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to configuration of costs in matrix
        Hide
        Robert Muir added a comment -

        The logic in intersection algorithm would not have to know anything about the language specifics, it would be defined in cost matrix. But suporting cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to configuration of costs in matrix

        Like i said, it wont satisfy everyone. Lots of people are going to want ranking thats way more complex than just a cost matrix anyway,
        e.g. works on context, or phonemes, or other things.

        So i think its good to just plug in the re-ranking so people can write whatever StringDistance they want and call it a day.

        Personally i dont think context-free single-character cost-matrixes really help myself, feel free to show me evidence they do

        Show
        Robert Muir added a comment - The logic in intersection algorithm would not have to know anything about the language specifics, it would be defined in cost matrix. But suporting cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to configuration of costs in matrix Like i said, it wont satisfy everyone. Lots of people are going to want ranking thats way more complex than just a cost matrix anyway, e.g. works on context, or phonemes, or other things. So i think its good to just plug in the re-ranking so people can write whatever StringDistance they want and call it a day. Personally i dont think context-free single-character cost-matrixes really help myself, feel free to show me evidence they do
        Hide
        Michael McCandless added a comment -

        This is a nice benefit of the path-based best-first search (in the patch): it's easy to use a custom cost matrix. The cost can also be context-dependent too (based on past matched characters, though not [easily] future ones).

        We don't need to explore that now, before committing this, but it's nice that we'll have the freedom to do so later.

        Re-ranking definitely adds cost since you'll have to pull a bigger N. I think we'll likely have to somehow combine the cost and "isFuzzy" into a single cost, during the search, not after (reranking). Not sure how to do that yet...

        Show
        Michael McCandless added a comment - This is a nice benefit of the path-based best-first search (in the patch): it's easy to use a custom cost matrix. The cost can also be context-dependent too (based on past matched characters, though not [easily] future ones). We don't need to explore that now, before committing this, but it's nice that we'll have the freedom to do so later. Re-ranking definitely adds cost since you'll have to pull a bigger N. I think we'll likely have to somehow combine the cost and "isFuzzy" into a single cost, during the search, not after (reranking). Not sure how to do that yet...
        Hide
        Eks Dev added a comment -

        feel free to show me evidence they do

        Even here they help a lot, do not underestimate error model! (as in noisy channel, see http://norvig.com/spell-correct.html for a nice overview).

        Examples, off the top of my head:
        in a case you search for Carin in a set

        {Karin, Marin, Darin}

        , (All valid names, at edit distance one) you would prefer to see Karin as a highest (to the only one) ranked fuzzy suggestion. (close consonants).

        Or discount on swap(vowel ,vowel) vs swap(vowel/consonant, consonant). Mistaking one vowel for another is more probable than mistaking two consonants or consonant and vowel (as long as humans type).

        Books, scanned using OCR have no problems with phonetics, but other...

        Context is important, in-word context as part of "error model" (character level context, like previous character) but even more important is the context from the "language model", that normally dominates.

        I could look for some interesting papers in my archives if you are not convinced yet
        This one is worth reading (http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf), tackles, among other things, exactly this topic.

        it's easy to use a custom cost matrix. The cost can also be context-dependent too (based on past matched characters, though not [easily] future ones).

        Great to hear that!
        prefix based context is the only context at sub-word level I ever used. I doubt lookahead brings something.

        Show
        Eks Dev added a comment - feel free to show me evidence they do Even here they help a lot, do not underestimate error model! (as in noisy channel, see http://norvig.com/spell-correct.html for a nice overview). Examples, off the top of my head: in a case you search for Carin in a set {Karin, Marin, Darin} , (All valid names, at edit distance one) you would prefer to see Karin as a highest (to the only one) ranked fuzzy suggestion. (close consonants). Or discount on swap(vowel ,vowel) vs swap(vowel/consonant, consonant). Mistaking one vowel for another is more probable than mistaking two consonants or consonant and vowel (as long as humans type). Books, scanned using OCR have no problems with phonetics, but other... Context is important, in-word context as part of "error model" (character level context, like previous character) but even more important is the context from the "language model", that normally dominates. I could look for some interesting papers in my archives if you are not convinced yet This one is worth reading ( http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf ), tackles, among other things, exactly this topic. it's easy to use a custom cost matrix. The cost can also be context-dependent too (based on past matched characters, though not [easily] future ones). Great to hear that! prefix based context is the only context at sub-word level I ever used. I doubt lookahead brings something.
        Hide
        Robert Muir added a comment -

        I could look for some interesting papers in my archives if you are not convinced yet

        none of that is evidence, those are just examples, I could list examples where
        a cost table hurts, too.

        Evidence is some actual experiment with results showing some cost table has better results.

        Show
        Robert Muir added a comment - I could look for some interesting papers in my archives if you are not convinced yet none of that is evidence, those are just examples, I could list examples where a cost table hurts, too. Evidence is some actual experiment with results showing some cost table has better results.
        Hide
        Eks Dev added a comment -

        Sure, give me enough annotated data and I can give you "close to optimal" cost matrix. There are (rather simple) ways to estimate these costs.

        Or you are trying to argument there is no cost table better than the one filled with ones?

        Show
        Eks Dev added a comment - Sure, give me enough annotated data and I can give you "close to optimal" cost matrix. There are (rather simple) ways to estimate these costs. Or you are trying to argument there is no cost table better than the one filled with ones?
        Hide
        Robert Muir added a comment -

        Or you are trying to argument there is no cost table better than the one filled with ones?

        Yes. there is no evidence to the contrary, so why make things complicated.

        Show
        Robert Muir added a comment - Or you are trying to argument there is no cost table better than the one filled with ones? Yes. there is no evidence to the contrary, so why make things complicated.
        Hide
        Eks Dev added a comment -

        Robert,
        I am not talking from some abstract-theoretical point of view, I made my own experience on nontrivial Lucene datasets that are unfortunately not for sharing. Having possibility to train cost matrices per edit operation brings a lot, but you may have had another experience (different problems, different data...).

        Without specifying concrete task (annotated data), there is no notion of "better", so this argument simply does not help ("show me it is better", "no you show me all ones matrix is better than any other", "no, no..."). It is simply about the experience we made in the past, different opinions.

        I personally would not try this argument with molecular biology teams, and tell them their POM and BLOSUM matrices are worthless or to someone in record linkage community (Lucene was used in this context a lot) or ...

        Show
        Eks Dev added a comment - Robert, I am not talking from some abstract-theoretical point of view, I made my own experience on nontrivial Lucene datasets that are unfortunately not for sharing. Having possibility to train cost matrices per edit operation brings a lot, but you may have had another experience (different problems, different data...). Without specifying concrete task (annotated data), there is no notion of "better", so this argument simply does not help ("show me it is better", "no you show me all ones matrix is better than any other", "no, no..."). It is simply about the experience we made in the past, different opinions. I personally would not try this argument with molecular biology teams, and tell them their POM and BLOSUM matrices are worthless or to someone in record linkage community (Lucene was used in this context a lot) or ...
        Hide
        Robert Muir added a comment -

        I personally would not try this argument with molecular biology teams, and tell them their POM and BLOSUM matrices are worthless or to someone in record linkage community (Lucene was used in this context a lot) or ...

        Thats ok, I would

        I don't think we should complicate already-complicated things unless there is some clear benefit.

        I'm not worried about offending anyone.

        Show
        Robert Muir added a comment - I personally would not try this argument with molecular biology teams, and tell them their POM and BLOSUM matrices are worthless or to someone in record linkage community (Lucene was used in this context a lot) or ... Thats ok, I would I don't think we should complicate already-complicated things unless there is some clear benefit. I'm not worried about offending anyone.
        Hide
        Simon Willnauer added a comment -

        I applied this and fixed some compile errors while reading through the code. I also simplified the build() method and removed some minor things.

        basically updated to trunk

        Show
        Simon Willnauer added a comment - I applied this and fixed some compile errors while reading through the code. I also simplified the build() method and removed some minor things. basically updated to trunk
        Simon Willnauer made changes -
        Attachment LUCENE-3846.patch [ 12517548 ]
        Hide
        Michael McCandless added a comment -

        Thanks Simon!

        Show
        Michael McCandless added a comment - Thanks Simon!
        Michael McCandless made changes -
        Fix Version/s 3.6 [ 12319070 ]
        Robert Muir made changes -
        Fix Version/s 4.1 [ 12321140 ]
        Fix Version/s 4.0 [ 12314025 ]
        Hide
        Robert Muir added a comment -

        here's my hacky Fuzzy+Analyzing prototype.

        But we need to fix intersectPrefixPaths to be able to efficiently intersect transition ranges (e.g. findTargetArc + readNextArc through the range?).

        anyway we should see how slow this is compared to mike's: the advantage would be you would still get all the stuff AnalyzingSuggester has...

        Show
        Robert Muir added a comment - here's my hacky Fuzzy+Analyzing prototype. But we need to fix intersectPrefixPaths to be able to efficiently intersect transition ranges (e.g. findTargetArc + readNextArc through the range?). anyway we should see how slow this is compared to mike's: the advantage would be you would still get all the stuff AnalyzingSuggester has...
        Robert Muir made changes -
        Attachment LUCENE-3846_fuzzy_analyzing.patch [ 12547312 ]
        Hide
        Simon Willnauer added a comment -

        here is a patch that adds the missing intersect method and adds several tests derived from the AnalyzingSuggestorTest. The tests all pass at this point but I do get a weird failure if I run the benchmarks. somehow the TopNSearcher runs into a bad state which I can't really figure out.

        this patch has several refactorings in AnalyzingSuggestor mainly to make testing easier in the fuzzy case (encapuslated some stuff into package private methods etc.) Yet there are tons of nocommits but at least we have something working.

        Regarding the failure, I see a NoSuchELementException from the "queue" in the top N searcher that somehow removed the bottom and tries to pull the last element that doesn't exists. (stacktrace below) Yet, the funky thing is that this doesn't happen if I run this with exactFirst=false but the problem seems to be in the non-exactFirst part (see stacktrace). I use a direct intersection for exactFirst in the fuzzy case so that code is identical to analyzing suggestor since the intersection of the LD automaton doesn't return enough information to tell what is an exact match.

        here is the stacktrace:

        java.util.NoSuchElementException
        	at java.util.TreeMap.key(TreeMap.java:1206)
        	at java.util.TreeMap.lastKey(TreeMap.java:274)
        	at java.util.TreeSet.last(TreeSet.java:384)
        	at org.apache.lucene.util.fst.Util$TopNSearcher.addIfCompetitive(Util.java:339)
        	at org.apache.lucene.util.fst.Util$TopNSearcher.search(Util.java:453)
        	at org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.lookup(AnalyzingSuggester.java:581)
        	at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:228)
        	at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:1)
        	at org.apache.lucene.search.suggest.LookupBenchmarkTest.measure(LookupBenchmarkTest.java:253)
        	at org.apache.lucene.search.suggest.LookupBenchmarkTest.runPerformanceTest(LookupBenchmarkTest.java:224)
        	at org.apache.lucene.search.suggest.LookupBenchmarkTest.testPerformanceOnPrefixes6_9(LookupBenchmarkTest.java:192)
        NOTE: reproduce with: ant test  -Dtestcase=LookupBenchmarkTest -Dtests.method=testPerformanceOnPrefixes6_9 -Dtests.seed=B5BAF2A9592263BC -Dtests.locale=fi_FI -Dtests.timezone=Africa/Lagos -Dtests.file.encoding=UTF-8
        NOTE: test params are: codec=Lucene40: {}, sim=DefaultSimilarity, locale=fi_FI, timezone=Africa/Lagos
        NOTE: Linux 2.6.38-16-generic amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=12,threads=1,free=578809008,total=1539571712
        NOTE: All tests run in this JVM: [LookupBenchmarkTest]
        
        

        mike if you get a chance it would be great if you could look into that one?!

        Show
        Simon Willnauer added a comment - here is a patch that adds the missing intersect method and adds several tests derived from the AnalyzingSuggestorTest. The tests all pass at this point but I do get a weird failure if I run the benchmarks. somehow the TopNSearcher runs into a bad state which I can't really figure out. this patch has several refactorings in AnalyzingSuggestor mainly to make testing easier in the fuzzy case (encapuslated some stuff into package private methods etc.) Yet there are tons of nocommits but at least we have something working. Regarding the failure, I see a NoSuchELementException from the "queue" in the top N searcher that somehow removed the bottom and tries to pull the last element that doesn't exists. (stacktrace below) Yet, the funky thing is that this doesn't happen if I run this with exactFirst=false but the problem seems to be in the non-exactFirst part (see stacktrace). I use a direct intersection for exactFirst in the fuzzy case so that code is identical to analyzing suggestor since the intersection of the LD automaton doesn't return enough information to tell what is an exact match. here is the stacktrace: java.util.NoSuchElementException at java.util.TreeMap.key(TreeMap.java:1206) at java.util.TreeMap.lastKey(TreeMap.java:274) at java.util.TreeSet.last(TreeSet.java:384) at org.apache.lucene.util.fst.Util$TopNSearcher.addIfCompetitive(Util.java:339) at org.apache.lucene.util.fst.Util$TopNSearcher.search(Util.java:453) at org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.lookup(AnalyzingSuggester.java:581) at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:228) at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:1) at org.apache.lucene.search.suggest.LookupBenchmarkTest.measure(LookupBenchmarkTest.java:253) at org.apache.lucene.search.suggest.LookupBenchmarkTest.runPerformanceTest(LookupBenchmarkTest.java:224) at org.apache.lucene.search.suggest.LookupBenchmarkTest.testPerformanceOnPrefixes6_9(LookupBenchmarkTest.java:192) NOTE: reproduce with: ant test -Dtestcase=LookupBenchmarkTest -Dtests.method=testPerformanceOnPrefixes6_9 -Dtests.seed=B5BAF2A9592263BC -Dtests.locale=fi_FI -Dtests.timezone=Africa/Lagos -Dtests.file.encoding=UTF-8 NOTE: test params are: codec=Lucene40: {}, sim=DefaultSimilarity, locale=fi_FI, timezone=Africa/Lagos NOTE: Linux 2.6.38-16- generic amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=12,threads=1,free=578809008,total=1539571712 NOTE: All tests run in this JVM: [LookupBenchmarkTest] mike if you get a chance it would be great if you could look into that one?!
        Simon Willnauer made changes -
        Attachment LUCENE-3846.patch [ 12548760 ]
        Hide
        Simon Willnauer added a comment -

        updated patch - mike conflicted me in the way

        Show
        Simon Willnauer added a comment - updated patch - mike conflicted me in the way
        Simon Willnauer made changes -
        Attachment LUCENE-3846.patch [ 12548762 ]
        Hide
        Michael McCandless added a comment -

        New patch fixing that exc in benchmark: it's a pre-existing bug in how bottom is set ... if the queue has come empty we just have to set bottom to null.

        I think we should separately fix this... I'll commit (not sure why WFST/AnalyzingSuggester haven't hit this already). It only happens w/ exactFirst because this removes one of the competing topN paths from the queue, and then if there aren't enough suggestions remaining the queue empties before we find the topN results... I'll work up a test.

        Show
        Michael McCandless added a comment - New patch fixing that exc in benchmark: it's a pre-existing bug in how bottom is set ... if the queue has come empty we just have to set bottom to null. I think we should separately fix this... I'll commit (not sure why WFST/AnalyzingSuggester haven't hit this already). It only happens w/ exactFirst because this removes one of the competing topN paths from the queue, and then if there aren't enough suggestions remaining the queue empties before we find the topN results... I'll work up a test.
        Michael McCandless made changes -
        Attachment LUCENE-3846.patch [ 12548772 ]
        Hide
        Michael McCandless added a comment -

        Argh, I forgot to svn add files ... i'll make a branch and commit there.

        Show
        Michael McCandless added a comment - Argh, I forgot to svn add files ... i'll make a branch and commit there.
        Hide
        Michael McCandless added a comment -

        OK branch here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3846

        I committed the last patch.

        Show
        Michael McCandless added a comment - OK branch here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3846 I committed the last patch.
        Hide
        Simon Willnauer added a comment -

        just for the record here are my benchmark numbers for the latest branch code:

        Test class requires enabled assertions, enable globally (-ea) or for Solr/Lucene subpackages only: org.apache.lucene.search.suggest.LookupBenchmarkTest
        -- prefixes: 6-9, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 4650 [+- 12.56], ~kQPS: 11
        AnalyzingSuggester queries: 50001, time[ms]: 444 [+- 1.89], ~kQPS: 113
        JaspellLookup   queries: 50001, time[ms]: 181 [+- 0.96], ~kQPS: 275
        TSTLookup       queries: 50001, time[ms]: 229 [+- 2.35], ~kQPS: 218
        FSTCompletionLookup queries: 50001, time[ms]: 245 [+- 3.54], ~kQPS: 204
        WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 1.72], ~kQPS: 413
        -- prefixes: 100-200, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 5432 [+- 20.86], ~kQPS: 9
        AnalyzingSuggester queries: 50001, time[ms]: 403 [+- 1.47], ~kQPS: 124
        JaspellLookup   queries: 50001, time[ms]: 129 [+- 1.24], ~kQPS: 389
        TSTLookup       queries: 50001, time[ms]: 68 [+- 4.03], ~kQPS: 739
        FSTCompletionLookup queries: 50001, time[ms]: 254 [+- 2.60], ~kQPS: 197
        WFSTCompletionLookup queries: 50001, time[ms]: 82 [+- 1.03], ~kQPS: 610
        -- construction time
        FuzzySuggester  input: 50001, time[ms]: 450 [+- 1.86]
        AnalyzingSuggester input: 50001, time[ms]: 449 [+- 1.82]
        JaspellLookup   input: 50001, time[ms]: 40 [+- 3.80]
        TSTLookup       input: 50001, time[ms]: 111 [+- 3.33]
        FSTCompletionLookup input: 50001, time[ms]: 213 [+- 4.36]
        WFSTCompletionLookup input: 50001, time[ms]: 156 [+- 2.08]
        -- prefixes: 2-4, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 3571 [+- 12.15], ~kQPS: 14
        AnalyzingSuggester queries: 50001, time[ms]: 997 [+- 5.73], ~kQPS: 50
        JaspellLookup   queries: 50001, time[ms]: 494 [+- 2.25], ~kQPS: 101
        TSTLookup       queries: 50001, time[ms]: 1846 [+- 9.67], ~kQPS: 27
        FSTCompletionLookup queries: 50001, time[ms]: 221 [+- 1.57], ~kQPS: 227
        WFSTCompletionLookup queries: 50001, time[ms]: 457 [+- 9.05], ~kQPS: 109
        -- RAM consumption
        FuzzySuggester  size[B]:      889,138
        AnalyzingSuggester size[B]:      889,138
        JaspellLookup   size[B]:    9,815,128
        TSTLookup       size[B]:    9,858,792
        FSTCompletionLookup size[B]:      466,520
        WFSTCompletionLookup size[B]:      507,640
        
        Show
        Simon Willnauer added a comment - just for the record here are my benchmark numbers for the latest branch code: Test class requires enabled assertions, enable globally (-ea) or for Solr/Lucene subpackages only: org.apache.lucene.search.suggest.LookupBenchmarkTest -- prefixes: 6-9, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 4650 [+- 12.56], ~kQPS: 11 AnalyzingSuggester queries: 50001, time[ms]: 444 [+- 1.89], ~kQPS: 113 JaspellLookup queries: 50001, time[ms]: 181 [+- 0.96], ~kQPS: 275 TSTLookup queries: 50001, time[ms]: 229 [+- 2.35], ~kQPS: 218 FSTCompletionLookup queries: 50001, time[ms]: 245 [+- 3.54], ~kQPS: 204 WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 1.72], ~kQPS: 413 -- prefixes: 100-200, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 5432 [+- 20.86], ~kQPS: 9 AnalyzingSuggester queries: 50001, time[ms]: 403 [+- 1.47], ~kQPS: 124 JaspellLookup queries: 50001, time[ms]: 129 [+- 1.24], ~kQPS: 389 TSTLookup queries: 50001, time[ms]: 68 [+- 4.03], ~kQPS: 739 FSTCompletionLookup queries: 50001, time[ms]: 254 [+- 2.60], ~kQPS: 197 WFSTCompletionLookup queries: 50001, time[ms]: 82 [+- 1.03], ~kQPS: 610 -- construction time FuzzySuggester input: 50001, time[ms]: 450 [+- 1.86] AnalyzingSuggester input: 50001, time[ms]: 449 [+- 1.82] JaspellLookup input: 50001, time[ms]: 40 [+- 3.80] TSTLookup input: 50001, time[ms]: 111 [+- 3.33] FSTCompletionLookup input: 50001, time[ms]: 213 [+- 4.36] WFSTCompletionLookup input: 50001, time[ms]: 156 [+- 2.08] -- prefixes: 2-4, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 3571 [+- 12.15], ~kQPS: 14 AnalyzingSuggester queries: 50001, time[ms]: 997 [+- 5.73], ~kQPS: 50 JaspellLookup queries: 50001, time[ms]: 494 [+- 2.25], ~kQPS: 101 TSTLookup queries: 50001, time[ms]: 1846 [+- 9.67], ~kQPS: 27 FSTCompletionLookup queries: 50001, time[ms]: 221 [+- 1.57], ~kQPS: 227 WFSTCompletionLookup queries: 50001, time[ms]: 457 [+- 9.05], ~kQPS: 109 -- RAM consumption FuzzySuggester size[B]: 889,138 AnalyzingSuggester size[B]: 889,138 JaspellLookup size[B]: 9,815,128 TSTLookup size[B]: 9,858,792 FSTCompletionLookup size[B]: 466,520 WFSTCompletionLookup size[B]: 507,640
        Hide
        Simon Willnauer added a comment -

        here is another benchmark with minPrefix=3 instead of 1 this looks much better:

        -- prefixes: 6-9, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 2125 [+- 6.38], ~kQPS: 24
        AnalyzingSuggester queries: 50001, time[ms]: 452 [+- 3.61], ~kQPS: 111
        JaspellLookup   queries: 50001, time[ms]: 187 [+- 1.02], ~kQPS: 267
        TSTLookup       queries: 50001, time[ms]: 263 [+- 1.78], ~kQPS: 190
        FSTCompletionLookup queries: 50001, time[ms]: 269 [+- 1.59], ~kQPS: 186
        WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 0.75], ~kQPS: 414
        -- prefixes: 100-200, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 2778 [+- 16.56], ~kQPS: 18
        AnalyzingSuggester queries: 50001, time[ms]: 414 [+- 1.70], ~kQPS: 121
        JaspellLookup   queries: 50001, time[ms]: 133 [+- 1.85], ~kQPS: 376
        TSTLookup       queries: 50001, time[ms]: 69 [+- 3.41], ~kQPS: 724
        FSTCompletionLookup queries: 50001, time[ms]: 257 [+- 1.79], ~kQPS: 194
        WFSTCompletionLookup queries: 50001, time[ms]: 83 [+- 3.31], ~kQPS: 605
        -- prefixes: 2-4, num: 7, onlyMorePopular: true
        FuzzySuggester  queries: 50001, time[ms]: 1310 [+- 3.30], ~kQPS: 38
        AnalyzingSuggester queries: 50001, time[ms]: 995 [+- 8.03], ~kQPS: 50
        JaspellLookup   queries: 50001, time[ms]: 507 [+- 4.19], ~kQPS: 99
        TSTLookup       queries: 50001, time[ms]: 2148 [+- 16.63], ~kQPS: 23
        FSTCompletionLookup queries: 50001, time[ms]: 223 [+- 2.14], ~kQPS: 224
        WFSTCompletionLookup queries: 50001, time[ms]: 414 [+- 28.44], ~kQPS: 121
        
        Show
        Simon Willnauer added a comment - here is another benchmark with minPrefix=3 instead of 1 this looks much better: -- prefixes: 6-9, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 2125 [+- 6.38], ~kQPS: 24 AnalyzingSuggester queries: 50001, time[ms]: 452 [+- 3.61], ~kQPS: 111 JaspellLookup queries: 50001, time[ms]: 187 [+- 1.02], ~kQPS: 267 TSTLookup queries: 50001, time[ms]: 263 [+- 1.78], ~kQPS: 190 FSTCompletionLookup queries: 50001, time[ms]: 269 [+- 1.59], ~kQPS: 186 WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 0.75], ~kQPS: 414 -- prefixes: 100-200, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 2778 [+- 16.56], ~kQPS: 18 AnalyzingSuggester queries: 50001, time[ms]: 414 [+- 1.70], ~kQPS: 121 JaspellLookup queries: 50001, time[ms]: 133 [+- 1.85], ~kQPS: 376 TSTLookup queries: 50001, time[ms]: 69 [+- 3.41], ~kQPS: 724 FSTCompletionLookup queries: 50001, time[ms]: 257 [+- 1.79], ~kQPS: 194 WFSTCompletionLookup queries: 50001, time[ms]: 83 [+- 3.31], ~kQPS: 605 -- prefixes: 2-4, num: 7, onlyMorePopular: true FuzzySuggester queries: 50001, time[ms]: 1310 [+- 3.30], ~kQPS: 38 AnalyzingSuggester queries: 50001, time[ms]: 995 [+- 8.03], ~kQPS: 50 JaspellLookup queries: 50001, time[ms]: 507 [+- 4.19], ~kQPS: 99 TSTLookup queries: 50001, time[ms]: 2148 [+- 16.63], ~kQPS: 23 FSTCompletionLookup queries: 50001, time[ms]: 223 [+- 2.14], ~kQPS: 224 WFSTCompletionLookup queries: 50001, time[ms]: 414 [+- 28.44], ~kQPS: 121
        Hide
        Robert Muir added a comment -

        This looks really good (especially if it holds up for a larger fst!) Thanks for hacking
        at the intersectPrefixPaths to get us going.

        One caveat: the whole thing is optimized for the case where the "query analyzer"
        is very simple, particularly where there are no positionIncrement=0 tokens.

        So if you use a query analyzer with say a synonymsfilter, then you hit the
        "nocommit: how slow can this be?". We should maybe benchmark that. if its
        really horrible we could just throw an exception and document that you cannot
        use synonyms etc in your query analyzer (use them at index-time instead) to
        prevent performance traps...

        Show
        Robert Muir added a comment - This looks really good (especially if it holds up for a larger fst!) Thanks for hacking at the intersectPrefixPaths to get us going. One caveat: the whole thing is optimized for the case where the "query analyzer" is very simple, particularly where there are no positionIncrement=0 tokens. So if you use a query analyzer with say a synonymsfilter, then you hit the "nocommit: how slow can this be?". We should maybe benchmark that. if its really horrible we could just throw an exception and document that you cannot use synonyms etc in your query analyzer (use them at index-time instead) to prevent performance traps...
        Hide
        Simon Willnauer added a comment -

        FYI - I cleaned up the code & tests, synced with trunk, removed no-commits, enabled assertions again and documented the expensiveness of complex query analyzers. From my perspective this is good to go but we should run another round of reviews.

        Show
        Simon Willnauer added a comment - FYI - I cleaned up the code & tests, synced with trunk, removed no-commits, enabled assertions again and documented the expensiveness of complex query analyzers. From my perspective this is good to go but we should run another round of reviews.
        Hide
        Michael McCandless added a comment -

        I think this is ready! I'm attaching the reviewable patch w/ diffs on the branch vs trunk ...

        Show
        Michael McCandless added a comment - I think this is ready! I'm attaching the reviewable patch w/ diffs on the branch vs trunk ...
        Michael McCandless made changes -
        Attachment LUCENE-3846.patch [ 12551195 ]
        Hide
        Simon Willnauer added a comment -

        cool man it looks good. we need a changes entry but from my side this looks good. we can tackle the todos on trunk

        Show
        Simon Willnauer added a comment - cool man it looks good. we need a changes entry but from my side this looks good. we can tackle the todos on trunk
        Hide
        Robert Muir added a comment -

        I think we might want to beast the tests on this before merging!

        Show
        Robert Muir added a comment - I think we might want to beast the tests on this before merging!
        Hide
        Michael McCandless added a comment -

        New patch, w/ CHANGES entry, fixing a couple issues Robert spotted (thanks!).

        Show
        Michael McCandless added a comment - New patch, w/ CHANGES entry, fixing a couple issues Robert spotted (thanks!).
        Michael McCandless made changes -
        Attachment LUCENE-3846.patch [ 12551210 ]
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 5.0 [ 12321663 ]
        Resolution Fixed [ 1 ]
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1403780

        LUCENE-3846: add new FuzzySuggester

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1403780 LUCENE-3846 : add new FuzzySuggester
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development