Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      We can optimize FuzzyQuery by using AutomatonTermsEnum. The idea is to speed up the core FuzzyQuery in the same fashion as the Wildcard and Regex speedups, while maintaining full backwards compatibility.

      The advantages are:

      • we can seek to terms that are useful, instead of brute-forcing the entire terms dictionary
      • we can determine matches faster: accept/reject from a DFA is just an array lookup, so we never even need to run the Levenshtein algorithm itself (see the sketch below).
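
      As a rough illustration of the array-lookup point, a DFA membership test is one table lookup per character (a minimal sketch; the tables here are hypothetical, not Lucene's actual representation):

          // Minimal DFA membership test: one array lookup per character.
          // 'transitions' and 'accept' are hypothetical precomputed tables.
          boolean run(int[][] transitions, boolean[] accept, String term) {
            int state = 0; // start state
            for (int i = 0; i < term.length(); i++) {
              state = transitions[state][term.charAt(i)];
              if (state == -1) {
                return false; // dead state: no completion can match
              }
            }
            return accept[state];
          }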

      We build Levenshtein DFAs in linear time with respect to the length of the word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

      To implement support for a 'prefix' length, we simply concatenate two DFAs. This doesn't require an NFA->DFA conversion, because the prefix portion is a singleton. The concatenation is also constant time with respect to the size of the fuzzy DFA; it only needs to examine its start state.
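
      A minimal sketch of the prefix trick, assuming the flex-branch automaton API (BasicAutomata/BasicOperations) and the LevenshteinAutomata builder this issue adds:

          import org.apache.lucene.util.automaton.Automaton;
          import org.apache.lucene.util.automaton.BasicAutomata;
          import org.apache.lucene.util.automaton.BasicOperations;
          import org.apache.lucene.util.automaton.LevenshteinAutomata;

          // The prefix automaton is a singleton (it accepts exactly one string),
          // so concatenating it with the Levenshtein DFA needs no determinization.
          Automaton prefix = BasicAutomata.makeString("foo");            // constant prefix
          Automaton lev = new LevenshteinAutomata("bar").toAutomaton(2); // distance <= 2
          Automaton fuzzy = BasicOperations.concatenate(prefix, lev);    // still a DFA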

      With this algorithm, parametric tables are precomputed so that DFAs can be constructed very quickly.
      If the required number of edits is too large (we don't have a table for it), we start in "dumb mode" (no seeking, no DFA, just brute force as before).

      As the priority queue fills up during enumeration, the similarity score required for a term to be competitive increases, so the enum gets faster and faster as this happens. This is because terms in core FuzzyQuery are sorted by boost value, then by term (in lexicographic order).

      For a large term dictionary with a low minimum similarity, you will fill the pq very quickly, since you will match many terms.
      This provides a mechanism not only to switch to more efficient DFAs during enumeration (edit distance 2 -> edit distance 1 -> edit distance 0), but also to switch from "dumb mode" to "smart mode".
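
      As an illustration, assuming the classic FuzzyQuery score of 1 - editDistance / min(len1, len2) (the names here are illustrative, not the patch's actual code), the enum can derive the largest edit distance that can still compete from the boost at the bottom of the pq:

          // Invert 1 - k/minLen >= competitiveBoost to get the largest competitive k.
          int maxCompetitiveDistance(float competitiveBoost, int minTermLength) {
            return (int) ((1.0f - competitiveBoost) * minTermLength);
          }
          // e.g. once only boosts >= 0.73 are competitive for 7-char terms,
          // maxCompetitiveDistance(0.73f, 7) == 1, so the enum can swap in the Lev1 DFA.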

      With this design, we can add more DFAs at any time by adding additional tables. The tradeoff is that the tables get rather large, so for very high K we would start to increase the size of Lucene's jar file. The idea is that we don't have to include large tables for very high K, by using the 'competitive boost' attribute of the priority queue.

      For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton

      1. TestFuzzy.java
        3 kB
        Robert Muir
      2. moman-57f5dc9dd0e7.diff
        3 kB
        Robert Muir
      3. Moman-0.2.1.tar.gz
        18 kB
        Mark Miller
      4. LUCENE-2089.patch
        21 kB
        Robert Muir
      5. LUCENE-2089.patch
        28 kB
        Robert Muir
      6. LUCENE-2089.patch
        29 kB
        Robert Muir
      7. LUCENE-2089.patch
        29 kB
        Robert Muir
      8. LUCENE-2089.patch
        30 kB
        Robert Muir
      9. LUCENE-2089.patch
        257 kB
        Robert Muir
      10. LUCENE-2089.patch
        180 kB
        Robert Muir
      11. LUCENE-2089.patch
        215 kB
        Michael McCandless
      12. LUCENE-2089.patch
        225 kB
        Michael McCandless
      13. LUCENE-2089.patch
        58 kB
        Robert Muir
      14. LUCENE-2089.patch
        221 kB
        Michael McCandless
      15. LUCENE-2089.patch
        224 kB
        Robert Muir
      16. LUCENE-2089.patch
        224 kB
        Robert Muir
      17. LUCENE-2089.patch
        63 kB
        Robert Muir
      18. LUCENE-2089.patch
        66 kB
        Robert Muir
      19. LUCENE-2089.patch
        108 kB
        Robert Muir
      20. LUCENE-2089_concat.patch
        1 kB
        Robert Muir
      21. Lev2ParametricDescription.java
        212 kB
        Michael McCandless
      22. Lev2ParametricDescription.java
        213 kB
        Michael McCandless
      23. Lev2ParametricDescription.java
        214 kB
        Michael McCandless
      24. Lev2ParametricDescription.java
        214 kB
        Michael McCandless
      25. gen.py
        3 kB
        Michael McCandless
      26. gen.py
        5 kB
        Michael McCandless
      27. gen.py
        5 kB
        Michael McCandless
      28. gen.py
        5 kB
        Michael McCandless
      29. gen.py
        5 kB
        Robert Muir
      30. gen.py
        7 kB
        Michael McCandless
      31. createLevAutomata.py
        14 kB
        Michael McCandless
      32. ContrivedFuzzyBenchmark.java
        5 kB
        Robert Muir

        Issue Links

          Activity

          Michael McCandless added a comment -

          Yay

          Robert Muir added a comment -

          Committed revision 921820.

          Michael McCandless added a comment -

          YEAH!

          Uwe Schindler added a comment -

          Fantastic!

          Mark Miller added a comment -

          Sweet!

          Robert Muir added a comment -

          I think this patch is ready.

          Will commit in a few days (to the heavy committing branch) if no one objects.

          Robert Muir added a comment -

          Updated patch:

          • optimizations for small indexes
          • improved tests

          The improved test reads in an input file of queries.
          It more-or-less brute-forces all fuzzy parameters for binary strings of 2^n.

          The test data included was generated from trunk (it's only for 2^3).
          I ran much larger n against both trunk and this patch to ensure correctness.
          The code to generate the file is included in the test, in case we need to change it.
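
          A sketch of that kind of exhaustive setup (illustrative only; the committed test code differs):

              // Index every binary string of length n, then brute-force fuzzy
              // parameters (minSim, prefix length, pq size) against all of them.
              for (int i = 0; i < (1 << n); i++) {
                String term = Integer.toBinaryString(i);
                while (term.length() < n) {
                  term = "0" + term; // left-pad to fixed width n
                }
                // index 'term' here, then compare old vs. new FuzzyQuery results...
              }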

          Robert Muir added a comment -

          Updated patch, regenerated with Mike's update to the Python script.
          I also added an optional ant task, createLevAutomata, to build.xml.
          This does a Mercurial clone/pull of rev 115 and regenerates the tables.

          Robert Muir added a comment -

          Thanks Mike, I'll regen tonight and pass thru the code another time looking for more cleanups, and upload a new patch.

          Michael McCandless added a comment -

          Added usage line, fixed missing space in license header, added URLs referencing moman/finenight package, always append L in the long[] static values.

          Robert Muir added a comment -

          first shot at a committable patch. please review.

          Robert Muir added a comment -

          So eg maybe they could check in a LICENSE file somewhere?

          fyi, the MIT LICENSE.TXT was just committed to the src repo:
          http://bitbucket.org/jpbarrette/moman/changeset/ef6f18799428/raw/moman-ef6f18799428.diff

          Michael McCandless added a comment -

          Mercurial (hg) is just the source control system they use. It competes with git, but it (hg) is written in Python.

          So eg maybe they could check in a LICENSE file somewhere?

          Robert Muir added a comment -

          I don't see the [MIT] license on that public URL?

          I do not even know Mercurial enough to know if such a thing is possible?

          Michael McCandless added a comment -

          This is awesome progress Robert!!

          I also applied the patch and see the tests now passing.

          I don't see the [MIT] license on that public URL?

          Robert Muir added a comment -

          attached is the patch the author provided to the moman source code that fixes the bug.

          Robert Muir added a comment -

          the author responded with a bugfix to moman.
          additionally, he made his code public at http://bitbucket.org/jpbarrette/moman/overview/

          attached is a patch, regen'ed code from his bugfix.
          all Lev1, Lev2, and Lev3 tests pass.

          Robert Muir added a comment - edited

          this is mike's patch, except with the additional tests (that fail for lev2/lev3 due to the off-by-one bug).

          edit: found the bug; it is in moman, and the author is taking a look. The problem was that I ran tests against moman with Oflazer (a slower alternative algorithm), and the bug in moman doesn't happen there.

          Michael McCandless added a comment -

          New patch w/ fixes to not overallocate states.

          Robert Muir added a comment -

          Attached is an updated patch; I left the Lev3Parametric out for simplicity.

          i added junit testing to exhaustively test all possible characteristic vectors.
          our n=1 is fine, there is a bug in our n=2 generation.

          Robert Muir added a comment -

          Mike, this is awesome. I ran benchmarks: we are just as fast as before (with only Lev1 and Lev2 enabled), but with smaller generated code.
          When I turn on Lev3, it speeds up the worst-case ones (no prefix, pq=1024, fuzzy of n=3, n=4), but slows down some of the "better-case" n=3/n=4 cases where there is a prefix or a small PQ.

          I think this is because the benchmark is contrived, but realistically n=3 (with seeking!) should be a win for users. A less-contrived benchmark (a 'typical' massive term dictionary) would help for tuning.

          separately, I think we can add heuristics: e.g. for n > 3 WITH a prefix, use the DFA in "linear mode" until you drop to n=2, as you already have a nice prefix anyway, stuff like that. But if the user doesn't supply a prefix, i think seeking is always a win.

          Here are the results anyway: I ran it many times and it's consistent (obviously differences of just a few ms are not significant). I bolded the ones I think illustrate the differences I am talking about.

          It's cool to be at the point where we are actually able to measure these kinds of tradeoffs!

          Minimum Sim = 0.73f (edit distance of 1)

          Prefix Length PQ Size Avg MS (flex trunk) Avg MS (1,2) Avg MS (1,2,3)
          0 1024 3286.0 7.8 7.6
          0 64 3320.4 7.6 8.0
          1 1024 316.8 5.6 5.3
          1 64 314.3 5.6 5.2
          2 1024 31.8 3.8 4.2
          2 64 31.9 3.7 4.5

          Minimum Sim = 0.58f (edit distance of 2)

          Prefix Length PQ Size Avg MS (flex trunk) Avg MS (1,2) Avg MS (1,2,3)
          0 1024 4223.3 87.7 91.2
          0 64 4199.7 12.6 13.2
          1 1024 430.1 56.4 62.0
          1 64 392.8 9.3 8.5
          2 1024 82.5 45.5 48.0
          2 64 38.4 6.2 6.3

          Minimum Sim = 0.43f (edit distance of 3)

          Prefix Length PQ Size Avg MS (flex trunk) Avg MS (1,2) Avg MS (1,2,3)
          0 1024 5299.9 424.0 199.8
          0 64 5231.8 54.1 93.2
          1 1024 522.9 103.6 107.9
          1 64 480.9 14.5 49.3
          2 1024 89.0 67.9 70.8
          2 64 46.3 6.8 19.7

          Minimum Sim = 0.29f (edit distance of 4)

          Prefix Length PQ Size Avg MS (flex trunk) Avg MS (1,2) Avg MS (1,2,3)
          0 1024 6258.1 363.7 206.5
          0 64 6247.6 75.6 78.8
          1 1024 609.9 108.3 110.0
          1 64 567.1 13.3 45.5
          2 1024 98.6 66.6 73.8
          2 64 55.6 6.8 22.3
          Michael McCandless added a comment -

          New patch... just fixes a few small things Uwe noticed (moved unpack method & MASKS up to super class; use newlines to make the massive tables a bit more friendly to look at).

          Michael McCandless added a comment -

          New rev of gen.py, that uses packed arrays for the states/offsets.

          It's much more compact – Lev1 is now 5 KB, Lev2 is 11 KB, Lev3 is 160 KB. And Lev3 compiles! (Robert, now you need a test case for Lev3.) The class files are OK too: Lev1 is 3.9 KB, Lev2 is 7.3 KB, Lev3 is 102 KB.
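
          The packing idea, roughly (the bit widths here are illustrative; the generator sizes them to the actual tables):

              // Pack a parametric state and its word offset into one int, so the
              // transition tables can live in flat primitive arrays.
              static final int OFFSET_BITS = 16;
              static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1;

              static int pack(int state, int offset) {
                return (state << OFFSET_BITS) | offset;
              }

              static int unpackState(int packed) {
                return packed >>> OFFSET_BITS;
              }

              static int unpackOffset(int packed) {
                return packed & OFFSET_MASK;
              }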

          Robert Muir added a comment -

          attached is an updated (smaller) patch. the generated files are re-generated with Mike's fallthru optimization.

          Michael McCandless added a comment -

          Collapse switch cases that have the same action (just fall through). Also gen directly to the right location, and handle when the Moman package is "installed". This brings the byte code size of Lev2 down from ~43 KB to ~25 KB.

          Robert Muir added a comment -

          We should also change the default fuzzy rewrite with this (eg reduce default max # terms from 1024)?

          I thought that once we hammer out this issue, with no backwards breaks, we could then separately discuss ways to improve defaults.
          One easy way would be to use Version in QueryParser to produce better defaults for the 3.1 release, which wouldn't break any backwards compatibility.

          Robert Muir added a comment -

          attached is a monster patch for flex, with autogenerated Lev1 and Lev2 from the gen.py i just uploaded.

          Robert Muir added a comment -

          Mike, attached is a first crack at a modified version of gen.py.
          This is producing Lev(1,2) code that passes all tests (I also added random tests to junit for n=1 so far)

          Changes:

          • @override, license note, stuff like that
          • getPosition() should return the index in the word that the state is associated with, this is the offset
          • there are w + 1, not w absolute states per parametric state.
          • in the ctor, the number of states to initialize per parametric state depends on the number of positions this parametric state represents (the length of the python list). So the inner loop here is modified to:
            for (int j = 0; j < ((w + 1) * stateSizes[i]); j++)

          if you get a chance, can you review? I don't think it's the final version, but we can iterate slowly.

          Michael McCandless added a comment -

          Great progress! These speedups are incredible, Robert.

          We should also change the default fuzzy rewrite with this (eg reduce default max # terms from 1024)?

          Robert Muir added a comment -

          Based on the work Mike provided here (slightly modified), N=2 appears to work correctly (all correct results so far). I will figure out a way to exhaustively test this beast.

          For now, here are the updated benchmarks, again on the same 10M terms, on my regular old HDD. You can scroll up to see how big a difference N=2 makes compared to only having N=1 available!

          I think with N=2 implemented, it's also clear that PQ size can actually be more important than using a prefix. For instance, I'm very happy with the performance here with a smaller PQ size of 64.

          Minimum Sim = 0.73f (edit distance of 1)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 3286.0 7.8
          0 64 3320.4 7.6
          1 1024 316.8 5.6
          1 64 314.3 5.6
          2 1024 31.8 3.8
          2 64 31.9 3.7

          Minimum Sim = 0.58f (edit distance of 2)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 4223.3 87.7
          0 64 4199.7 12.6
          1 1024 430.1 56.4
          1 64 392.8 9.3
          2 1024 82.5 45.5
          2 64 38.4 6.2

          Minimum Sim = 0.43f (edit distance of 3)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 5299.9 424.0
          0 64 5231.8 54.1
          1 1024 522.9 103.6
          1 64 480.9 14.5
          2 1024 89.0 67.9
          2 64 46.3 6.8

          Minimum Sim = 0.29f (edit distance of 4)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 6258.1 363.7
          0 64 6247.6 75.6
          1 1024 609.9 108.3
          1 64 567.1 13.3
          2 1024 98.6 66.6
          2 64 55.6 6.8
          Robert Muir added a comment -

          Mike, I made some modifications (mostly in places where I misled you), and I have your generator producing a correct DFA for n=1 (the tests pass), which is a great sign.

          will try to upload a new python script tonight and explain what changed.

          Michael McCandless added a comment -

          New gen.py, just wires the init state to state 0.

          Michael McCandless added a comment -

          New patch attached, implemented isAccept. It compiles but I'm not sure it's right!

          Michael McCandless added a comment -

          New gen.py and Lev2 attached – adds the size() and getPosition() methods.

          The resulting java code at least compiles, but I have no idea if it's working!

          Robert Muir added a comment -

          Mike, this is awesome!

          We can use the junit test case to test N=1 once we get to a nice place with this.
          The way it works is: it builds an NFA for N=1 and compares it with the result of this patch using Automaton.equals, which ensures they accept the same language.
          The test already checks this for all possible characteristic vectors, so if you believe the paper, and the tests pass, then it's correct for all strings.

          Testing the correctness of N=2 is harder: we can use the same principles, I think, but not Automaton.equals, as I don't know how to generate an NFA for N=2, even slowly.
          Instead I think we will have to verify against the actual Levenshtein distance formula; I think verifying all permutations of an alphabet of size 2n+1, for a string of length at least 2n+1, should be sufficient.
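
          For reference, the oracle would be the textbook dynamic-programming distance, something like:

              // Classic O(|a| * |b|) Levenshtein distance, usable as a slow
              // reference oracle for the DFA's accept/reject decisions.
              static int levenshtein(String a, String b) {
                int[] prev = new int[b.length() + 1];
                int[] curr = new int[b.length() + 1];
                for (int j = 0; j <= b.length(); j++) {
                  prev[j] = j; // distance from the empty prefix of a
                }
                for (int i = 1; i <= a.length(); i++) {
                  curr[0] = i;
                  for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                  }
                  int[] tmp = prev; prev = curr; curr = tmp;
                }
                return prev[b.length()];
              }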

          (i plan to also randomly brute-force test the new query against the old fuzzy query at some point, in any case)

          in my opinion we should take the lessons learned from N=2 and if successful, regenerate N=1 too, as the way I "keyed it in" is likely not the best or most compact.

          Michael McCandless added a comment -

          Attached gen.py, to generate the java code for the parametric DFA.

          I also attached the resulting code fragment for the N=2 case (6544 lines, 212.4 KB!).

          This is only for the transition function, and it's not complete – we need to fixup how state/offset are encoded into a single outer "int state". But I think it's close...

          Robert Muir added a comment -

          Edited the description, to hopefully make it simpler.

          Robert Muir added a comment -

          Attached is a 'contrived fuzzy benchmark' derived from my wildcard benchmark (randomly generated 7-digit terms).

          For the benchmark, I ran results for various combinations of minimum similarity, prefix length, and pq size against the test index of 10 million terms.

          Avg MS old is the current flex branch. Avg MS new is with the patch.

          Notes:

          • only the table for distance n=1 is implemented so far!
          • n=1 is fast.
          • Use of the PQ boost attribute speeds up fuzzy queries for higher n slightly, too.
          • adding a table for n=2 should be extremely helpful, and maybe even enough, with the default PQ size of 1024 (BQ.maxClauseCount), to make all fuzzy queries reasonable.

          Minimum Sim = 0.73f (edit distance of 1)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 3286.0 10.6
          0 64 3320.4 7.2
          1 1024 316.8 5.3
          1 64 314.3 5.3
          2 1024 31.8 4.0
          2 64 31.9 4.2

          Minimum Sim = 0.58f (edit distance of 2)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 4223.3 1341.6
          0 64 4199.7 501.9
          1 1024 430.1 304.1
          1 64 392.8 44.7
          2 1024 82.5 70.0
          2 64 38.4 7.7

          Minimum Sim = 0.43f (edit distance of 3)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 5299.9 2617.0
          0 64 5231.8 476.4
          1 1024 522.9 318.9
          1 64 480.9 73.9
          2 1024 89.0 83.9
          2 64 46.3 8.6

          Minimum Sim = 0.29f (edit distance of 4)

          Prefix Length PQ Size Avg MS (old) Avg MS (new)
          0 1024 6258.1 3114.0
          0 64 6247.6 684.6
          1 1024 609.9 380.0
          1 64 567.1 69.3
          2 1024 98.6 93.8
          2 64 55.6 11.4
          Robert Muir added a comment -
          • implement the pq algorithm: when the value at the bottom of the pq changes (BoostAttribute maxCompetitiveBoost), the enum adjusts itself by decreasing the edit distance and swapping in more efficient code.
          • remove the wasted prefix checks in AutomatonFuzzyTermsEnum, as Uwe noticed, because they are not necessary and are handled as part of the DFA itself (it will never seek to such terms).

          here is a patch, which is complete... needs code beautification/tests/docs but it has all functionality.

          we should also add a table for n=2, maybe n=3 also, but these can be separate issues.

          Robert Muir added a comment -

          Strange that the automaton author did not add this? Have you reported upstream?

          in my opinion, the optimization is incomplete. I think it can be generalized further, as follows:
          the NFA concatenation of DFA1 and DFA2 always results in a DFA if Sf(DFA1) and Si(DFA2) have no intersection, where Sf(DFA1) is the set of transitions from the final states of DFA1, and Si(DFA2) is the set of transitions from the initial state of DFA2.

          I doubt the usefulness of this optimization in general, but the very specific case where DFA1 is a singleton (and has no outgoing transitions, so the intersection is null by definition) is extremely important to Lucene, as it prevents an expensive NFA->DFA conversion for backwards compatibility with these "prefix" options (Automaton.getEnum and the Fuzzy prefix); especially in the fuzzy case, DFA2 is very large.

          if a user supplies a prefix, it should make the query faster, not slower

          also, you will note I didn't optimize concatenate(List) but only concatenate(A1, A2). Obviously a proper patch would optimize the List case, too.

          I feel the very specific case we care about is proved by induction in the junit cases I supplied, but I would think, as this is a math-oriented library, they would want the general optimization and a proof... if you can find one, let me know.

          Robert Muir added a comment -

          The question is now, is this extra prefix check really needed?

          probably not; as you are right, nextString() will generate null, so there is no need to explicitly handle the prefix. AutomatonTermsEnum does not do this either, for the same reason.

          Maybe you should add some comments to the AutomatonFuzzyTermsEnum

          yeah, i know it needs this... (the patch is not yet ready for committing, but it is functional).

          Uwe Schindler added a comment -

          Hi Robert,
          I reviewed your latest patch and was a little bit confused at first, but then everything became clear when I also looked into AutomatonTermsEnum and understood what happens. But there is still some "code duplication" (not really duplication, but functionality duplication):

          • If a constant prefix is used, the generated automata use a constant prefix + a Levenshtein automaton (using concat).
          • If you run such an automaton against the term index using the superclass, it will seek first to the prefix term (or some term starting with the prefix); that's ok. As soon as the prefix is no longer valid, the AutomatonTermsEnum stops processing (if running such an automaton using the standard AutomatonTermsEnum).
          • The AutomatonFuzzyTermsEnum checks if the term starts with the prefix and, if not, it ENDs the automaton. The reason why this works is that nextString() in the superclass automatically returns a string starting with the prefix, but this was the main thing that confused me.
          • The question is now, is this extra prefix check really needed? Running the automaton against the current term in accept would simply return no match, and nextString() would stop further processing? Or is this because the accept method should not iterate through all distances once the prefix is not matched?

          Maybe you should add some comments to the AutomatonFuzzyTermsEnum or some asserts to show what's happening.

          Uwe Schindler added a comment -

          this is the patch to improve BasicOperations.concatenate. If the left side is a singleton automaton, then it has only one final state with no outgoing transitions. Applying epsilon transitions with the NFA concatenation algorithm when the right side is a DFA always produces a resulting DFA, so mark it as such.

          Strange that the automaton author did not add this? Have you reported upstream?

          Robert Muir added a comment -

          this is the patch to improve BasicOperations.concatenate. If the left side is a singleton automaton, then it has only one final state with no outgoing transitions. Applying epsilon transitions with the NFA concatenation algorithm when the right side is a DFA always produces a resulting DFA, so mark it as such.

          if no one objects i will commit this small concatenate patch soon to the flex branch, as it also improves all other Automaton queries, and speeds up the rewrite-to-PrefixQuery check in AutomatonQuery.getEnum()

          Robert Muir added a comment -

          updated patch:

          • when the user supplies a constant prefix, it is actually used to speed up the automaton enum (fewer seeks). It is simply DFA-concatenated with the Lev DFA for the non-prefix part.

          i will upload a separate patch that improves BasicOperations.concatenate to allow this without creating an intermediate NFA.

          Robert Muir added a comment -

          updated patch:

          • the proxy is a TermsEnum, not a FilteredTermsEnum; this makes more sense
          Robert Muir added a comment -

          here is a more fleshed out patch (rough but all tests pass):

          • FuzzyTermsEnum is a proxy enum, it delegates behavior to either LinearFuzzyTermsEnum (the old code) or AutomatonFuzzyTermsEnum, and can switch during enumeration.
          • TODO: handle minBoostChanged, the event where the priority queue is full and the minimum score to be competitive has changed (the opportunity to switch to a more efficient enum).
          • code is generalized for all N, but i only include the table for N=0,1 for now.
          Robert Muir added a comment -

          I guess "gheto dfa" would not work, at least not fast enough (I didn't think about it really). Practically you would need to know which characters extend current character in you dictionary, or in DFA parlance, all outgoing transitions from the current state. "gheto dfa" cannot do it efficiently?

          I think it does it efficiently enough for any finite language (such as what you would use for fuzzy); the real problem with ghetto DFA relates more to infinite languages (the * operator in wildcard/regex).

          It's very easy to see the worst case and very difficult to see the average case; see: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.2155&rep=rep1&type=pdf

          If you have an idea you should try it on the ghetto DFA (Lucene's existing term dictionary), unless you can easily digest the average-case analysis presented in the paper (I can't)...

          E.g. it's surprising to me in practice. Don't be afraid of the O(sqrt(n)) average complexity presented in the paper, where n is the number of terms; this is for the "average regular expression". For things like Levenshtein automata it is much less, as they are finite (e.g. 64 inspections for Lev1 in my 10-million-term benchmark, not ~3,000 inspections).
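
          For the curious, the seeking intersection against the sorted term dictionary boils down to a loop like this (all helper names here are hypothetical; AutomatonTermsEnum is the real code):

              // Alternate between accepting in-order terms and seeking past gaps.
              String term = firstTerm();
              while (term != null) {
                if (dfaAccepts(term)) {
                  collect(term);                            // a match; keep scanning
                  term = nextTermInDictionary();
                } else {
                  String target = nextAcceptedString(term); // smallest accepted string > term
                  if (target == null) break;                // automaton exhausted
                  term = seekCeilingTerm(target);           // one seek skips the whole gap
                }
              }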

          Robert Muir added a comment -

          Eks, I actually linked FastSS to this issue (LUCENE-1513). I used to use it, but prefer automaton, as I don't want to have to build and keep track of that massive deletion neighborhood. It's not like it has better computational complexity or anything, and as the number of terms increases, these additional data structures become horrendous.

          For larger n, those deletion neighborhoods just get worse and worse (insanely large), just as query expansions do. And it's a static structure, so it's problematic for NRT and other use cases.

          Michael McCandless added a comment -

          and convince Mike to make a blazing fast persistent trie

          Mike is looking forward to that!

          The terms dict should really be stored as a transducer (a DFA where the edges can have outputs). This way both prefixes (trie) and suffixes (trie, in reverse) are shared. If this uses too much RAM, I wonder if we could do it only with the terms index, and then somehow the automata query would efficiently identify parts (of 128 terms in a row) of the on-disk terms dict to scan based on the index (might be tricky...).

          This sudden burst of ideas & interest in fast fuzzy multi term queries is very exciting...

          Eks Dev added a comment -

          ...Aaron i think generation may pose a problem for a full unicode alphabet...

          I wouldn't discount Aaron's approach so quickly! There is one really smart way to approach generation of the distance neighborhood. Have a look at FastSS: http://fastss.csg.uzh.ch/ The trick is to delete, not to generate variations over the complete alphabet! They call it a "deletion neighborhood". It also generates far fewer variation terms, reducing pressure on the binary search in the term dictionary!
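
          The trick, roughly (a sketch of the idea, not FastSS's actual code):

              import java.util.HashSet;
              import java.util.Set;

              // 1-deletion neighborhood: every string obtained by deleting one char.
              // Two terms are within distance 1 roughly when their neighborhoods
              // (term itself plus its deletions) intersect, so no expansion over
              // the whole alphabet is ever needed.
              static Set<String> deletionNeighborhood(String term) {
                Set<String> variants = new HashSet<String>();
                for (int i = 0; i < term.length(); i++) {
                  variants.add(term.substring(0, i) + term.substring(i + 1));
                }
                return variants;
              }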

          You do not get all these goodies from a weighted-distance implementation, but the solution is much simpler. It would work similarly to the current spellchecker (just a lookup on "variations"), only faster. They even have some example code showing how they generate "deletions" (http://fastss.csg.uzh.ch/FastSimilarSearch.java).

          but the more intelligent stuff you speak of could be really cool esp. for spellchecking, sure you dont want to rewrite our spellchecker?

          btw its not clear to me yet, could you implement that stuff on top of "ghetto DFA" (the sorted terms dict we have now) or is something more sophisticated needed? its a lot easier to write this stuff now with the flex MTQ apis

          I really would love to, but I was paid before to work on this.

          I guess "gheto dfa" would not work, at least not fast enough (I didn't think about it really). Practically you would need to know which characters extend current character in you dictionary, or in DFA parlance, all outgoing transitions from the current state. "gheto dfa" cannot do it efficiently?

          One idea with flex would be to implement this stuff with an in-memory trie (full trie or TST), before jumping into the noisy channel (that is easy to add later) and a persistent trie dictionary. The traversal part is identical, and it would make a nice contrib with a useful use case, as the majority of folks have enough memory to slurp the complete term dictionary into memory... It would serve as a proof of concept for flex and FuzzyQuery, and help you understand the magic of calculating edit distance against trie structures. Once you have a trie structure, the sky is the limit: prefix, regex... If I remember correctly, there were some trie implementations floating around; with one of those you need just one extra traversal method to find all terms at distance N. You can have a look at the TST implementation at http://jaspell.sourceforge.net/, class TernarySearchTrie.matchAlmost(...). Just as an illustration of what goes on there: it is a simple recursive traversal of all terms at max distance N.
          Later we could tweak the memory demand, switch to some more compact trie... and at the end add weighted distance and convince Mike to make a blazing fast persistent trie... In the meantime, the folks with enough memory would have really, really fast fuzzy, prefix... better distance...

          So much for the theory. I hope you find these comments useful, even without patches.

          Robert Muir added a comment -

          thanks, this will give you something to play with wrt DFA queries.
          in flex, Regex and wildcard are already implemented this way.

          one additional complexity that this code will not handle (until some other issues are resolved and I update the patch) is the "pq trick".

          This is the fact that FuzzyQuery (or really anything else that uses TopTermsRewrite) is really an n-best list, and we shouldn't seek to terms that will simply be overflowed off the priority queue anyway, as it's just a waste of time (sketched below).

          this is the part I am working on now; most of it is already done under LUCENE-2140 and some setup is now under LUCENE-2261
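
          As a minimal sketch of the idea (hypothetical names, not the actual TopTermsRewrite code): once the bounded queue is full, its weakest entry defines the minimum competitive boost, and a term that cannot beat it is not worth seeking to.

          import java.util.Comparator;
          import java.util.PriorityQueue;

          class TopTermsSketch {
            static class ScoreTerm {
              final String term;
              final float boost;
              ScoreTerm(String term, float boost) { this.term = term; this.boost = boost; }
            }

            final int maxSize = 1024; // e.g. FuzzyQuery's default number of expansions
            // min-heap on boost, so peek() is the weakest term still competing
            final PriorityQueue<ScoreTerm> pq = new PriorityQueue<ScoreTerm>(maxSize,
                new Comparator<ScoreTerm>() {
                  public int compare(ScoreTerm a, ScoreTerm b) { return Float.compare(a.boost, b.boost); }
                });

            boolean worthSeeking(float candidateBoost) {
              if (pq.size() < maxSize) return true;    // queue not yet full: everything competes
              return candidateBoost > pq.peek().boost; // must beat the current bottom entry
            }
          }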

          Aaron McCurry added a comment -

          Sure I will give it a try, still wrapping my head around the flex branch.

          Robert Muir added a comment -

          Excuse my ignorance, but what is a DFA?

          here is the Wikipedia page with a description: http://en.wikipedia.org/wiki/Deterministic_finite-state_machine

          Robert Muir added a comment -

          So I have to ask, who has the entire unicode alphabet indexed in a single field? I don't, and I am willing to make tradeoffs in my system for performance.

          Aaron, would you mind testing this patch with the flex branch (with the sample code I listed before)? There is no tradeoff to support Unicode.
          If you don't have Chinese chars in your index, the enum will not seek to them, but skip past them, as they do not exist, and Term(s)Enum always returns things in sorted order.

          For more details, see AutomatonTermsEnum in flex (this is the code that actually does DFA <-> DFA intersection of any Automaton with lucene's term dictionary).

          Aaron McCurry added a comment - - edited

          Aaron i think generation may pose a problem for a full unicode alphabet.
          e.g. edit distance of 1 on "otter" is 720,891 strings... this is a large boolean query!

          So I have to ask, who has the entire unicode alphabet indexed in a single field? I don't, and I am willing to make tradeoffs in my system for performance. So for instance, I am willing to code a mapping file or scan my index on startup to know the character set (a-z 0-9 or whatever) to reduce the possibilities. This is a simple know-your-data problem, at least IMHO.

          Aaron

          Aaron McCurry added a comment -

          Excuse my ignorance, but what is a DFA?

          Robert Muir added a comment -

          But who would use standard distance with such a beast, reducing the impact of inserting/deleting a silent "h" as in "Thomas"/"Tomas"...

          hehe, well my only goal with this issue (and really automaton in general) is to speed up in a backwards-compatible way, so I am stuck with standard distance for that purpose.

          but the more intelligent stuff you speak of could be really cool, esp. for spellchecking. sure you don't want to rewrite our spellchecker?

          btw it's not clear to me yet, could you implement that stuff on top of "ghetto DFA" (the sorted terms dict we have now) or is something more sophisticated needed? it's a lot easier to write this stuff now with the flex MTQ apis

          Robert Muir added a comment -

          Aaron i think generation may pose a problem for a full unicode alphabet.

          e.g. edit distance of 1 on "otter" is 720,891 strings... this is a large boolean query!
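
          (To see where a number of that order comes from, assuming the 65,536-character Basic Multilingual Plane as the alphabet: 5 deletions + 5 x 65,535 substitutions + 6 x 65,536 insertions = 720,896 candidate strings, minus a handful of duplicates.)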

          so this is where the ping-ponging of AutomatonQuery really helps: we don't have to do 720,891 term lookups, because the term dictionary is in sorted order, so it's treated like DFA intersection.

          btw, you can test this with your 170M terms by applying this patch to the flex branch and using the following code.

          // (imports omitted: Term, Query, AutomatonQuery, Automaton and
          // LevenshteinAutomata come from the flex branch in this patch)
          LevenshteinAutomata a = new LevenshteinAutomata("otter"); // searching on otter
          Automaton lev1 = a.toAutomaton(1); // accepts terms within edit distance 1
          Query query = new AutomatonQuery(new Term("field", "bogus"), lev1); // the term text "bogus" is presumably just a placeholder; the field is what matters
          ...

          Aaron McCurry added a comment -

          I have written a Levenshtein generator today that seems to operate similarly to what is being discussed here. It generates all the possible matches to the Levenshtein algorithm given a term and a character set, then creates a BooleanQuery from it. For a given term with an edit distance of 1 or 2 it is extremely fast. I tested it on my dev data that has about 8 billion documents with 20 shards; each shard has about 170,000,000 terms in the field that I'm testing. The normal fuzzy query with a term length of 8 and an edit distance of 2 took about 110 seconds to complete, and the auto-generated query took around 1.5 seconds to complete. However, this method will probably only work with edit distances in the 1 and 2 range, because once I hit 3 it spiked the memory usage and slowed way down (to be expected). Not sure if you all want to take a look at my code or not, but if you want me to post it I will.

          Eks Dev added a comment -

          I assume you mean by weighted edit distance that the transitions in the state machine would have costs?

          Yes, kind of, not embedded in the trie, just defined externally.

          What I am talking about is a part of the noisy channel approach, modeling only the channel distribution. Have a look at http://norvig.com/spell-correct.html for the basic theory. I am suggesting almost the same, just applied at the character level and without the language-model part. It is rather easy once you have your dictionary in some sort of tree structure.

          You guide your traversal over the trie by iterating on each char in your search term, accumulating log probabilities of single transformations (recycling the prefix part). When you hit a leaf, insert into a PriorityQueue of appropriate depth. The "probabilities of single transformations" are defined as:
          insertion(character a) // map char->log probability (think of it as the "cost" of inserting this particular character)
          deletion(character a) // map char->log probability...
          transposition(char a, char b)
          replacement(char a, char b) // 2D matrix <char,char>->probability (cost)
          if you wish, you could even add some positional information, boosting matches at the start/end of the string (see the sketch just below for one way such a cost model could be shaped)
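
          As a minimal sketch (hypothetical names, not LingPipe's code) of such an externally defined cost model, with standard Levenshtein falling out as the degenerate case where every edit costs 1:

          class WeightedDistanceSketch {
            // costs are negative log probabilities, so smaller = more likely
            interface TransformationCosts {
              double insertion(char a);             // cost of inserting character a
              double deletion(char a);              // cost of deleting character a
              double transposition(char a, char b); // cost of swapping adjacent a and b
              double replacement(char a, char b);   // cost of replacing a with b
            }

            // standard Levenshtein: every edit costs 1, a match costs 0
            static final TransformationCosts STANDARD = new TransformationCosts() {
              public double insertion(char a) { return 1; }
              public double deletion(char a) { return 1; }
              public double transposition(char a, char b) { return 1; }
              public double replacement(char a, char b) { return a == b ? 0 : 1; }
            };
          }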

          I avoided the tricky mechanics of traversal, insertion, deletion, but on a trie you can do it by following different paths...

          the only good in-memory implementation around I know of is in the LingPipe spell checker (they implement the full Noisy Channel, with a Language model driving traversal)... it has huge educational value, Bob is really great at explaining things. The code itself is proprietary.
          I would suggest you peek into this code to see this 2-minute rambling I wrote here properly explained. Just ignore the language model part and assume you have a NULL language model (all chars in the language are equally probable), doing full traversal over the trie.

          If this is the case couldn't we even define standard levenshtein very easily (instead of nasty math), and would the beam search technique enumerate efficiently for us?

          Standard Lev. is trivially configured once you have this; it is just setting all these costs to 1 (delete, insert... in the log domain)... But who would use standard distance with such a beast, reducing the impact of inserting/deleting a silent "h" as in "Thomas"/"Tomas"...
          Enumeration is trie traversal, practically calculating the distance against all terms at the same time and collecting the N best along the way. The place where you save your time is recycling the prefix part of this calculation. Enumeration is optimal, as the trie contains only the terms from the termDict, you are not trying all possible alphabet characters, and you can implement "early path abandoning" easily either by cost (log probability) or/and by limiting the number of successive insertions

          If interested in really in-depth things, look at http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198
          Great book (another great tip from Bob@LingPipe). A bit strange with terminology (at least to me), but once you get used to it, it is really worth the time you spend trying to grasp it.

          Robert Muir added a comment -

          Imo, the only thing that would work better is to make the term dictionary a real trie (ternary, n-ary, dfa, makes no big diff).

          yeah I think a real DFA would be more efficient, it's too much to think about. I guess this is why we leave this kind of thing to Mike

          I haven't grasped how this beam search would work with a DFA dictionary, although it definitely sounds cool.
          I assume you mean by weighted edit distance that the transitions in the state machine would have costs?
          If this is the case couldn't we even define standard levenshtein very easily (instead of nasty math), and would the beam search technique enumerate efficiently for us?

          Eks Dev added a comment -

          What about this,
          http://www.catalysoft.com/articles/StrikeAMatch.html
          it seems logically more appropriate to (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance faster?

          Is it only me who sees plain vanilla bigram distance here? What is new or better in StrikeAMatch compared to the first phase of the current SpellChecker (feeding a PriorityQueue with candidates)?

          If you need to use this, nothing is simpler; you do not even need pair comparison (aka traversal), just index terms split into bigrams and search with a standard Query.
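
          For reference, a minimal set-based sketch of the pair-counting similarity StrikeAMatch describes (essentially Dice's coefficient over character bigrams; the names here are hypothetical):

          import java.util.HashSet;
          import java.util.Set;

          class BigramSimilarity {
            // 2 * |shared pairs| / (|pairs(a)| + |pairs(b)|), in [0, 1]
            static double similarity(String a, String b) {
              Set<String> pa = bigrams(a), pb = bigrams(b);
              if (pa.isEmpty() && pb.isEmpty()) return 1.0; // both too short to form pairs
              if (pa.isEmpty() || pb.isEmpty()) return 0.0;
              int shared = 0;
              for (String p : pa) {
                if (pb.contains(p)) shared++;
              }
              return (2.0 * shared) / (pa.size() + pb.size());
            }

            static Set<String> bigrams(String s) {
              Set<String> out = new HashSet<String>();
              for (int i = 0; i + 1 < s.length(); i++) out.add(s.substring(i, i + 2));
              return out;
            }
          }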

          The Automaton trick is a neat one. Imo, the only thing that would work better is to make the term dictionary a real trie (ternary, n-ary, dfa, makes no big diff). Making the TermDict some sort of trie/dfa would permit smart beam search, even without compiling a query DFA. Beam search also makes the implementation of better distances possible (weighted edit distance without the "metric constraint"). I guess this is going to be possible with Flex; Mike was already talking about a DFA Dictionary

          It took a while to figure out the trick Robert pulled here, treating the term dictionary as another DFA due to its sortedness, nice.

          Fuad Efendi added a comment - - edited

          Another idea (similar to the 50-year-old "auto-recovery"); it doesn't let me sleep
          What if we do all distance calculations (and other types of calculations) at indexing time instead of at query time? For instance, we may have index structure like

          {Term, List[MisspelledTerm, Distance]}

          , and we can query this structure by

          {MisspelledTerm, Distance}

          ? We mentioned it here already, LUCENE-1513, but our use case is very specific... and why allow 5 spelling mistakes in Unicode if the user's input contains only 3 characters in Latin1? We should hardcode some constraints.

          Yes, memory requirements... in the case of "????dogs" it can be at least a few million additional misspelled terms for this specific "dogs" term only... but it doesn't grow linearly... and we can limit such a structure to distance=2, and use additional query-time processing if we need distance=3.

          It's just a (naive) idea: to precalculate "similar terms" at indexing time...

          Fuad Efendi added a comment -

          Levenshtein Distance is good for the "Spelling Corrections" use case (even the terminology is the same: insert, delete, replace...)
          But it is not good for more generic similarity: the distance between RunAutomation and AutomationRun is huge (6 edits!). But that is really a two-word combination, and I don't know of a good one-(human)-word use case where Levenshtein Distance is not good (or natural). From another viewpoint, I can't see any use case where StrikeAMatch (counts of 2-char similarities) is bad, although it is not "spelling corrections". And, from a third viewpoint, if we totally forget that it is indeed human-generated input and implement Levenshtein distance on raw bitsets instead of Unicode characters (end-user clicks on a keyboard)... we will get totally non-acceptable results...

          I believe such "distance" algos were initially designed many years ago, before the Internet (and Search), to allow auto-recovery during data transmission (first astronauts...) - but auto-recovery was based on the fact that an (acceptable) mistaken code has one and only one closest match in the dictionary; so it was extremely fast (50 years ago). And now, we are using an old algo designed for a completely different use case (fixed-size bitset transmissions) for "spelling corrections"...

          What if we focus on a keyboard (101 keys?) instead of Unicode... "spelling corrections"...

          20ms is not good, it is only 50TPS (on a single core)...

          Robert Muir added a comment -

          Fuad, it does not expand the query to OR'ed terms.

          here is a description from the 2002 paper of how the automaton approach works (for regex, wildcard, fuzzy, whatever the case).
          I edited the description to substitute components with their implementation... and yes, it works for all of Unicode (all of the more than 1 million codepoints, not just the BMP)

          For more details, see AutomatonTermsEnum in the flex branch svn.

          The set of all dictionary words is treated as a regular language over the alphabet of letters. At each step, the prefix of all letters that are consumed on the path from the initial state to the current state is maintained. A variant of the Wagner-Fisher algorithm is used to control the walk through the automaton (enumeration of Lucene's term dictionary) in such a way that only prefixes are generated (FilteredTermsEnums only seeks to terms) that potentially lead to a correction candidate V where the Levenshtein distance between V and W does not exceed a fixed bound n.
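
          Schematically, the intersection looks like the sketch below (the interfaces and method names are hypothetical stand-ins, not the actual AutomatonTermsEnum code):

          import java.util.List;

          class IntersectSketch {
            interface Dfa {
              boolean accepts(String s);
              String nextAcceptedAfter(String s); // smallest accepted string > s, or null
            }

            interface SortedTerms {
              String next();                  // next term in sorted order, or null at the end
              String seekCeil(String target); // smallest term >= target, or null
            }

            static void intersect(Dfa dfa, SortedTerms terms, List<String> out) {
              String term = terms.next();
              while (term != null) {
                if (dfa.accepts(term)) {
                  out.add(term);                  // a true match: no edit-distance computation needed
                  term = terms.next();
                } else {
                  String ceiling = dfa.nextAcceptedAfter(term);
                  if (ceiling == null) break;     // the DFA's language is exhausted
                  term = terms.seekCeil(ceiling); // one seek skips the whole gap of non-matches
                }
              }
            }
          }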

          Fuad Efendi added a comment - - edited

          Ok, now I understand what AutomatonQuery is... frankly, I had this idea, to create a small dictionary of "similar words", to create terms from those words, and to execute the query <Word1> OR <Word2> OR ... instead of scanning the whole term dictionary; but how "small" would such a dictionary be in the case of, for instance, "????dogs"... is the size of the dictionary (in the case of ASCII characters) 26*26*26*26? Or 65536*65536*65536*65536 in the case of Unicode?
          Simple.
          Is it so simple?

          Even with Unicode, we can precount the set of characters for a specific field instance; it can be 36 characters; and a query like <aaaadogs> OR <aaabdogs> OR ... OR <9999dogs>

          and, if we can quickly find the intersection of a custom dictionary with the terms dictionary, then build the query... am I on the correct path with my understanding?

          Robert Muir added a comment -

          what I am trying to say is this: can DFA with Levenshtein reach 250TPS (in real-world multi-tier web environment)?

          It depends a lot on what is in your index! for now I use a benchmark of generating 10 million unique terms, and generating random fuzzy queries against them. There is no reason to involve Solr; I don't understand what it has to do with benchmarking this, and it's still stuck on Lucene 2.9, so it wouldn't be practical in any case.

          Also, is DFA mostly CPU-bound? Can we improve it by making (some) I/O-bound unload?

          not really, I don't think the DFA intersection can really be any faster than it is. for example, on the 10 million terms benchmark, a Lev(1) DFA must seek to 64 terms. This averages right under 20ms last I checked (it is currently averaging over 3 seconds in trunk and flex).

          Fuad Efendi added a comment - - edited

          For LUCENE-2230 I did a lot of long-run load-stress tests (against SOLR), but before doing that I created a baseline for the static admin screen in SOLR: 1500TPS. And I reached 220TPS with Fuzzy Search... what I am trying to say is this: can DFA with Levenshtein reach 250TPS (in real-world multi-tier web environment)? The baseline for a static page is 1500. Also, is DFA mostly CPU-bound? Can we improve it by making (some) I/O-bound unload?
          Just joking

          I used an explicit distance threshold=2, that's why 220TPS... with threshold=5 it would be 50TPS or maybe less...

          If DFA doesn't have dependency on threshold, it is the way to go.

          Robert Muir added a comment -

          and we need to be able to compare the old relevance with the new one (with an integer distance threshold it is not the same as with the classic floating-point one...)

          this is not true: we can determine, for any word and any floating-point value, the maximum integer distance that it equates to.

          This query will be completely backwards compatible, same results, same order, only faster.
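
          As a sketch of that conversion (assuming FuzzyTermEnum's convention that similarity = 1 - distance / min(len(term), len(target)); this is not the patch's exact code):

          class SimilaritySketch {
            // for a query term of the given length, the largest edit distance that can
            // still meet the minimum similarity
            static int maxEditsFor(float minimumSimilarity, int termLength) {
              return (int) ((1.0f - minimumSimilarity) * termLength); // e.g. 0.5 and length 8 -> 4 edits
            }
          }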

          What if we can store some precounted values in the index... such as storing "similar terms" in additional field... or some pieces of DFA (which I still need to learn...)

          This is not necessary for AutomatonQuery, and imposes too much of a space tradeoff. You can see an example of this in LUCENE-1513, where deletion neighborhoods are indexed.

          However, given the speedups to AutomatonQuery itself done in LUCENE-1606, it will actually surpass this approach, impose no index or storage requirements, scale to higher N, and allow for full back compat.

          The only "tradeoff" in this approach is the headache caused by the mathematics, and the majority of this is now solved... unlike approaches that require additional disk or memory or indexing changes, users are unaffected if we stay up too late staring at funky math.

          Fuad Efendi added a comment -

          Hi Robert,

          Yes, I agree; we need to stick with Levenshtein distance, also to isolate performance comparisons: same distance, but FuzzyTermEnum with full scan vs. the DFA-based approach; and we need to be able to compare the old relevance with the new one (with an integer distance threshold it is not the same as with the classic floating-point one...). Thanks for the link to your article!

          What if we can store some precounted values in the index... such as storing "similar terms" in additional field... or some pieces of DFA (which I still need to learn...)

          Robert Muir added a comment -

          Fuad, as far as edit distances go, I am implementing Levenshtein only out of interest in speeding up the core FuzzyQuery (and keeping backwards compatibility).

          it's entirely possible you could write an algorithm to create a DFA that accepts any word within n strikeAMatch distances if you wanted to. you could then use AutomatonQuery to run it efficiently.

          The only reason I selected Levenshtein is because that's what FuzzyQuery does already, not because I think it's particularly "good", although I do feel it's the measure that FuzzyQuery should stick with, as it is the most general-purpose.

          The distance measure you mentioned might make more sense for some languages and purposes, but for others maybe not.

          Fuad Efendi added a comment - - edited

          What about this,
          http://www.catalysoft.com/articles/StrikeAMatch.html
          it seems logically more appropriate to (human-entered) text objects than Levenshtein distance, and it is (in theory) extremely fast; is DFA-distance faster?

          Robert Muir added a comment -

          but BKTree can help only if the threshold is small, otherwise it is similar to a full scan.

          right, this is one aspect that makes the DFA approach attractive: even if there is a high threshold, so high that it's not even worth seeking around and brute-force iteration is faster, we can still check the Levenshtein distance in O(n) time with a DFA, instead of running the expensive dynamic programming algorithm.
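
          Concretely, checking a term against a compiled DFA is just one table lookup per character (a sketch against the brics-style RunAutomaton API; the flex fork's package location may differ):

          import dk.brics.automaton.RunAutomaton;

          class DfaMatchSketch {
            // accept/reject a term in O(n) array lookups against a compiled DFA
            static boolean matches(RunAutomaton runAutomaton, String term) {
              int state = runAutomaton.getInitialState();
              for (int i = 0; i < term.length(); i++) {
                state = runAutomaton.step(state, term.charAt(i)); // one transition-table lookup
                if (state == -1) return false;                    // dead end: cannot match
              }
              return runAutomaton.isAccept(state);
            }
          }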

          i found this out for wildcard as well: even when just doing brute-force enumeration, the compiled RunAutomaton eats the old hand-coded wildcardEquals() for breakfast, because it's a faster matcher.

          this is in line with some tests done by JFlex: http://jflex.de/manual.html#PerformanceHandwritten . I have to agree that in this day and age, writing hand-coded stuff to parse or match is simply a waste of time, as chances are a DFA will be considerably more efficient.

          Robert Muir added a comment -

          Distance (submitted by the end user) must be an integer

          this is not true for this issue. the plan is to keep all APIs and results the same; we can just convert the minimum distance into an integer N to determine which DFA needs to be executed.

          another difference here is that the minimum distance is not fixed, but changes during enumeration: if the priority queue fills up, then there is no point wasting time seeking to terms that will only be dropped, see LUCENE-2140

          Robert Muir added a comment -

          Fuad, yes, this is the problem with the paper you listed; it will not work for lucene... in my opinion the paper you mentioned is actually better/simpler in some ways than what we are using, but the problem is the term dictionary would have to be restructured for this, or we would have to keep all terms in RAM, or something worse.

          the advantage of the (more complex) 2002 paper is that we can run it on lucene's existing term dictionary, and it will seek to the right spots (i.e. you can take the code i attached here, and pass it to AutomatonQuery, and it executes efficiently).

          it's also an extremely fast algorithm: the construction of the DFA itself is O(n), where n is the length of the word.

          i tried to make a visual description of how automaton works for regexp, wildcard, and fuzzy here, if you are interested: http://rcmuir.wordpress.com/2009/12/04/finite-state-queries-for-lucene/

          in short, AutomatonQuery can be viewed as a simplified form of the DFA intersection presented in the paper you linked to (section 5.1). The lucene term dictionary itself is a trie of terms in sorted order, therefore it is also a DFA in the mathematical sense.

          Fuad Efendi added a comment - - edited

          Ok; I am trying to study DFA&NFA and to compare with LUCENE-2230 (the BKTree size is fixed without dependency on distance, but we need to hard-cache it...). What I found is that the classic Levenshtein algo is eating 75% of the CPU, and the classic brute-force TermEnum 25%...
          Distance (submitted by the end user) must be an integer...

          Edited: BKTree memory requirements don't depend on the distance threshold etc.; but BKTree can help only if the threshold is small, otherwise it is similar to a full scan.

          Robert Muir added a comment -

          Fuad, this is the wrong paper; it will not work for lucene... this issue uses http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 (linked in the summary).

          Fuad Efendi added a comment -

          Downloadable article (PDF): http://www.mitpressjournals.org/doi/pdf/10.1162/0891201042544938?cookieSet=1

          Robert Muir added a comment -

          You can get all the states just by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that's subsumed by another. That's what I mean by brute force. But in the paper, they boil this down to nice little i-param "tables", extracting some sort of pattern from that process. They give no hint on how they do this, or whether it is applicable to greater n's though.

          these parameterized states are based on the minimal boundary, and it helps me a bit to start thinking in those terms (instead of E_2, think of 2#1, 3#1, 4#1). Then the parameterized tables for n=2 make a little more sense... (I'm guessing you already know this Mark, just in case)

          Mark Miller added a comment -

          Sorry Earwin - to be clear, we don't actually use chapter 6 - AutomatonQuery needs the automata.

          You can get all the states just by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that's subsumed by another. That's what I mean by brute force. But in the paper, they boil this down to nice little i-param "tables", extracting some sort of pattern from that process. They give no hint on how they do this, or whether it is applicable to greater n's though. No big deal I guess - the computer can do the brute-force method - but I wouldn't be surprised if it starts to bog down at much higher n's.

          Robert Muir added a comment -

          Are you referring to the process of generating lev-automata, or to chapter 6, where they skip generating automata and compute its states on the go instead?

          no, he refers to generating the parametric 'tables' for any n

          Earwin Burrfoot added a comment -

          I would like to know how the paper is getting the 'i'-parameterized state generators though - that's much more efficient.

          Are you referring to the process of generating lev-automata, or to chapter 6, where they skip generating automata and compute its states on the go instead?

          Robert Muir added a comment -

          thanks for the pointer Earwin, it's very useful to have something else to look at.

          Mark, definitely not losing interest on finishing, but mostly recovering from the brain damage that paper caused me.

          I will say I ran stats on various dictionaries, and I think the following:

          • I think making the defaults for fuzzy (0.5 similarity with 1024 n-best expansions) fast is very challenging. I actually think the pq size (1024 n-best expansions) is the big kicker here; I mean, it seems a little absurd to me... If this were smaller we could support lower n and not even worry about the 0.5 similarity so much, because we would fill up the pq quicker and n would drop fast. But I think adjusting both defaults might make sense.
          • I think we should try to find ways of figuring out what good defaults should be, and this shouldn't be based just on performance (i.e. an OCR-damaged test collection, something like that).

          So anyway, I'm really happy with how far this is right now, because if you look at the performance numbers at the top, we were spending 300ms building DFAs for 56-character-long words, and now this is 5ms. we just need to go to the next step.

          Mark Miller added a comment -

          If you do take hold of it, do not hesitate to share. The original paper and C++ code likewise melt my brain, and I needed the algo in some other place.

          The java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Robert was convinced it was a different, less efficient algorithm, last I heard though.

          We have cracked much of the paper - that's how Robert implemented n=1 here - that's from the paper. The next step is to work out how to construct the tables for any n, as Robert says above. And to store those tables efficiently, as they start getting quite large rather fast - though we might only use as high as n=3 or 4 in Lucene - Robert suspects term seeking will outweigh any gains at that point. I think we know how to do the majority of the work for the n case, but I don't really have much/any time for this, so it probably depends on if/when Robert gets to it. If he loses interest in finishing, I definitely plan to come back to it someday. I'd like to complete my understanding of the paper and see a full-n java impl of this in either case. The main piece left that I don't understand fully (computing all possible states for n) can be computed with just a brute-force check (that's how the python impl is doing it), so there may not be much more to understand. I would like to know how the paper is getting the 'i'-parameterized state generators though - that's much more efficient. The paper shows them for n=1 and n=2.

          Earwin Burrfoot added a comment -

          by the way, the only open impl of this algorithm i could find is at http://rrette.com/moman.html (ZSpell) in python.

          I recently stumbled upon a C++ey STLey impl -> http://code.google.com/p/patl/

          I might be on the trail of a java impl - get out the hounds!

          If you do take hold of it, do not hesitate to share. The original paper and C++ code likewise melt my brain, and I needed the algo in some other place.

          Robert Muir added a comment -

          You wrote that some parts of the code are "automatically generated". Is there code available to do this for other n? Otherwise I would move the class to the util.automaton package, as e.g. the regex parser is also there.

          it is not so simple... as n increases the amount of generated code gets very large. also the math to generate the code is very heavy.
          the first 'table' i simply keyed in manually. so it would be nice if these 'tables' could be more compact.... this is one reason i uploaded this code.

          I assume that I can use the generated automaton with AutomatonQuery and would hit all docs, but without scoring?

          Yes, for n=1 with this code. We could also 'key in' the code for n=2, but it's a lot more, and it would be better to figure out the smallest representation in Java and write a code generator.

          Uwe Schindler added a comment -

          You wrote that some parts of the code are "automatically generated". Is there code available to do this for other n? Otherwise I would move the class to the util.automaton package, since e.g. the regex parser is also there.

          I assume that I can use the generated automaton with AutomatonQuery and would hit all docs, but without scoring?

          Robert Muir added a comment -

          Here is a patch that implements the crazy algorithm.

          I only include the tables for n=1, but the algorithm is general to all n, has tests, works, and is fast.

          The next steps, imho, are:

          • look at how adding tables for higher n will help, by examining some dictionary statistics (assuming a priority queue of 1024)
          • add tables for those values of n
          • add the priority queue attribute
          • write the enum (probably adding a little to AutomatonTermsEnum so it can be extended to do this)
          • maybe make the code better for this part.
          Uwe Schindler added a comment -

          Cool! The code looks quite simple (but maybe this is because of n=1). But are FuzzyQueries with n>1 used seldom, or not? And how slow is it?

          Robert Muir added a comment -

          Attached is a test case that builds n=1 for some string the slow way.
          You can use this to verify that some faster method is correct, via assertEquals.
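
          For illustration, here is a minimal sketch (not the attached TestFuzzy.java) of what such a slow-way oracle can look like: enumerate every single-edit variant of a term over a toy alphabet, then assert that a faster n=1 matcher accepts exactly that set. The class name and alphabet are assumptions for the example, not code from the patch.

          import java.util.HashSet;
          import java.util.Set;

          public class SlowLev1Oracle {
            static final char[] ALPHABET = "abcd".toCharArray(); // toy alphabet, an assumption

            // all strings within Levenshtein distance 1 of s, built by brute force
            static Set<String> neighborhood(String s) {
              Set<String> out = new HashSet<>();
              out.add(s); // distance 0
              for (int i = 0; i <= s.length(); i++) {
                if (i < s.length())
                  out.add(s.substring(0, i) + s.substring(i + 1)); // deletion
                for (char c : ALPHABET) {
                  out.add(s.substring(0, i) + c + s.substring(i)); // insertion
                  if (i < s.length())
                    out.add(s.substring(0, i) + c + s.substring(i + 1)); // substitution
                }
              }
              return out;
            }

            public static void main(String[] args) {
              // a faster n=1 matcher must accept exactly these strings (over ALPHABET)
              System.out.println(neighborhood("abc").contains("abd")); // true: one substitution
              System.out.println(neighborhood("abc").contains("add")); // false: two edits away
            }
          }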

          Mark Miller added a comment -

          I might be on the trail of a java impl - get out the hounds!

          Perhaps I won't be looking over that paper for Thanksgiving after all.

          Mark Miller added a comment -

          Sorry - I attached the wrong file.

          Robert Muir added a comment -

          Mark, from his page it seemed like 0.2 was the version with the generalized edit distance?

          Moman 0.2 is out!
          (2005-07-29) This version add the possibility to use a Levenshtein distance greater than 1. 
          Before, the transitions tables were static, now we build them. 
          It means that in theory, you could ask for a Levenshtein distance of 27! 
          Well, if you have a week ahead of you... 
          
          Mark Miller added a comment -

          From Moman author:

          Absolutely. Sorry for the missing links. I had some problems with my
          provider which backed up an old version of my website.
          The library is MIT licensed, so feel free to take anything you want. I
          didn't made the subversion available since I was working
          on Daciuk's "Incremental construction of minimal acyclic finite-state
          automata", improving the memory footprint of the algorithm.
          Since I want to be sure the new algorithm is right before publishing
          it, the repository isn't available. Anyway, here's the source
          code (attached). I must warn you that there's no comments at all,
          which is unfortunate, since I was contacted by many people
          lately, that had the same kind of request than yours.

          If you need anything, just don't hesitate to ask.

          Mark Miller added a comment -

          Solr doesn't even allow for a constant prefix with fuzzy - it's broken, broken, broken. DisMax to the rescue.

          Robert Muir added a comment -

          Mark, you are right.

          Plus, the QP does not throw exceptions if you have a fuzzy query with no constant prefix, which is going to be actually worse than even a leading *!!!

          Mark Miller added a comment -

          I think it makes sense to allow a leading ? - ?????????abc is probably rare enough that it's worth allowing by default. Or perhaps a separate knob to turn it on and leave leading * off.

          Robert Muir added a comment -

          but you will notice both the Lucene QP and Solr don't allow a leading wildcard by default

          If we add this automaton stuff, I think we should reconsider this rule:
          instead of disallowing a leading * or ? by default, just disallow a leading * by default.

          I won't really argue for it though, because a wildcard query of "?????????????abc" is probably just as bad as "*abc".

          Mark Miller added a comment -

          we find the nice n where this is almost as bad as brute force, and cut her off there?

          Yeah, that sounds good if there is a nice transition at a relatively low n.

          this is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.

          Yeah, and I think this makes sense - but you will notice both the Lucene QP and Solr don't allow a leading wildcard by default - the perf is generally bad enough on a large index that we assume you don't want to allow it, and force users to turn the option on.

          Robert Muir added a comment -

          I wouldn't really like it if it was automatic by default - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.

          You are completely right; it should not drop off a cliff. Though I think as n increases, it will get significantly worse, because of so much seeking.
          We find the nice n where this is almost as bad as brute force, and cut her off there?

          This is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.
          Uwe and I are referring to this as "dumb" mode versus "smart" mode. I'm going to add a constructor to the enum, so this can be specified rather than "auto" as well.

          Mark Miller added a comment -

          if it requires more edits than that, go with the old brute force algorithm?

          Perhaps - but it should be an option. We don't want perf to just drop off a cliff (and that's an understatement on a large index) just because the input crossed some line. I'd almost prefer the input just doesn't work - in my mind FuzzyQuery just doesn't work on large indices. But far be it from me to take away the option ...

          I wouldn't really like it if it was automatic by default - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.

          Robert Muir added a comment -

          Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it.

          I agree with garbage, but we cannot completely replace it anyway. For example, what if someone supplies a term of length 54 and asks for a distance of 0.5?
          We should not use this algorithm for nonsense like that; in that case I think they should just use the garbage algorithm.

          Here is a quote from that moman page:
          It means that in theory, you could ask for a Levenshtein distance of 27! Well, if you have a week ahead of you...

          We shouldn't burn cycles creating useless tables that will be huge arrays, either in FuzzyQuery or wherever. We can't compute all the way up to infinity; this is why I think something like 1, 2, 3 is "reasonable". If it requires more edits than that, go with the old brute-force algorithm?
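
          As a back-of-the-envelope check of where that length-54 example leads, here is a tiny sketch, assuming the usual FuzzyQuery-style scaling of similarity = 1 - edits/length (the variable names are illustrative, not Lucene API):

          int termLength = 54;
          float minSimilarity = 0.5f;
          // the largest edit count whose similarity still reaches minSimilarity
          int maxEdits = (int) ((1.0f - minSimilarity) * termLength); // = 27

          That 27 is exactly the "Levenshtein distance of 27" the moman page jokes about needing a week for.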

          Mark Miller added a comment - edited

          we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

          I'm not so worried about that myself. The prefix on the old Fuzzy is just there to get around scalability - we could just deprecate it and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?

          edit

          Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old one and saying we are starting over with a "real" FuzzyQuery - the old one goes away on the next major release - anyone dying to stick with it can just pull the query into their own code.

          I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.

          edit

          Not that I can't see people wanting just the tail end to be fuzzy - in that case, fine - do your dirty work. I just don't know how much of a real requirement that is compared to a prefix query. Perhaps it's more useful than I'd guess myself. I think we are quite a ways from getting this alg implemented anyway, though I am trying to track down the latest python code.

          Robert Muir added a comment - edited

          With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

          You write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
          We must "use" the prefix, so the results are the same as the old FuzzyQuery for back compat - might as well do it right.

          Btw, I am implying here that we should never do a real Levenshtein dynamic-programming calculation if we do not have to.
          For example, in my k=1 prototype, it is never used:

          public float difference() {
            // exact match (or no current term yet): perfect score
            if (currentTerm == null || currentTerm.equals(term))
              return 1.0F;
            // otherwise the term matched the k=1 DFA, so the edit distance is
            // exactly 1: score it without ever running Levenshtein
            else
              return 1.0F - (1.0F / (Math.max(currentTerm.text().length(), term.text().length())));
          }

          Instead, if up to k=3 is required, we should do these steps (see the sketch after this list):

          • equals()? then 1.0F
          • matches the k=1 automaton?
          • matches the k=2 automaton?
          • matches the k=3 automaton?

          Even for k=3, I am sure this is likely faster on average than running Levenshtein, especially if the query matches many terms.
          Automaton matching is linear time, and just array access.
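
          A minimal sketch of that tiered check, using a dk.brics-style Automaton.run() as the stand-in for the precompiled Levenshtein DFAs. The lev1/lev2/lev3 automata are hypothetical inputs here - building them cheaply is the whole point of this issue:

          import dk.brics.automaton.Automaton;

          class TieredDifference {
            private final Automaton lev1, lev2, lev3; // hypothetical precompiled DFAs for k=1,2,3
            private final String target;

            TieredDifference(String target, Automaton lev1, Automaton lev2, Automaton lev3) {
              this.target = target;
              this.lev1 = lev1; this.lev2 = lev2; this.lev3 = lev3;
            }

            float difference(String current) {
              if (current == null || current.equals(target))
                return 1.0f;                          // k=0: exact match
              int k;
              if (lev1.run(current))      k = 1;      // cheapest DFA first: pure array lookups
              else if (lev2.run(current)) k = 2;
              else if (lev3.run(current)) k = 3;
              else return 0.0f;                       // beyond the supported tables
              // same scaling as the k=1 prototype above, now with the measured k
              return 1.0f - ((float) k / Math.max(current.length(), target.length()));
            }
          }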

          Mark Miller added a comment -

          the constant prefix is just an optimization/hack

          Right - which is part of why I am not so worried about handling it with a new impl - it's really only there now because the thing is so slow without it. If we really needed to support it, it wouldn't be so bad to fall back to the old enum as it is. Though if we can support it with the new impl, fine - but I don't think I would go out of my way for it myself.

          for english, it just makes your fuzzy query 26 times faster

          With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

          Robert Muir added a comment - edited

          Mark, maybe, though it also depends largely on the size of the "alphabet" in the term dictionary.

          For my test, the alphabet is pretty small, [0-9]. With smaller alphabets, say [GACT], the automaton enumeration becomes even more effective over any "prefix" mechanism,

          so the performance is language-dependent.

          Also, Mark, you said you want to "scale" fuzzy query, right? Taking a linear algorithm and dividing it by a constant (the alphabet size), which is what a constant prefix does, does not improve the scalability. For English, it just makes your fuzzy query 26 times faster. So imo it would be better to improve the real scalability... the constant prefix is just an optimization/hack but does not really solve the problem.

          Mark Miller added a comment -

          Right, I wouldn't expect it to be great with a single-char prefix - especially with the random terms you're making. But I think that's a much worse case than a slightly longer prefix with a normal term distribution.

          Robert Muir added a comment -

          Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, right? Guess it depends on your terms though.

          Not so bad, but not so good either. See my wildcard results, for example (on the 10M equally distributed terms):

          Pattern    Iter  AvgHits  AvgMS (old)  AvgMS (new)
          N?N?N?N    10    1000.0   288.6        38.5
          Mark Miller added a comment -

          Mark, they would get large fast, but I think we only need say 1, 2, 3.

          Ah, right - you mention that above in the summary.

          Another twist, is that we have to support the 'constant prefix' as well.

          Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, right? Guess it depends on your terms though.

          Robert Muir added a comment -

          Another twist is that we have to support the 'constant prefix' as well.
          The easiest way would be: if the user asks for this, we intersect the automaton with a 'prefix automaton' such as abc.*

          But BasicOperations.intersection is quadratic in the number of states, because it supports both NFAs and DFAs.
          We might need to implement a better DFA-only intersection algorithm for this constant prefix, so that supplying a constant prefix doesn't actually make things slower.
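
          For illustration, here is a minimal sketch of that intersection using the dk.brics.automaton API (the same BasicOperations.intersection the comment mentions). The "fuzzy" automaton is faked with a regex, since the real Levenshtein DFA is exactly what this issue builds; only the prefix-intersection step is the point:

          import dk.brics.automaton.Automaton;
          import dk.brics.automaton.BasicOperations;
          import dk.brics.automaton.RegExp;

          public class PrefixIntersectionSketch {
            public static void main(String[] args) {
              // hypothetical stand-in for a Levenshtein DFA around the term "abcd"
              Automaton fuzzy = new RegExp("a(bcd|bd|bcde|xcd)").toAutomaton();

              // the constant prefix "ab", expressed as the automaton ab.*
              Automaton prefix = new RegExp("ab.*").toAutomaton();

              // the general (quadratic) operation the comment refers to;
              // a DFA-only version could exploit determinism to go faster
              Automaton combined = BasicOperations.intersection(fuzzy, prefix);

              System.out.println(combined.run("abd"));  // true: fuzzy match that keeps the prefix
              System.out.println(combined.run("axcd")); // false: filtered out by the prefix
            }
          }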

          Robert Muir added a comment -

          Mark, they would get large fast, but I think we only need say 1, 2, 3.
          Once you go to high K, it will be many disk seeks anyway, and 'linear' mode, like it does now, will probably be faster.

          Mark Miller added a comment -

          we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

          I wonder what the best way to handle this would be - these transition tables appear to get very large very fast. It looks like the python stuff is now calculating them on the fly. You wouldn't want to load them all up into RAM for those not using them. Perhaps some sort of LRU cache with load on demand, or something - though you can't really fully warm up that way.
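
          A minimal sketch of that "LRU cache, load on demand" idea, using LinkedHashMap's access-order mode; the table type and the computeTable() hook are placeholders, not anything from this patch:

          import java.util.LinkedHashMap;
          import java.util.Map;

          class TableCache extends LinkedHashMap<Integer, int[][]> {
            private final int maxEntries;

            TableCache(int maxEntries) {
              super(16, 0.75f, true);           // accessOrder=true gives LRU iteration order
              this.maxEntries = maxEntries;
            }

            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, int[][]> eldest) {
              return size() > maxEntries;       // evict the least-recently-used table
            }

            int[][] tableFor(int n) {
              return computeIfAbsent(n, k -> computeTable(k)); // load on demand
            }

            private int[][] computeTable(int n) {
              return new int[0][];              // placeholder for the real table computation
            }
          }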

          Robert Muir added a comment -

          I'll take a look anyway - too bad I can't find my stupid automata textbook - already bought the stupid thing twice in my life. Python translation sounds easier anyway.

          You will need more wine. If it starts to catch, I'll be happy to offer help with use of the automaton API, etc.

          Mark Miller added a comment -

          I'll take a look anyway - too bad I can't find my stupid automata textbook - already bought the stupid thing twice in my life. Python translation sounds easier anyway.

          Robert Muir added a comment -

          I hope it's obvious from the benchmark why we shouldn't use the crappy prototype, and should instead optimize K=1 (which is probably very common).
          If someone has a small index with very long terms (bio sequences, or something), then fuzzy would actually get slower.

          Robert Muir added a comment -

          By the way, the only open impl of this algorithm I could find is at http://rrette.com/moman.html (ZSpell), in python.
          The download link for 0.2 is not available, and it appears from the website that this is the version with the updated algorithm.

          Uwe Schindler added a comment -

          ha - too much wine last night to laugh that hard this morning. Painful.

          I need more beer, after that 3.0.0 problem with the AttributeSource API (LUCENE-2088).

          Robert Muir added a comment -

          Modified the description to be readable: K is the number of edits, N refers to the length of the word.

          Mark Miller added a comment -

          (I will assign this to him, I know he is itching to write that nasty algorithm

          ha - too much wine last night to laugh that hard this morning. Painful.


            People

            • Assignee:
              Robert Muir
            • Reporter:
              Robert Muir
            • Votes:
              0
            • Watchers:
              2
