Lucene - Core
LUCENE-5033

SlowFuzzyQuery appears to fail with edit distance >=3 in some cases

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The Levenshtein edit distance between "monday" and "montugu" should be 4. The following test issues a query with "sim" set to 3, and yet there is a hit.
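
      (Checking the arithmetic: "monday" and "montugu" share the prefix "mon"; turning the remaining "day" into "tugu" takes the substitutions d→t, a→u, y→g plus one inserted trailing "u", and since "day" and "tugu" have no characters in common no cheaper alignment exists, so the distance is 4 and a query limited to 3 edits should return no hits.)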

      public void testFuzzinessLong2() throws Exception {
        Directory directory = newDirectory();
        RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
        addDoc("monday", writer);
        IndexReader reader = writer.getReader();
        IndexSearcher searcher = newSearcher(reader);
        writer.close();

        SlowFuzzyQuery query = new SlowFuzzyQuery(new Term("field", "montugu"), 3, 0);
        ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
        assertEquals(0, hits.length);

        reader.close();
        directory.close();
      }

      Attachments

      1. LUCENE-5033.patch (12 kB) - Tim Allison
      2. LUCENE-5033.patch (9 kB) - Tim Allison

        Activity

        Tim Allison added a comment -

        First draft of patch attached. Let me know how this looks. Thank you.

        Michael McCandless added a comment -

        Thanks Tim! This looks like a great improvement: I like factoring out calcDistance from calcSimilarity.

        And I like that we now take raw into account when figuring out which comparison to make to accept the term or not. Maybe we could improve it a bit: if raw is true we don't need to calcSimilarity right?

        For my sanity ... where exactly was the bug in the original code?

        Tim Allison added a comment -

        Thank you for your quick response!

        I, too, was hoping to avoid calcSimilarity if raw is true, but I think we need it to calculate the boost. Let me know if I'm missing something.

        The bug in the original code was that FilteredTermsEnum sets minSimilarity to 0 when the user-specified minSimilarity is >= 1.0f. So, in SlowFuzzyTermsEnum, similarity (unless it was Float.NEGATIVE_INFINITY) was typically > minSimilarity no matter its value. In other words, when the client code made the call with minSimilarity >=1.0f, that value was correctly recorded in maxEdits, but maxEdits wasn't the determining factor in whether SlowFuzzyTerms accepted a term.
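
        A toy model of the comparison described above (illustrative only: AcceptSketch, calcDistance and calcSimilarity are stand-ins for the helpers discussed in this issue, and the similarity formula is an assumption rather than the exact Lucene source):

        public class AcceptSketch {

          // Plain dynamic-programming Levenshtein distance.
          static int calcDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
              for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
              }
            }
            return d[a.length()][b.length()];
          }

          // Assumed similarity formula: 1 - distance / min(term lengths).
          static float calcSimilarity(int distance, int targetLen, int textLen) {
            return 1.0f - ((float) distance / Math.min(targetLen, textLen));
          }

          public static void main(String[] args) {
            int maxEdits = 3;            // the raw edit distance the caller asked for
            float minSimilarity = 0.0f;  // ...but the superclass ctor zeroed this field

            int distance = calcDistance("montugu", "monday");        // 4
            float similarity = calcSimilarity(distance, 7, 6);       // ~0.33

            boolean acceptedBeforeFix = similarity > minSimilarity;  // true: the term slips through
            boolean acceptedAfterFix = distance <= maxEdits;         // false: what raw mode should check
            System.out.println(distance + " " + acceptedBeforeFix + " " + acceptedAfterFix);
          }
        }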

        Robert Muir added a comment -

        Doing an explicit levenshtein calculation here sort of defeats the entire purpose of having levenshtein automata at all!

        Michael McCandless added a comment -

        I, too, was hoping to avoid calcSimilarity if raw is true, but I think we need it to calculate the boost. Let me know if I'm missing something.

        Ahh, you're right ... I missed that. OK.

        The bug in the original code was that FilteredTermsEnum sets minSimilarity to 0 when the user-specified minSimilarity is >= 1.0f. So, in SlowFuzzyTermsEnum, similarity (unless it was Float.NEGATIVE_INFINITY) was typically > minSimilarity no matter its value. In other words, when the client code made the call with minSimilarity >=1.0f, that value was correctly recorded in maxEdits, but maxEdits wasn't the determining factor in whether SlowFuzzyTerms accepted a term.

        Oh, I see: FuzzyTermsEnum does this in its ctor, and SlowFuzzyTermsEnum extends that. Now I understand the bug ... thanks.

        Doing an explicit levenshtein calculation here sort of defeats the entire purpose of having levenshtein automata at all!

        But this fix only applies in cases (edit distance > 2) where the automata don't, I think? (The fixes are to LinearFuzzyTermsEnum.)

        Tim Allison added a comment - edited

        Robert,

        I agree that it appears to, but if you want a distance > 2, the current levenshtein automaton doesn't allow that (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html?showComment=1303598602291#c3051732466052117784). The classic QueryParser silently converts distances > 2 to 2.

        If I understand SlowFuzzyQuery correctly, it uses the levenshtein automaton for distances <= 2, but it runs brute force if the distance is > 2.

        My personal preference would be to undeprecate SlowFuzzyQuery (certainly leave it in the sandbox) because it offers a capability that the current levenshtein automaton doesn't. In cases where the indices are very large, it wouldn't make sense to expose distance > 2 capability; but on small to medium indices, there are use cases that require it.
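
        For illustration, a minimal sketch of the contrast under the 4.x APIs referenced in this issue (the exact exception behavior of FuzzyQuery for maxEdits > 2 is assumed from the discussion above; SlowFuzzyQuery is constructed the same way as in the test in the description):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;
        import org.apache.lucene.search.FuzzyQuery;

        public class MaxEditsSketch {
          public static void main(String[] args) {
            Term term = new Term("field", "montugu");

            // The core FuzzyQuery rides on the Levenshtein automaton and is capped at
            // two edits; asking for three is rejected up front.
            try {
              new FuzzyQuery(term, 3);
            } catch (IllegalArgumentException e) {
              System.out.println("FuzzyQuery refuses maxEdits > 2: " + e.getMessage());
            }

            // The sandbox SlowFuzzyQuery accepts a raw distance of 3 (values >= 1 are
            // treated as edit counts) and falls back to brute-force term enumeration
            // for distances the automaton cannot cover.
            SlowFuzzyQuery slow = new SlowFuzzyQuery(term, 3, 0);
            System.out.println(slow);
          }
        }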

        Tim Allison added a comment - edited

        Updated patch. Added short circuits to avoid calculating similarity when not necessary. Corrected term.length to text.length in calcSimilarity call. Activated old tests that test for edit distance matches where the edit distances are greater than the query term length.
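
        A hypothetical sketch of the kind of short circuit described (not the actual LUCENE-5033 patch): in raw mode, reject on the edit distance alone and compute the similarity only when a boost is actually needed.

        public class ShortCircuitSketch {

          // Assumed similarity formula, computed against the candidate text's length
          // (the patch note above corrects term.length to text.length in this call).
          static float calcSimilarity(int distance, int targetLen, int textLen) {
            return 1.0f - ((float) distance / Math.min(targetLen, textLen));
          }

          // Returns NEGATIVE_INFINITY for a rejected term, otherwise the boost.
          static float acceptRaw(int distance, int maxEdits, int targetLen, int textLen) {
            if (distance > maxEdits) {
              return Float.NEGATIVE_INFINITY;  // short circuit: no similarity calculation
            }
            return calcSimilarity(distance, targetLen, textLen);
          }

          public static void main(String[] args) {
            System.out.println(acceptRaw(4, 3, 7, 6));  // "montugu" vs "monday": rejected
            System.out.println(acceptRaw(2, 3, 7, 6));  // within 3 edits: boost ~0.67
          }
        }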

        Michael McCandless added a comment -

        Thanks Tim, new patch looks great!

        Michael McCandless added a comment -

        Thanks Tim!

        Tim Allison added a comment -

        Mike,
        Thank you for your feedback and quick response!

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee: Unassigned
          • Reporter: Tim Allison
          • Votes: 0
          • Watchers: 3
