Lucene - Core
  1. Lucene - Core
  2. LUCENE-4024

FuzzyQuery should never do edit distance > 2

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Edit distance 1 and 2 are now very very fast compared to 3.x (100X-200X faster) ... but edit distance 3 will fallback to the super-slow scan all terms in 3.x, which is not graceful degradation.

      Not sure how to fix it ... mabye we have a SlowFuzzyQuery? And FuzzyQuery throws exc if you try to ask it to be slow? Or, we add boolean (off by default) that you must turn on to allow slow one..?

      1. LUCENE-4024.patch
        178 kB
        Robert Muir

        Activity

        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Chris Male added a comment -

        I've opened LUCENE-4040 to try to improve the documentation for QPs.

        Show
        Chris Male added a comment - I've opened LUCENE-4040 to try to improve the documentation for QPs.
        Hide
        Robert Muir added a comment -

        Should there be a separate issue to update doc for the query parser(s) beyond Lucene (I see that Lucene Query Parser is updated)?

        All queryparsers were updated. But ClassicQueryParser (the lucene one) is the only one that really
        documents its syntax though, so thats where the doc update occurred.

        The rest of the QPs mostly re-use ClassicQueryParser's syntax and don't document any different syntax.

        Seriously (unrelated) if you have javadocs for that queryparser module, in general (especially classes with no javadocs at all!!!!) just throw them up whereever, email list, this issue, some new issue, I dont care. Ill commit them.

        Show
        Robert Muir added a comment - Should there be a separate issue to update doc for the query parser(s) beyond Lucene (I see that Lucene Query Parser is updated)? All queryparsers were updated. But ClassicQueryParser (the lucene one) is the only one that really documents its syntax though, so thats where the doc update occurred. The rest of the QPs mostly re-use ClassicQueryParser's syntax and don't document any different syntax. Seriously (unrelated) if you have javadocs for that queryparser module, in general (especially classes with no javadocs at all!!!!) just throw them up whereever, email list, this issue, some new issue, I dont care. Ill commit them.
        Hide
        Steve Rowe added a comment -

        I updated svn and see the Javadoc now.

        Jack, do you know about the commit notifications mailing list? If not, see http://lucene.apache.org/core/discussion.html for details.

        Show
        Steve Rowe added a comment - I updated svn and see the Javadoc now. Jack, do you know about the commit notifications mailing list? If not, see http://lucene.apache.org/core/discussion.html for details.
        Hide
        Walter Underwood added a comment -

        Yes, that is exactly where I looked, and I missed it, sorry. Up late with a barfing child and a barfing Solr server in prod.

        Show
        Walter Underwood added a comment - Yes, that is exactly where I looked, and I missed it, sorry. Up late with a barfing child and a barfing Solr server in prod.
        Hide
        Jack Krupansky added a comment - - edited

        I updated svn and see the Javadoc now. A notation in CHANGES.txt would be nice too since this is a user-visiable issue. Should there be a separate issue to update doc for the query parser(s) beyond Lucene (I see that Lucene Query Parser is updated)?

        Show
        Jack Krupansky added a comment - - edited I updated svn and see the Javadoc now. A notation in CHANGES.txt would be nice too since this is a user-visiable issue. Should there be a separate issue to update doc for the query parser(s) beyond Lucene (I see that Lucene Query Parser is updated)?
        Hide
        Robert Muir added a comment -

        Did you look at mmmm say the changes to FuzzyQuery.java? package.html?

         /** Implements the fuzzy search query. The similarity measurement
        - * is based on the Levenshtein (edit distance) algorithm.
        + * is based on the Damerau-Levenshtein (optimal string alignment) algorithm.
          * 
        
        +   * @param transpositions true if transpositions should be treated as a primitive
        +   *        edit operation. If this is false, comparisons will implement the classic
        +   *        Levenshtein algorithm.
        
        -<p>Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. 
        +<p>Lucene supports fuzzy searches based on Damerau-Levenshtein Distance. 
        
        Show
        Robert Muir added a comment - Did you look at mmmm say the changes to FuzzyQuery.java? package.html? /** Implements the fuzzy search query. The similarity measurement - * is based on the Levenshtein (edit distance) algorithm. + * is based on the Damerau-Levenshtein (optimal string alignment) algorithm. * + * @param transpositions true if transpositions should be treated as a primitive + * edit operation. If this is false, comparisons will implement the classic + * Levenshtein algorithm. -<p>Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. +<p>Lucene supports fuzzy searches based on Damerau-Levenshtein Distance.
        Hide
        Walter Underwood added a comment -

        Looking at the commit, the Javadoc does not give the default for transpositions. Reading the code, it defaults to true, which is a behavior change. That's fine, but it should be documented.

        Like Jack, I think it would be a good idea to specifically say Damerau-Levenshtein.

        Show
        Walter Underwood added a comment - Looking at the commit, the Javadoc does not give the default for transpositions. Reading the code, it defaults to true, which is a behavior change. That's fine, but it should be documented. Like Jack, I think it would be a good idea to specifically say Damerau-Levenshtein.
        Hide
        Robert Muir added a comment -

        You have to look at the commit, not the patch (which was missing javadocs). See subversion commits tab.

        Show
        Robert Muir added a comment - You have to look at the commit, not the patch (which was missing javadocs). See subversion commits tab.
        Hide
        Jack Krupansky added a comment -

        The levenshtein distance has changed to include transposition as a primitive edit

        Is there any user-visible doc about that change? I don't see any mention in CHANGES.txt or the Javadoc for FuzzyQuery.

        At least according to the Wikipedia, the addition of transposition as a primitive would be referred to as the "Damerau–Levenshtein distance".
        http://en.wikipedia.org/wiki/Levenshtein_distance
        http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

        At least the Javadoc for FuzzyQuery should have a link to whatever the technically correct specification is.

        A few examples would be nice as well.

        Show
        Jack Krupansky added a comment - The levenshtein distance has changed to include transposition as a primitive edit Is there any user-visible doc about that change? I don't see any mention in CHANGES.txt or the Javadoc for FuzzyQuery. At least according to the Wikipedia, the addition of transposition as a primitive would be referred to as the "Damerau–Levenshtein distance". http://en.wikipedia.org/wiki/Levenshtein_distance http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance At least the Javadoc for FuzzyQuery should have a link to whatever the technically correct specification is. A few examples would be nice as well.
        Hide
        Robert Muir added a comment -

        Distance two is necessary to handle transpositions

        Thats not true. The levenshtein distance has changed to include transposition as a primitive edit.

        Show
        Robert Muir added a comment - Distance two is necessary to handle transpositions Thats not true. The levenshtein distance has changed to include transposition as a primitive edit.
        Hide
        Walter Underwood added a comment -

        I'm not sure that the floating point spec should be deprecated. I'd like to have the option of distance 1 for short terms and distance 2 for longer ones. Distance two is necessary to handle transpositions, but gets a very broad match from short terms. Doing that through the float spec might be clumsy, but it would work.

        Show
        Walter Underwood added a comment - I'm not sure that the floating point spec should be deprecated. I'd like to have the option of distance 1 for short terms and distance 2 for longer ones. Distance two is necessary to handle transpositions, but gets a very broad match from short terms. Doing that through the float spec might be clumsy, but it would work.
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        rmuir committed 1334819 (23 files)
        Reviews: none

        LUCENE-4024: FuzzyQuery should never do edit distance > 2

        Lucene trunk
        Hide
        Michael McCandless added a comment -

        +1

        Nice to see LinearFuzzyTE moved out of core!

        Show
        Michael McCandless added a comment - +1 Nice to see LinearFuzzyTE moved out of core!
        Robert Muir made changes -
        Attachment LUCENE-4024.patch [ 12525335 ]
        Hide
        Robert Muir added a comment -

        I agree: this crazy floating point specification of distance is hairy to be compatible with 3.x

        But i think this is all a huge trap, attached is a patch that:

        • removes slow capability from FuzzyTermsEnum
        • Cleans up FuzzyQuery: removes float-ctors, allows transpositions as primitive edits, etc.
        • adds a deprecated SlowFuzzyQuery to sandbox/ that has the old ctors
        • adds a deprecated SlowFuzzyTermsEnum that it uses, which extends FuzzyTermsEnum and adds slowness.

        I added a helper static method (deprecated) to FuzzyQuery that converts from the old float sim stuff to number of edits, but ceilinged at what automata support (this is used to easily cut over queryparsers).

        All tests pass but patch needs javadocs. Especially I think we should adjust the query syntax and mark the old ~0.xxx stuff as deprecated, since qps can already do do ~1 ~2 now. Then we can really cleanup for 5.0

        P.S. patch is huge since i didnt use SVN adds/removes, but makes it easy to apply.

        Show
        Robert Muir added a comment - I agree: this crazy floating point specification of distance is hairy to be compatible with 3.x But i think this is all a huge trap, attached is a patch that: removes slow capability from FuzzyTermsEnum Cleans up FuzzyQuery: removes float-ctors, allows transpositions as primitive edits, etc. adds a deprecated SlowFuzzyQuery to sandbox/ that has the old ctors adds a deprecated SlowFuzzyTermsEnum that it uses, which extends FuzzyTermsEnum and adds slowness. I added a helper static method (deprecated) to FuzzyQuery that converts from the old float sim stuff to number of edits, but ceilinged at what automata support (this is used to easily cut over queryparsers). All tests pass but patch needs javadocs. Especially I think we should adjust the query syntax and mark the old ~0.xxx stuff as deprecated, since qps can already do do ~1 ~2 now. Then we can really cleanup for 5.0 P.S. patch is huge since i didnt use SVN adds/removes, but makes it easy to apply.
        Michael McCandless made changes -
        Field Original Value New Value
        Summary FuzzyQuery should never do edit distance > 3 FuzzyQuery should never do edit distance > 2
        Michael McCandless created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development