Lucene - Core
  1. Lucene - Core
  2. LUCENE-1124

short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I found this (unreplied to) email floating around in my Lucene folder from during the holidays...

      From: Timo Nentwig
      To: java-dev
      Subject: Fuzzy makes no sense for short tokens
      Date: Mon, 31 Dec 2007 16:01:11 +0100
      Message-Id: <200712311601.12255.lucene@nitwit.de>
      
      Hi!
      
      it generally makes no sense to search fuzzy for short tokens because changing
      even only a single character of course already results in a high edit
      distance. So it actually only makes sense in this case:
      
                 if( token.length() > 1f / (1f - minSimilarity) )
      
      E.g. changing one character in a 3-letter token (foo) results in an edit
      distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
      we can save all the expensive rewrite() logic.
      

      I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

      1. LUCENE-1124.patch
        3 kB
        Michael McCandless
      2. LUCENE-1124.patch
        3 kB
        Mark Miller
      3. LUCENE-1124.patch
        3 kB
        Mark Miller
      4. LUCENE-1124.patch
        3 kB
        Mark Miller

        Activity

        Hide
        Otis Gospodnetic added a comment -

        This makes sense, Timo. Could you please attach a patch + a simple unit test?

        Show
        Otis Gospodnetic added a comment - This makes sense, Timo. Could you please attach a patch + a simple unit test?
        Hide
        Mark Miller added a comment -

        This optimization is correct. Highlights some interesting things about fuzzy query as well i.e. if you put a minsim of 0.9, your query term has to be over 10 chars to have any hope of getting a match. For the default of 0.5 its 2 chars, so in the common case the optimization doesn't do much good, and you do have to pay for the check every time no matter what. For larger minsim values though, this will turn a lot of fuzz queries into no ops.

        • Mark
        Show
        Mark Miller added a comment - This optimization is correct. Highlights some interesting things about fuzzy query as well i.e. if you put a minsim of 0.9, your query term has to be over 10 chars to have any hope of getting a match. For the default of 0.5 its 2 chars, so in the common case the optimization doesn't do much good, and you do have to pay for the check every time no matter what. For larger minsim values though, this will turn a lot of fuzz queries into no ops. Mark
        Hide
        Mark Miller added a comment -

        Computing the needed term length in the constructor is probably better.

        Show
        Mark Miller added a comment - Computing the needed term length in the constructor is probably better.
        Hide
        Hoss Man added a comment -

        I don't deal with FuzzyQueries much, but skimming this issue it seems to touch on a lot of hte same things that spawned the creation of the "mm" syntax for specifying the "minNumberShouldMatch" value on BooleanQueries in the Solr dismax query parser...

        http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html

        ...perhaps something similar could be used to allow people to specify simpel expressions for dictating the "fuzzyiness" of short input vs medium length input, vs long input.

        Show
        Hoss Man added a comment - I don't deal with FuzzyQueries much, but skimming this issue it seems to touch on a lot of hte same things that spawned the creation of the "mm" syntax for specifying the "minNumberShouldMatch" value on BooleanQueries in the Solr dismax query parser... http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html ...perhaps something similar could be used to allow people to specify simpel expressions for dictating the "fuzzyiness" of short input vs medium length input, vs long input.
        Hide
        Mark Miller added a comment -

        Updated to trunk.

        Im going to commit in few days if no one objects.

        Show
        Mark Miller added a comment - Updated to trunk. Im going to commit in few days if no one objects.
        Hide
        Mark Miller added a comment -

        Thanks! Committed.

        Show
        Mark Miller added a comment - Thanks! Committed.
        Hide
        Michael McCandless added a comment -

        This fix breaks the case when the exact term is present in the index.

        Show
        Michael McCandless added a comment - This fix breaks the case when the exact term is present in the index.
        Hide
        Michael McCandless added a comment -

        Attach patch (based on 2.9) showing the bug, along with the fix. Instead of rewriting to empty BooleanQuery when prefix term is not long enough, I rewrite to TermQuery with that prefix. This way the exact term matches.

        I'll commit shortly to trunk & 2.9.x.

        Show
        Michael McCandless added a comment - Attach patch (based on 2.9) showing the bug, along with the fix. Instead of rewriting to empty BooleanQuery when prefix term is not long enough, I rewrite to TermQuery with that prefix. This way the exact term matches. I'll commit shortly to trunk & 2.9.x.
        Hide
        Michael McCandless added a comment -

        Bulk close all 2.9.1 issues.

        Show
        Michael McCandless added a comment - Bulk close all 2.9.1 issues.

          People

          • Assignee:
            Mark Miller
            Reporter:
            Hoss Man
          • Votes:
            1 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development