Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1124

short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I found this (unreplied to) email floating around in my Lucene folder from during the holidays...

      From: Timo Nentwig
      To: java-dev
      Subject: Fuzzy makes no sense for short tokens
      Date: Mon, 31 Dec 2007 16:01:11 +0100
      Message-Id: <200712311601.12255.lucene@nitwit.de>
      
      Hi!
      
      it generally makes no sense to search fuzzy for short tokens because changing
      even only a single character of course already results in a high edit
      distance. So it actually only makes sense in this case:
      
                 if( token.length() > 1f / (1f - minSimilarity) )
      
      E.g. changing one character in a 3-letter token (foo) results in an edit
      distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
      we can save all the expensive rewrite() logic.
      

      I don't know much about FuzzyQueries, but this reasoning seems sound ... FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in the event that the input token is shorter then some simple math on the minSimilarity. (i'm not smart enough to be certain that the math above is right however ... it's been a while since i looked at Levenstein distances ... tests needed)

        Attachments

        1. LUCENE-1124.patch
          3 kB
          Michael McCandless
        2. LUCENE-1124.patch
          3 kB
          Mark Miller
        3. LUCENE-1124.patch
          3 kB
          Mark Miller
        4. LUCENE-1124.patch
          3 kB
          Mark Miller

          Activity

            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
              Reporter:
              hossman Hoss Man
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: