Lucene - Core
  1. Lucene - Core
  2. LUCENE-124

Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      According to the website's "Query Syntax" page, fuzzy searches are given a
      boost of 0.2. I've found this not to be the case, and have seen situations where
      exact matches have lower relevance scores than fuzzy matches.

      Rather than getting a boost of 0.2, it appears that all variations on the term
      are first found in the model, where dist* > 0.5.

      • dist = levenshteinDistance / length of min(termlength, variantlength)

      This then leads to a boolean OR search of all the variant terms, each of whose
      boost is set to (dist - 0.5)*2 for that variant.

      The upshot of all of this is that there are many cases where a fuzzy match will
      get a higher relevance score than an exact match.

      See this email for a test case to reproduce this anomalous behaviour.
      http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html

      Here is a candidate patch to address the issue -

          • lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09
            13:47:54 2002
          • lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri
            Mar 14 11:37:20 2003
            ***************
          • 99,105 ****
            }

      final protected float difference()

      { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); }

      final public boolean endEnum()

      { --- 99,109 ---- }

      final protected float difference() {
      ! if (distance == 1.0)

      { ! return 1.0f; ! }

      ! else
      ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
      }

      final public boolean endEnum() {
      ***************

          • 111,117 ****
            ******************************/

      public static final double FUZZY_THRESHOLD = 0.5;
      ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);

      /**
      Finds and returns the smallest of three integers
      — 115,121 ----
      ******************************/

      public static final double FUZZY_THRESHOLD = 0.5;
      ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
      FUZZY_THRESHOLD));

      /**
      Finds and returns the smallest of three integers

      1. LUCENE-124.patch
        7 kB
        Robert Muir
      2. LUCENE-124.patch
        5 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Otis Gospodnetic added a comment -

          Cormac, the problem you described at
          http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html seems clear.

          I do not see any mention of 0.2f boost in any of the Fuzzy classes. This is a
          documentation bug, which I will fix soon.

          However, your fix may still be valid, as exact matches should never have lower
          score than fuzzy ones. I would be very greatful if you could submit that
          levtest class as a JUnit test, so we can see the bug clearly before applying
          your patch.

          Finally, why did you choose the boost of 0.2? Why not 0.1 or 0.3 for example?
          And is it possible that choosing a random number such as 0.2, will work for
          your test document set, but may not work for some other cases?

          Thank you.

          Show
          Otis Gospodnetic added a comment - Cormac, the problem you described at http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02819.html seems clear. I do not see any mention of 0.2f boost in any of the Fuzzy classes. This is a documentation bug, which I will fix soon. However, your fix may still be valid, as exact matches should never have lower score than fuzzy ones. I would be very greatful if you could submit that levtest class as a JUnit test, so we can see the bug clearly before applying your patch. Finally, why did you choose the boost of 0.2? Why not 0.1 or 0.3 for example? And is it possible that choosing a random number such as 0.2, will work for your test document set, but may not work for some other cases? Thank you.
          Hide
          Cormac Twomey added a comment -

          I will work on massaging my test case into a JUnit test.

          Meanwhile, I chose the value of 0.2 simply because it is the documented
          behavior, and therefore I considered that to be the expected, even desired,
          behavior. That said, it does appear to be a randomly chosen value, although not
          chosen by me

          Following the logic of how the scoring mechanism works (or at least my
          understanding of it), this is not a universal fix, but rather as I state in my
          original email on lucene-dev, it mitigates the problem. I chose the fix simply
          as it brought the functionality in line with documented behavior.

          The essence of the problem is the battle in scoring between levenshtein distance
          and term frequency - high frequency terms are scored lower than low frequency
          terms. A good example of a low frequency term is a typo in a document. If the
          original correctly spelled word has a very high frequency, the misspelled word
          will come out on top, due to its significantly lower frequency.

          By setting the boost to 0.2, We at least make it 5 times harder (in terms of
          frequency) for the misspelled item to appear ahead of the correctly spelled
          item. But this clearly means that it will still happen.

          --Cormac

          Show
          Cormac Twomey added a comment - I will work on massaging my test case into a JUnit test. Meanwhile, I chose the value of 0.2 simply because it is the documented behavior, and therefore I considered that to be the expected, even desired, behavior. That said, it does appear to be a randomly chosen value, although not chosen by me Following the logic of how the scoring mechanism works (or at least my understanding of it), this is not a universal fix, but rather as I state in my original email on lucene-dev, it mitigates the problem. I chose the fix simply as it brought the functionality in line with documented behavior. The essence of the problem is the battle in scoring between levenshtein distance and term frequency - high frequency terms are scored lower than low frequency terms. A good example of a low frequency term is a typo in a document. If the original correctly spelled word has a very high frequency, the misspelled word will come out on top, due to its significantly lower frequency. By setting the boost to 0.2, We at least make it 5 times harder (in terms of frequency) for the misspelled item to appear ahead of the correctly spelled item. But this clearly means that it will still happen. --Cormac
          Hide
          Mark Harwood added a comment -

          I would suggest this is a duplicate of http://issues.apache.org/jira/browse/LUCENE-329

          The idf rating of expanded terms should be the same and not favour rarer terms. I suggest that this applies to all auto-expanding searches eg range queries.

          Should we drop this bug as a duplicate?

          Show
          Mark Harwood added a comment - I would suggest this is a duplicate of http://issues.apache.org/jira/browse/LUCENE-329 The idf rating of expanded terms should be the same and not favour rarer terms. I suggest that this applies to all auto-expanding searches eg range queries. Should we drop this bug as a duplicate?
          Hide
          Robert Muir added a comment -

          Here is a patch, with a test for the issue.

          This patch adds TOP_TERMS_CONSTANT_BOOLEAN_REWRITE to complement TOP_TERMS_SCORING_BOOLEAN_REWRITE.

          Note: this solution is different than LUCENE-329, but I think this rewrite method could be useful for other queries as well.

          example usage:

          FuzzyQuery query = new FuzzyQuery(new Term("field", "Lucene"));
          query.setRewriteMethod(MultiTermQuery.TOP_TERMS_CONSTANT_BOOLEAN_REWRITE);
          ScoreDoc[] hits = searcher.search(query, ...)
           ...
          
          Show
          Robert Muir added a comment - Here is a patch, with a test for the issue. This patch adds TOP_TERMS_CONSTANT_BOOLEAN_REWRITE to complement TOP_TERMS_SCORING_BOOLEAN_REWRITE. Note: this solution is different than LUCENE-329 , but I think this rewrite method could be useful for other queries as well. example usage: FuzzyQuery query = new FuzzyQuery( new Term( "field" , "Lucene" )); query.setRewriteMethod(MultiTermQuery.TOP_TERMS_CONSTANT_BOOLEAN_REWRITE); ScoreDoc[] hits = searcher.search(query, ...) ...
          Hide
          Robert Muir added a comment -

          I will wait till after the code freeze and commit this in a few days if no one objects.

          I don't claim its a 'best-practice' fix for fuzzy (see LUCENE-329 for ideas on that), I just think TOP_TERMS_CONSTANT_BOOLEAN_REWRITE is a useful complement to TOP_TERMS_SCORING_BOOLEAN_REWRITE, for MultiTermQueries that want the Top-N terms expansion, but the constant score behavior of CONSTANT_BOOLEAN_REWRITE.

          this patch doesnt change any defaults for fuzzy either. in fact its not specific to fuzzy at all.

          Show
          Robert Muir added a comment - I will wait till after the code freeze and commit this in a few days if no one objects. I don't claim its a 'best-practice' fix for fuzzy (see LUCENE-329 for ideas on that), I just think TOP_TERMS_CONSTANT_BOOLEAN_REWRITE is a useful complement to TOP_TERMS_SCORING_BOOLEAN_REWRITE, for MultiTermQueries that want the Top-N terms expansion, but the constant score behavior of CONSTANT_BOOLEAN_REWRITE. this patch doesnt change any defaults for fuzzy either. in fact its not specific to fuzzy at all.
          Hide
          Robert Muir added a comment -

          uwe pointed out to me, i think there is a naming problem with TOP_TERMS_CONSTANT_BOOLEAN_REWRITE, as the entire booleanquery will not produce the same score like CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE.

          I think the behavior makes sense though, as it wouldnt make sense to use TOP_TERMS without per-term boosting, but we need to fix the naming... and TOP_TERMS_BOOST_BOOLEAN_REWRITE sounds confusing.

          Show
          Robert Muir added a comment - uwe pointed out to me, i think there is a naming problem with TOP_TERMS_CONSTANT_BOOLEAN_REWRITE, as the entire booleanquery will not produce the same score like CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. I think the behavior makes sense though, as it wouldnt make sense to use TOP_TERMS without per-term boosting, but we need to fix the naming... and TOP_TERMS_BOOST_BOOLEAN_REWRITE sounds confusing.
          Hide
          Uwe Schindler added a comment -

          I will wait till after the code freeze and commit this in a few days if no one objects.

          The code freeze only affects branches. Trunk is only frozen for fixes that should also go into branches.

          Show
          Uwe Schindler added a comment - I will wait till after the code freeze and commit this in a few days if no one objects. The code freeze only affects branches. Trunk is only frozen for fixes that should also go into branches.
          Hide
          Robert Muir added a comment -

          The code freeze only affects branches. Trunk is only frozen for fixes that should also go into branches.

          ok, well I will wait on this one anyway especially as there is a concern about naming... no rush, looks like its been open for a long time.

          Show
          Robert Muir added a comment - The code freeze only affects branches. Trunk is only frozen for fixes that should also go into branches. ok, well I will wait on this one anyway especially as there is a concern about naming... no rush, looks like its been open for a long time.
          Hide
          Robert Muir added a comment -

          Attached is an updated patch:

          • Synced to trunk as these PQ rewrite methods allow setting of the size
          • Renamed to TopTermsBoostOnlyBooleanQueryRewrite

          Please review, I think this rewrite method would also be helpful for
          improving Fuzzy's junit tests: Testing that scores are correct, etc.

          Show
          Robert Muir added a comment - Attached is an updated patch: Synced to trunk as these PQ rewrite methods allow setting of the size Renamed to TopTermsBoostOnlyBooleanQueryRewrite Please review, I think this rewrite method would also be helpful for improving Fuzzy's junit tests: Testing that scores are correct, etc.
          Hide
          Michael McCandless added a comment -

          Patch looks good Robert!

          Show
          Michael McCandless added a comment - Patch looks good Robert!
          Hide
          Robert Muir added a comment -

          Thanks Mike. I will commit later today if no one objects.

          Show
          Robert Muir added a comment - Thanks Mike. I will commit later today if no one objects.
          Hide
          Robert Muir added a comment -

          Committed revision 920499.

          Show
          Robert Muir added a comment - Committed revision 920499.

            People

            • Assignee:
              Robert Muir
              Reporter:
              Cormac Twomey
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development