Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today, if you ask FuzzyQuery to match abcd with edit distance 2, it will fail to match the term ab even though it's 2 edits away.

      Its javadocs explain this:

       * <p>NOTE: terms of length 1 or 2 will sometimes not match because of how the scaled
       * distance between two terms is computed.  For a term to match, the edit distance between
       * the terms must be less than the minimum length term (either the input term, or
       * the candidate term).  For example, FuzzyQuery on term "abcd" with maxEdits=2 will
       * not match an indexed term "ab", and FuzzyQuery on term "a" with maxEdits=2 will not
       * match an indexed term "abc".
      

      On the one hand, I can see that this behavior is sort of justified in that 50% of the characters are different and so this is a very "weak" match, but on the other hand, it's quite unexpected since edit distance is such an exact measure so the terms should have matched.

      It seems like the behavior is caused by internal implementation details about how the relative (floating point) score is computed. I think we should fix it, so that edit distance 2 does in fact match all terms with edit distance <= 2.

      1. LUCENE-7439.patch
        8 kB
        Michael McCandless
      2. LUCENE-7439.patch
        91 kB
        Michael McCandless
      3. LUCENE-7439.patch
        84 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          tallison@mitre.org Tim Allison added a comment -

          Added link to LUCENE-5206. +1 on fixing this. Thank you!

          Show
          tallison@mitre.org Tim Allison added a comment - Added link to LUCENE-5206 . +1 on fixing this. Thank you!
          Hide
          mikemccand Michael McCandless added a comment -

          I was struggling to understand the control flow in
          FuzzyQuery/FuzzyTermsEnum, MultiTermQuery, TopTermsRewrite, etc., so
          as a first step here I cleaned up deprecated code and tried to
          simplify FuzzyTermsEnum somewhat.

          The attached patch is just this cleanup; it doesn't change the
          behavior on short terms.

          All tests pass and I confirmed performance (on Wikipedia) is
          unchanged. I plan to first commit this cleanup (master only, removing
          deprecations), and then separately tackle the short terms.

          Show
          mikemccand Michael McCandless added a comment - I was struggling to understand the control flow in FuzzyQuery/FuzzyTermsEnum, MultiTermQuery, TopTermsRewrite, etc., so as a first step here I cleaned up deprecated code and tried to simplify FuzzyTermsEnum somewhat. The attached patch is just this cleanup; it doesn't change the behavior on short terms. All tests pass and I confirmed performance (on Wikipedia) is unchanged. I plan to first commit this cleanup (master only, removing deprecations), and then separately tackle the short terms.
          Hide
          mikemccand Michael McCandless added a comment -

          Another iteration, just adding a randomized test, which seems to be passing.

          Show
          mikemccand Michael McCandless added a comment - Another iteration, just adding a randomized test, which seems to be passing.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit faf3bc3134c6e5ba3e2caa15762524872e083152 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=faf3bc3 ]

          LUCENE-7439: clean up FuzzyQuery/FuzzyTermsEnum sources

          Show
          jira-bot ASF subversion and git services added a comment - Commit faf3bc3134c6e5ba3e2caa15762524872e083152 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=faf3bc3 ] LUCENE-7439 : clean up FuzzyQuery/FuzzyTermsEnum sources
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4db3e7b8a7ca818002af9041bf10660c25905915 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4db3e7b ]

          LUCENE-7439: improve test case

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4db3e7b8a7ca818002af9041bf10660c25905915 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4db3e7b ] LUCENE-7439 : improve test case
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ba085d4c8487d43089295c7b145c77eba19497b4 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba085d4 ]

          LUCENE-7439: back port test case to 6.x

          Show
          jira-bot ASF subversion and git services added a comment - Commit ba085d4c8487d43089295c7b145c77eba19497b4 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba085d4 ] LUCENE-7439 : back port test case to 6.x
          Hide
          mikemccand Michael McCandless added a comment -

          Here's a patch fixing FuzzyQuery to also accept small terms.

          With the simplified FuzzyTermsEnum it was quite simple to fix it (remove the while loop), and to fix the test case to verify any term within the specified edit distance does match.

          The one wrinkle is that such matches get a boost of 0.0, because the formula we use to compute the boost for a matched term (1.0 - editDistance / minTermLength) can be <= 0. I think this is fair: such matches are poor quality compared to longer term matches.

          Show
          mikemccand Michael McCandless added a comment - Here's a patch fixing FuzzyQuery to also accept small terms. With the simplified FuzzyTermsEnum it was quite simple to fix it (remove the while loop), and to fix the test case to verify any term within the specified edit distance does match. The one wrinkle is that such matches get a boost of 0.0, because the formula we use to compute the boost for a matched term ( 1.0 - editDistance / minTermLength ) can be <= 0. I think this is fair: such matches are poor quality compared to longer term matches.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c7fb49d7b50d171c6d787253b9ab575218fef7fe in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7fb49d ]

          LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short

          Show
          jira-bot ASF subversion and git services added a comment - Commit c7fb49d7b50d171c6d787253b9ab575218fef7fe in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7fb49d ] LUCENE-7439 : FuzzyQuery now matches all terms within the specified edit distance, even if they are short
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 471f90cf825ee3106fef1fa4c1094d0ca461e7fb in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=471f90c ]

          LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short

          Show
          jira-bot ASF subversion and git services added a comment - Commit 471f90cf825ee3106fef1fa4c1094d0ca461e7fb in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=471f90c ] LUCENE-7439 : FuzzyQuery now matches all terms within the specified edit distance, even if they are short
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c79d44f82814d6d798450a422f73f42891cb1ef5 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c79d44f ]

          LUCENE-7439: move CHANGES entry

          Show
          jira-bot ASF subversion and git services added a comment - Commit c79d44f82814d6d798450a422f73f42891cb1ef5 in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c79d44f ] LUCENE-7439 : move CHANGES entry
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development