Lucene - Core
  1. Lucene - Core
  2. LUCENE-1310

Phrase query with term repeated 3 times requires more slop than expected

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 2.3.2
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Consider a document with the text "A A A".
      The phrase query "A A A" (exact match) succeeds.
      The query "A A A"~1 (same document and query, just increasing the slop value by one) fails.
      "A A A"~2 succeeds again.

      If the exact match succeeds, I wouldn't expect the same query but with more slop to fail. The fault seems to require some term to be repeated at least three times in the query, but the three occurrences do not need to be adjacent. I will attach a file that contains a set of JUnit tests that demonstrate what I mean.

      1. LUCENE-1310.1.patch
        2 kB
        Grant Glouser
      2. LUCENE-1310.patch
        13 kB
        Doron Cohen
      3. LUCENE-1310.patch
        13 kB
        Doron Cohen
      4. LUCENE-1310.patch
        12 kB
        Doron Cohen
      5. LUCENE-1310.patch
        11 kB
        Doron Cohen
      6. LUCENE-1310.patch
        10 kB
        Doron Cohen
      7. LUCENE-2.3.1-1310.patch
        13 kB
        Doron Cohen
      8. TestSloppyPhraseQuery.java
        3 kB
        Grant Glouser

        Activity

        Hide
        Doron Cohen added a comment -

        I am too getting the error with the test. It is a bug indeed.
        I think I see where the problem is - working on it.

        Show
        Doron Cohen added a comment - I am too getting the error with the test. It is a bug indeed. I think I see where the problem is - working on it.
        Hide
        Doron Cohen added a comment -

        Patch with a fix.

        Problem was in the logic for advancing PhrasePositions that were pointing to exactly the same term in the document.
        Note that this advancing is required to avoid false matches (as was fixed in LUCENE-736).
        However must first advance the PhrasePosition whose offset (in the query) is the highest.

        As a side effect of this fix sorting of the "repeats" array (at scorer initialization) is no longer required.

        Grant's test is also in the patch, slightly modified.

        Grant, can you give it a try and report here?

        Show
        Doron Cohen added a comment - Patch with a fix. Problem was in the logic for advancing PhrasePositions that were pointing to exactly the same term in the document. Note that this advancing is required to avoid false matches (as was fixed in LUCENE-736 ). However must first advance the PhrasePosition whose offset (in the query) is the highest. As a side effect of this fix sorting of the "repeats" array (at scorer initialization) is no longer required. Grant's test is also in the patch, slightly modified. Grant, can you give it a try and report here?
        Hide
        Grant Glouser added a comment -

        There is a potential NPE. I am adding a patch (which should apply on top of the previous patch) that adds a test case and possible fix.

        I will continue testing this. Thanks for looking at it, Doron.

        Show
        Grant Glouser added a comment - There is a potential NPE. I am adding a patch (which should apply on top of the previous patch) that adds a test case and possible fix. I will continue testing this. Thanks for looking at it, Doron.
        Hide
        Doron Cohen added a comment -

        You're right! And thanks for cacthing this!
        NPE is possible and I see it too with the new test.
        I think the suggested new fix is likely to miss additional matches in the same doc.
        I'll later post a test that shows this.

        (As a side comment, its useful to post patches named "LUCENE-NNN.patch" where NNN is the issue number.
        This way JIRA shows which fix is the most recent very clearly, and also, being a complete patch, everyone can easily apply the entire patch.
        More on this in the Wiki under HowToContribute.)

        Show
        Doron Cohen added a comment - You're right! And thanks for cacthing this! NPE is possible and I see it too with the new test. I think the suggested new fix is likely to miss additional matches in the same doc. I'll later post a test that shows this. (As a side comment, its useful to post patches named "LUCENE-NNN.patch" where NNN is the issue number. This way JIRA shows which fix is the most recent very clearly, and also, being a complete patch, everyone can easily apply the entire patch. More on this in the Wiki under HowToContribute.)
        Hide
        Doron Cohen added a comment -

        Updated patch with a fix for the NPE and with a test that fails with the previous fix for the NPE.
        The point is to switch to the pp with higher (query) offset in case two pps are in the same (doc) position.

        Show
        Doron Cohen added a comment - Updated patch with a fix for the NPE and with a test that fails with the previous fix for the NPE. The point is to switch to the pp with higher (query) offset in case two pps are in the same (doc) position.
        Hide
        Doron Cohen added a comment -

        Previous patch might restore the queue wrongly - pop pp but put pp2.
        This patch fixes that by returning the correct pp into the pq.
        However it is yet not perfect since the one pp returned to pq might not be the last one advanced.
        This means pq could be sorted incorrectly with regard to repeating terms.
        I didn't manage to create a test case that fails due to this - testDoc4_Query3_All_Slops_Should_match in the test was the last trial to catch this.
        The only perfect solution I see is to re-populate the queue when this happens but this is costly and I tend not to do it.
        Open for suggestions...

        Show
        Doron Cohen added a comment - Previous patch might restore the queue wrongly - pop pp but put pp2. This patch fixes that by returning the correct pp into the pq. However it is yet not perfect since the one pp returned to pq might not be the last one advanced. This means pq could be sorted incorrectly with regard to repeating terms. I didn't manage to create a test case that fails due to this - testDoc4_Query3_All_Slops_Should_match in the test was the last trial to catch this. The only perfect solution I see is to re-populate the queue when this happens but this is costly and I tend not to do it. Open for suggestions...
        Hide
        Doron Cohen added a comment -

        Updated patch fixes this issue.
        In case of repeating terms in the query, this might be slower than previous patch, but it is supposedly correct in all cases while the previous one was not guaranteed to be always correct. There are no performance implications for the more common case of no repeating terms in the query.
        I plan to commit this in a day or two.

        Show
        Doron Cohen added a comment - Updated patch fixes this issue. In case of repeating terms in the query, this might be slower than previous patch, but it is supposedly correct in all cases while the previous one was not guaranteed to be always correct. There are no performance implications for the more common case of no repeating terms in the query. I plan to commit this in a day or two.
        Hide
        Grant Glouser added a comment -

        termPositionsDiffer can return the PhrasePositions that was passed in (pp), which means you could be passing the same PhrasePositions to flip in both arguments. But flip assumes that the arguments are different. This can result in an ArrayIndexOutOfBoundsException in flip. Perhaps just checking that pp2 != pp on line 76 would be sufficient to avoid this.

        I have not been able to come up with a simple test case that triggers this. I have a complex one, but it uses a custom Analyzer.

        Show
        Grant Glouser added a comment - termPositionsDiffer can return the PhrasePositions that was passed in (pp), which means you could be passing the same PhrasePositions to flip in both arguments. But flip assumes that the arguments are different. This can result in an ArrayIndexOutOfBoundsException in flip. Perhaps just checking that pp2 != pp on line 76 would be sufficient to avoid this. I have not been able to come up with a simple test case that triggers this. I have a complex one, but it uses a custom Analyzer.
        Hide
        Doron Cohen added a comment -

        You're right, Grant, good catch, thanks!
        (I'm sure had this check but lost it when refactoring flipping to a method)
        Updated patch to follow.

        Show
        Doron Cohen added a comment - You're right, Grant, good catch, thanks! (I'm sure had this check but lost it when refactoring flipping to a method) Updated patch to follow.
        Hide
        Doron Cohen added a comment -

        Updated patch with required check (pp!=pp2) before flipping as Grant suggested.

        Show
        Doron Cohen added a comment - Updated patch with required check (pp!=pp2) before flipping as Grant suggested.
        Hide
        Doron Cohen added a comment -

        same patch but for 2.3.1.

        Show
        Doron Cohen added a comment - same patch but for 2.3.1.
        Hide
        Doron Cohen added a comment -

        Committed to trunk.

        I was wondering if this should be backported to 2.3.1 and/or 2.3.2.
        It is not a major bug so I think not, though it might be critical to some application.
        Didn't find a guideline for this in the wiki but I may be missing it?

        Show
        Doron Cohen added a comment - Committed to trunk. I was wondering if this should be backported to 2.3.1 and/or 2.3.2. It is not a major bug so I think not, though it might be critical to some application. Didn't find a guideline for this in the wiki but I may be missing it?
        Hide
        Doron Cohen added a comment -

        Fixed, thanks for the review Grant!

        Show
        Doron Cohen added a comment - Fixed, thanks for the review Grant!

          People

          • Assignee:
            Doron Cohen
            Reporter:
            Grant Glouser
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development