Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      NearSpans appears to have a bug in skipTo that causes it to skip over some matching documents completely. I discovered this bug while investigating problems with SpanWeight.explain, but as far as I can tell the Bug is not specific to Explanations ... it seems like it could potentially result in incorrect matching in some situations where a SpanNearQuery is nested in another query such thatskipTo will be used ... I tried to create a high level test case to exploit the bug when searching, but i could not. TestCase exploiting the class using NearSpan and SpanScorer will follow...

      1. TestNearSpans.java
        6 kB
        Hoss Man
      2. SpanScorer.explain.testcase.patch
        0.6 kB
        Hoss Man
      3. SpanNearQuery20060622.patch
        0.6 kB
        Paul Elschot
      4. NearSpansUnordered.java
        7 kB
        Paul Elschot
      5. NearSpansOrdered.java
        9 kB
        Paul Elschot
      6. NearSpans20060903.patch
        35 kB
        Paul Elschot
      7. LUCENE-569.ubber.patch
        35 kB
        Hoss Man
      8. common-build.assertions.patch
        0.6 kB
        Paul Elschot

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          test case using NearSpans directly that demonstrates bug.

          Show
          Hoss Man added a comment - test case using NearSpans directly that demonstrates bug.
          Hide
          Hoss Man added a comment -

          fixed title

          Show
          Hoss Man added a comment - fixed title
          Hide
          Paul Elschot added a comment -

          Hoss,

          There is an alternative NearSpans here:
          http://issues.apache.org/jira/browse/LUCENE-413
          in the files:
          NearSpansOrdered.java
          NearSpansUnordered.java
          SpanNearQueryPatch1.txt

          It would be good to see whether these have the same skipTo bug.
          I'll have a look into this in a few days, please feel free to beat me to it.

          Anyway, thanks for the courage to review all of Lucene's use of Explanation.
          Explaining a score value is just as involved as the normal search mechanisms,
          as it is necessary to redo the whole search, but only for a single document,
          and then literally explain all the details in program code.

          Show
          Paul Elschot added a comment - Hoss, There is an alternative NearSpans here: http://issues.apache.org/jira/browse/LUCENE-413 in the files: NearSpansOrdered.java NearSpansUnordered.java SpanNearQueryPatch1.txt It would be good to see whether these have the same skipTo bug. I'll have a look into this in a few days, please feel free to beat me to it. Anyway, thanks for the courage to review all of Lucene's use of Explanation. Explaining a score value is just as involved as the normal search mechanisms, as it is necessary to redo the whole search, but only for a single document, and then literally explain all the details in program code.
          Hide
          Paul Elschot added a comment -

          Fwiw, TestNearSpans passes on my working copy that has the alternative NearSpans.
          It's still worthwhile to try locate the bug in the current NearSpans.

          The last skipTo bug in DisjunctionSumScorer was caused by me not understanding
          that skipTo() always advances at least one document, even when the given target
          is at or before the current document of the Scorer on which skipTo is called.
          I would not be surprised if this turns out to be a similar problem.

          Show
          Paul Elschot added a comment - Fwiw, TestNearSpans passes on my working copy that has the alternative NearSpans. It's still worthwhile to try locate the bug in the current NearSpans. The last skipTo bug in DisjunctionSumScorer was caused by me not understanding that skipTo() always advances at least one document, even when the given target is at or before the current document of the Scorer on which skipTo is called. I would not be surprised if this turns out to be a similar problem.
          Hide
          Paul Elschot added a comment -

          Hoss,

          I'm afraid you've uncovered a bug in NearSpans.java for the ordered case.
          The test case testNearSpansSkipToLikeNext() uses this test data:
          doc 0: w1 w2 w3 .. ..
          doc 1: w1 w3 w2 w3 ..
          and an ordered SpanNearQuery with slop 1 for "w1 w2 w3" should match doc 0 and doc 1
          The test first does a skipTo(0) on the NearSpans which succeeds to match doc 0.
          Then it tries skipTo(1) on the NearSpans, which should succeed, but fails, because
          NearSpans first does skipTo(1) on the Spans for the query terms,
          which puts these term spans at
          doc 1: w1 w3 w2
          (as expected) but this does not match because it's not ordered.
          The NearSpans then tries a next() on itself, which starts by doing next() on the term spans
          for w1 in NearSpans.java near line 146:
          more = min().next(); // trigger further scanning
          However, in the ordered case, it should have advanced the first non ordered term,
          here w3, and so it misses the match:
          doc 1: w1 .. w2 w3 ..

          I would recommend to use the alternative NearSpans from LUCENE 413 mentioned above
          to fix this. The NearSpansOrdered there differs from the current version in that it does not
          match overlapping subspans, but it passes all current test cases including TestNearSpans here.
          Overlaps between Spans can occur when SpanNearQueries are nested and/or when multiple
          terms are indexed on the same position.
          In case this ordered non overlapping matching becomes an issue, it can always be fixed later.
          The NearSpansUnordered there is just like the current NearSpans, only simplified, and this
          matches overlapping subspans.

          Show
          Paul Elschot added a comment - Hoss, I'm afraid you've uncovered a bug in NearSpans.java for the ordered case. The test case testNearSpansSkipToLikeNext() uses this test data: doc 0: w1 w2 w3 .. .. doc 1: w1 w3 w2 w3 .. and an ordered SpanNearQuery with slop 1 for "w1 w2 w3" should match doc 0 and doc 1 The test first does a skipTo(0) on the NearSpans which succeeds to match doc 0. Then it tries skipTo(1) on the NearSpans, which should succeed, but fails, because NearSpans first does skipTo(1) on the Spans for the query terms, which puts these term spans at doc 1: w1 w3 w2 (as expected) but this does not match because it's not ordered. The NearSpans then tries a next() on itself, which starts by doing next() on the term spans for w1 in NearSpans.java near line 146: more = min().next(); // trigger further scanning However, in the ordered case, it should have advanced the first non ordered term, here w3, and so it misses the match: doc 1: w1 .. w2 w3 .. I would recommend to use the alternative NearSpans from LUCENE 413 mentioned above to fix this. The NearSpansOrdered there differs from the current version in that it does not match overlapping subspans, but it passes all current test cases including TestNearSpans here. Overlaps between Spans can occur when SpanNearQueries are nested and/or when multiple terms are indexed on the same position. In case this ordered non overlapping matching becomes an issue, it can always be fixed later. The NearSpansUnordered there is just like the current NearSpans, only simplified, and this matches overlapping subspans.
          Hide
          Hoss Man added a comment -

          I tried to make sense of the existing NearSpans implimentation over the weekend ... i did not succeed. I still haven't had a cahnce to look at the new one in LUCENE-413 but i wnat to clarify something you said..

          >>> The NearSpansOrdered there differs from the current version in that it does not
          >>> match overlapping subspans, but it passes all current test cases including TestNearSpans here.

          ...should I understand you to mean then that the current implimentaion of NearSpans does work correctly with overlapping sub-spans ... there just isnt' a test for it?

          that seems like important enough behavior that we wouldn't want to break it to fix this bug.

          Even if matching on overlapping subspans wasn't an intentional feature of NearSpans – the fact that it currently works and the documentation is silent on the issue suggests to me that it should remain supported.

          Show
          Hoss Man added a comment - I tried to make sense of the existing NearSpans implimentation over the weekend ... i did not succeed. I still haven't had a cahnce to look at the new one in LUCENE-413 but i wnat to clarify something you said.. >>> The NearSpansOrdered there differs from the current version in that it does not >>> match overlapping subspans, but it passes all current test cases including TestNearSpans here. ...should I understand you to mean then that the current implimentaion of NearSpans does work correctly with overlapping sub-spans ... there just isnt' a test for it? that seems like important enough behavior that we wouldn't want to break it to fix this bug. Even if matching on overlapping subspans wasn't an intentional feature of NearSpans – the fact that it currently works and the documentation is silent on the issue suggests to me that it should remain supported.
          Hide
          Paul Elschot added a comment -

          > I tried to make sense of the existing NearSpans implimentation over the weekend ... i did not succeed.
          > I still haven't had a cahnce to look at the new one in LUCENE-413 but i wnat to clarify something you said..

          For the unordered case the priority queue implementation over the subspans in the current NearSpans is fine.
          For the ordered case I could not figure out how to deal with the priority queue and the restriction on
          ordering at the same time. This is precisely what the bug above shows.

          > >>> The NearSpansOrdered there differs from the current version in that it does not
          > >>> match overlapping subspans, but it passes all current test cases including TestNearSpans here.
          >
          > ...should I understand you to mean then that the current implimentaion of NearSpans does work
          > correctly with overlapping sub-spans ... there just isnt' a test for it?

          For ordered queries, it might work with overlapping sub-spans on some cases.
          However, I'd expect any test to run into the bug above for some other ordered cases.

          > that seems like important enough behavior that we wouldn't want to break it to fix this bug.

          Given the bug, I hope nothing depends on it.

          > Even if matching on overlapping subspans wasn't an intentional feature of NearSpans – the fact that it
          > currently works and the documentation is silent on the issue suggests to me that it should remain supported.

          That can probably be done by modifying the NearSpansOrdered of LUCENE-413 at lines 133-138 and at
          line 167 where the end of the previous (possibly matching) subspans is compared to the start of the next one.
          This could compare the start with the start instead.
          I don't know what precisely is the intended behaviour, so I can't say whether these changed comparisons
          should allow equality or not. Perhaps the ends should be compared when the starts are equal,
          just like it is done in the priority queue for the unordered case.

          Show
          Paul Elschot added a comment - > I tried to make sense of the existing NearSpans implimentation over the weekend ... i did not succeed. > I still haven't had a cahnce to look at the new one in LUCENE-413 but i wnat to clarify something you said.. For the unordered case the priority queue implementation over the subspans in the current NearSpans is fine. For the ordered case I could not figure out how to deal with the priority queue and the restriction on ordering at the same time. This is precisely what the bug above shows. > >>> The NearSpansOrdered there differs from the current version in that it does not > >>> match overlapping subspans, but it passes all current test cases including TestNearSpans here. > > ...should I understand you to mean then that the current implimentaion of NearSpans does work > correctly with overlapping sub-spans ... there just isnt' a test for it? For ordered queries, it might work with overlapping sub-spans on some cases. However, I'd expect any test to run into the bug above for some other ordered cases. > that seems like important enough behavior that we wouldn't want to break it to fix this bug. Given the bug, I hope nothing depends on it. > Even if matching on overlapping subspans wasn't an intentional feature of NearSpans – the fact that it > currently works and the documentation is silent on the issue suggests to me that it should remain supported. That can probably be done by modifying the NearSpansOrdered of LUCENE-413 at lines 133-138 and at line 167 where the end of the previous (possibly matching) subspans is compared to the start of the next one. This could compare the start with the start instead. I don't know what precisely is the intended behaviour, so I can't say whether these changed comparisons should allow equality or not. Perhaps the ends should be compared when the starts are equal, just like it is done in the priority queue for the unordered case.
          Hide
          Hoss Man added a comment -

          Attitional test case patch that should work if this bug is fixed. (orriginally from LUCENE-557 but uncommited)

          Show
          Hoss Man added a comment - Attitional test case patch that should work if this bug is fixed. (orriginally from LUCENE-557 but uncommited)
          Hide
          Paul Elschot added a comment -

          TestSpanExplanations (latest) with the SpanScorer.explain.testcase.patch passes here,
          but my intermediate version of NearSpansOrdered.java does not pass TestSpans yet.

          Show
          Paul Elschot added a comment - TestSpanExplanations (latest) with the SpanScorer.explain.testcase.patch passes here, but my intermediate version of NearSpansOrdered.java does not pass TestSpans yet.
          Hide
          Paul Elschot added a comment -

          These two attachments (NearSpansOrdered and NearSpansUnordered), together with the near spans patch at the issue indicated above, pass all tests here, including the patched spans explanation test.

          The attachments share the code that defines the ordering of spans.

          Some minor refactoring will probably be needed.

          Show
          Paul Elschot added a comment - These two attachments (NearSpansOrdered and NearSpansUnordered), together with the near spans patch at the issue indicated above, pass all tests here, including the patched spans explanation test. The attachments share the code that defines the ordering of spans. Some minor refactoring will probably be needed.
          Hide
          Paul Elschot added a comment -

          common-build.assertions.patch enables assertions in org.apache.lucene during the junit tests.

          NearSpansOrdered.java is refactored from the previous version of 2 days ago.
          It contains some assert statements.

          All tests pass here as before with these two attachments.

          Show
          Paul Elschot added a comment - common-build.assertions.patch enables assertions in org.apache.lucene during the junit tests. NearSpansOrdered.java is refactored from the previous version of 2 days ago. It contains some assert statements. All tests pass here as before with these two attachments.
          Hide
          Paul Elschot added a comment -

          The patch to SpanNearQuery at the other issue still uses NearSpans.
          Using the patch attached here, NearSpans is not used anymore.

          Show
          Paul Elschot added a comment - The patch to SpanNearQuery at the other issue still uses NearSpans. Using the patch attached here, NearSpans is not used anymore.
          Hide
          Paul Elschot added a comment -

          NearSpans20060903.patch combines the changes for the bug fix on the
          following files against revision 439747 (today):

          src/java/org/apache/lucene/search/spans/NearSpansOrdered.java, added
          src/java/org/apache/lucene/search/spans/NearSpansUnordered.java, added
          src/java/org/apache/lucene/search/spans/SpanNearQuery.java, changed
          src/test/org/apache/lucene/search/spans/TestNearSpans.java, added
          src/java/org/apache/lucene/search/spans/NearSpans.java, deleted.

          This obsoletes the patch to SpanNearQuery.java.
          (It leaves out one line of commented code that is in the separate patch for SpanNearQuery.java .)

          The added NearSpansUnordered is a simple specialisation of the deleted
          NearSpans for the unordered case.

          Regards,
          Paul Elschot

          Show
          Paul Elschot added a comment - NearSpans20060903.patch combines the changes for the bug fix on the following files against revision 439747 (today): src/java/org/apache/lucene/search/spans/NearSpansOrdered.java, added src/java/org/apache/lucene/search/spans/NearSpansUnordered.java, added src/java/org/apache/lucene/search/spans/SpanNearQuery.java, changed src/test/org/apache/lucene/search/spans/TestNearSpans.java, added src/java/org/apache/lucene/search/spans/NearSpans.java, deleted. This obsoletes the patch to SpanNearQuery.java. (It leaves out one line of commented code that is in the separate patch for SpanNearQuery.java .) The added NearSpansUnordered is a simple specialisation of the deleted NearSpans for the unordered case. Regards, Paul Elschot
          Hide
          Hoss Man added a comment -

          Okay, I've attached a cosolidation of SpanScorer.explain.testcase.patch and NearSpans20060903.patch, with a few small modifications...

          1) renamed TestNearSpans.java to TestNearSpansOrdered.java
          2) renamed some tst* methods test* (I think Paul had disabled them for faster testing
          since they weren't broken, but they provide good coverage)
          3) removed some javadocs from NearSpansOrdered so they are inherited from Spans
          (they were the same)

          I can't say I completely understand everything going on in all of this code ... but it passes all of the existing unit tests (as well as the new ones) and I trust Paul; so in the spirit of "be bold" I'll commit these tomorow unless anyone objects.

          To recap the major issues for people who may not be familiar with this issue...

          • The existing NearSpans, which used in an "inOrder" SpanNear query, has a bug
            where it fails to recognize ordered spans if they overlap with unordered spans.
          • Paul has proposed a fix by eliminating the NearSpans class and replacing it with
            both NearSpansOrdered and NearSpansUnordered - SpanNearQuery will pick the
            appropriate class when getSpans is called.
          Show
          Hoss Man added a comment - Okay, I've attached a cosolidation of SpanScorer.explain.testcase.patch and NearSpans20060903.patch, with a few small modifications... 1) renamed TestNearSpans.java to TestNearSpansOrdered.java 2) renamed some tst* methods test* (I think Paul had disabled them for faster testing since they weren't broken, but they provide good coverage) 3) removed some javadocs from NearSpansOrdered so they are inherited from Spans (they were the same) I can't say I completely understand everything going on in all of this code ... but it passes all of the existing unit tests (as well as the new ones) and I trust Paul; so in the spirit of "be bold" I'll commit these tomorow unless anyone objects. To recap the major issues for people who may not be familiar with this issue... The existing NearSpans, which used in an "inOrder" SpanNear query, has a bug where it fails to recognize ordered spans if they overlap with unordered spans. Paul has proposed a fix by eliminating the NearSpans class and replacing it with both NearSpansOrdered and NearSpansUnordered - SpanNearQuery will pick the appropriate class when getSpans is called.
          Hide
          Paul Elschot added a comment -

          Hoss, thanks for the modifications. The tst* methods were indeed renamed from test*.

          Is there a way to invoke an ant test indicating the testcase and the testmethod to be called?

          Show
          Paul Elschot added a comment - Hoss, thanks for the modifications. The tst* methods were indeed renamed from test*. Is there a way to invoke an ant test indicating the testcase and the testmethod to be called?
          Hide
          Hoss Man added a comment -

          Thanks again for figuring all of this out Paul.

          Show
          Hoss Man added a comment - Thanks again for figuring all of this out Paul.
          Hide
          Doron Cohen added a comment -

          It seems that having "assert()" in NearSpanOrdered.java now required Java 1.5 in order to compile Lucene. This would require 1.5 for running Lucene. Do we want to include this now?

          Show
          Doron Cohen added a comment - It seems that having "assert()" in NearSpanOrdered.java now required Java 1.5 in order to compile Lucene. This would require 1.5 for running Lucene. Do we want to include this now?
          Hide
          Steven Parkes added a comment -

          I don't think so. And I'm glad someone is checking. Got a patch, Doron?

          Show
          Steven Parkes added a comment - I don't think so. And I'm glad someone is checking. Got a patch, Doron?
          Hide
          Doron Cohen added a comment -

          Chris Hostetter wrote:

          > Really? ... the build.xml currently sets the javac -source and -target to
          > 1.4 so if that were true i would except it to have failed, and the
          > documentation for J2SE 1.4.2 indicates that assertion support exists in
          > 1.4.
          >
          > while writting this i attempted an "ant clean test" using Java 1.4 and
          > everything seemed to work fine.

          You are right Chris, my mistake, compilation passed for me with 1.5 but failed with 1.4 so I assumed this was the case, but apparently for 1.4 I had 1.3 for the source compatibility (in Eclipse). I changed to 1.4 and it passed with no problems.

          Sorry for this noise,
          Doron

          Show
          Doron Cohen added a comment - Chris Hostetter wrote: > Really? ... the build.xml currently sets the javac -source and -target to > 1.4 so if that were true i would except it to have failed, and the > documentation for J2SE 1.4.2 indicates that assertion support exists in > 1.4. > > while writting this i attempted an "ant clean test" using Java 1.4 and > everything seemed to work fine. You are right Chris, my mistake, compilation passed for me with 1.5 but failed with 1.4 so I assumed this was the case, but apparently for 1.4 I had 1.3 for the source compatibility (in Eclipse). I changed to 1.4 and it passed with no problems. Sorry for this noise, Doron

            People

            • Assignee:
              Hoss Man
              Reporter:
              Hoss Man
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development