Lucene - Core
  1. Lucene - Core
  2. LUCENE-6373

Complete two phase doc id iteration support for Spans

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Spin off from LUCENE-6308, see comments there from about 23 March 2015.

      1. LUCENE-6373.patch
        52 kB
        Paul Elschot
      2. LUCENE-6373.patch
        48 kB
        Robert Muir
      3. LUCENE-6373-20150426.patch
        2 kB
        Paul Elschot
      4. LUCENE-6373-SpanOr.patch
        19 kB
        Paul Elschot
      5. LUCENE-6737-SpanOr-oneTestFails.patch
        49 kB
        Paul Elschot

        Activity

        Hide
        Paul Elschot added a comment - - edited

        I started on SpanNotQuery, see first patch. This is mostly a cleanup to have an implementation that is better for a DISI subclass.
        This still has asTwoPhaseIterator commented out to make all test cases pass. With this asTwoPhaseIterator some test cases fail, I have not yet investigated why.

        Added, 11 April: I have deleted the patch for SpanNotQuery.

        Show
        Paul Elschot added a comment - - edited I started on SpanNotQuery, see first patch. This is mostly a cleanup to have an implementation that is better for a DISI subclass. This still has asTwoPhaseIterator commented out to make all test cases pass. With this asTwoPhaseIterator some test cases fail, I have not yet investigated why. Added, 11 April: I have deleted the patch for SpanNotQuery.
        Hide
        Robert Muir added a comment -

        Paul have you seen my alternative patch on LUCENE-6394? I think that SpanNot operates conceptually just like SpanPositionCheck (with a lazy iterator on the NOT spans), and folded SpanPositionCheck.PositionCheckSpan API into FilterSpans.

        Show
        Robert Muir added a comment - Paul have you seen my alternative patch on LUCENE-6394 ? I think that SpanNot operates conceptually just like SpanPositionCheck (with a lazy iterator on the NOT spans), and folded SpanPositionCheck.PositionCheckSpan API into FilterSpans.
        Hide
        Paul Elschot added a comment -

        I missed LUCENE-6394, I'll continue the things for SpanNot there.

        Show
        Paul Elschot added a comment - I missed LUCENE-6394 , I'll continue the things for SpanNot there.
        Hide
        Paul Elschot added a comment - - edited

        Patch for SpanOr of 11 April 2015.
        This is an improvement on SpanOr, but there is no two phase doc id set iteration yet.

        Show
        Paul Elschot added a comment - - edited Patch for SpanOr of 11 April 2015. This is an improvement on SpanOr, but there is no two phase doc id set iteration yet.
        Hide
        Paul Elschot added a comment -

        The SpanOr patch of 11 April 2015:

        Introduces DisiPriorityQueue<Iter extends DocIdSetIterator>, which was started from ScorerPriorityQueue.
        The DisiPriorityQueue is used to reimplement the doc id iteration of SpanOr, which now works much like DisjunctionScorer.
        The position iteration for SpanOr is done by a new SpanPositionQueue, which extends PriorityQueue.
        Tests pass with this patch, and I would expect this to improve the speed of SpanOr.

        Still to be done for SpanOr:

        Use DisiPriorityQueue<Scorer> instead of ScorerPriorityQueue.
        Add two phase doc id iteration in SpanOr similar to DisjunctionScorer.
        Specialize SpanPositionQueue similar to DisiPriorityQueue, inline the position comparison function.

        Show
        Paul Elschot added a comment - The SpanOr patch of 11 April 2015: Introduces DisiPriorityQueue<Iter extends DocIdSetIterator>, which was started from ScorerPriorityQueue. The DisiPriorityQueue is used to reimplement the doc id iteration of SpanOr, which now works much like DisjunctionScorer. The position iteration for SpanOr is done by a new SpanPositionQueue, which extends PriorityQueue. Tests pass with this patch, and I would expect this to improve the speed of SpanOr. Still to be done for SpanOr: Use DisiPriorityQueue<Scorer> instead of ScorerPriorityQueue. Add two phase doc id iteration in SpanOr similar to DisjunctionScorer. Specialize SpanPositionQueue similar to DisiPriorityQueue, inline the position comparison function.
        Hide
        Robert Muir added a comment -

        Hi Paul, thanks for tackling the first steps for SpanOr, this is the hardest one

        Maybe we can get Adrien Grand opinions, since he implemented the two-phase support for DisjunctionScorer.
        From my perspective: Can we still keep asTwoPhaseIterator on scorer/spans? its still just an up-front instanceof and moving it to DISI has other implications.

        Show
        Robert Muir added a comment - Hi Paul, thanks for tackling the first steps for SpanOr, this is the hardest one Maybe we can get Adrien Grand opinions, since he implemented the two-phase support for DisjunctionScorer. From my perspective: Can we still keep asTwoPhaseIterator on scorer/spans? its still just an up-front instanceof and moving it to DISI has other implications.
        Hide
        Paul Elschot added a comment - - edited

        Indeed, two phase doc id iteration for SpanOr is not simple. I think I'm getting there though. It needs two tests that I think I have seen before, but I could not find where:
        One is to avoid calling an approximation.match() again when match is accepted. This can be done by keeping the last doc for which approximation.match() returned true.
        The other is to distinguish between approximation+acceptance and normal acceptance of a matching doc. This can be done by keeping the last doc for which twoPhaseCurrentDocMatches returned true.
        This is still cooking, and I don't expect to finish this very soon.

        Can we still keep asTwoPhaseIterator on scorer/spans?

        For the SpanOr patch found I there was a second place that needs an implementation with instanceof checks.
        This second place is DisiWrapper for disjunctions, the existing one is in ConjunctionDISI.
        So I thought about where to put this in a single place, and that place is DocIdSetIterator.
        The patch implementation with instanceof checks is just writing out the inheritance, it would be simpler to just return null by default,
        and leave the rest to inheritance, and/or lave the method abstract.
        Since there are other implications, the question is: is there a better place to put this?
        The situation has changed, it is now not only Scorer but also Spans that have this.

        Show
        Paul Elschot added a comment - - edited Indeed, two phase doc id iteration for SpanOr is not simple. I think I'm getting there though. It needs two tests that I think I have seen before, but I could not find where: One is to avoid calling an approximation.match() again when match is accepted. This can be done by keeping the last doc for which approximation.match() returned true. The other is to distinguish between approximation+acceptance and normal acceptance of a matching doc. This can be done by keeping the last doc for which twoPhaseCurrentDocMatches returned true. This is still cooking, and I don't expect to finish this very soon. Can we still keep asTwoPhaseIterator on scorer/spans? For the SpanOr patch found I there was a second place that needs an implementation with instanceof checks. This second place is DisiWrapper for disjunctions, the existing one is in ConjunctionDISI. So I thought about where to put this in a single place, and that place is DocIdSetIterator. The patch implementation with instanceof checks is just writing out the inheritance, it would be simpler to just return null by default, and leave the rest to inheritance, and/or lave the method abstract. Since there are other implications, the question is: is there a better place to put this? The situation has changed, it is now not only Scorer but also Spans that have this.
        Hide
        Robert Muir added a comment -

        Well, it might be that both Scorer and Spans have it (temporarily), but i don't think it should be on DISI since ultimately it does not belong there? It doesn't make sense for all other subclasses of DISI (e.g. postings) and it doesnt make sense to be recursive.

        And if the long term goal is to merge the functionality of Spans into Scorer anyway, then i'd prefer the instanceof check for now, rather than adding it to DISI, then having to take it away from DISI and put back into Scorer later.

        Show
        Robert Muir added a comment - Well, it might be that both Scorer and Spans have it (temporarily), but i don't think it should be on DISI since ultimately it does not belong there? It doesn't make sense for all other subclasses of DISI (e.g. postings) and it doesnt make sense to be recursive. And if the long term goal is to merge the functionality of Spans into Scorer anyway, then i'd prefer the instanceof check for now, rather than adding it to DISI, then having to take it away from DISI and put back into Scorer later.
        Hide
        Robert Muir added a comment -

        Indeed, two phase doc id iteration for SpanOr is not simple. I think I'm getting there though. It needs two tests that I think I have seen before, but I could not find where:
        One is to avoid calling an approximation.match() again when match is accepted. This can be done by keeping the last doc for which approximation.match() returned true.
        The other is to distinguish between approximation+acceptance and normal acceptance of a matching doc. This can be done by keeping the last doc for which twoPhaseCurrentDocMatches returned true.
        This is still cooking, and I don't expect to finish this very soon.

        I had the same problems with SpanNot, and it was too hard for me to debug them with the current tests.

        Since LUCENE-6411, we have tons more assertions used by TestSpanSearchEquivalence. It wraps all parts of the span-query-tree in AssertingSpanQuery which means we have consistent checks on every operation and it makes bugs easier to track down since the stacktrace usually points to the offending code.

        It also wraps the two-phase iterator and ensures you don't call matches() twice:

            @Override
            public boolean matches() throws IOException {
              if (approximation.docID() == -1 || approximation.docID() == DocIdSetIterator.NO_MORE_DOCS) {
                throw new AssertionError("matches() should not be called on doc ID " + approximation.docID());
              }
              if (lastDoc == approximation.docID()) {
                throw new AssertionError("matches() has been called twice on doc ID " + approximation.docID());
              }
              // ... more checks
        

        that you don't call nextStartPosition() if you didn't call matches(), or if it returned false, and so on:

          @Override
          public int nextStartPosition() throws IOException {
            assert state != State.DOC_START : "invalid position access, state=" + state + ": " + in;
            assert state != State.DOC_FINISHED : "invalid position access, state=" + state + ": " + in;
            assert state != State.DOC_UNVERIFIED : "invalid position access, state=" + state + ": " + in;
            // ... more checks
        
        Show
        Robert Muir added a comment - Indeed, two phase doc id iteration for SpanOr is not simple. I think I'm getting there though. It needs two tests that I think I have seen before, but I could not find where: One is to avoid calling an approximation.match() again when match is accepted. This can be done by keeping the last doc for which approximation.match() returned true. The other is to distinguish between approximation+acceptance and normal acceptance of a matching doc. This can be done by keeping the last doc for which twoPhaseCurrentDocMatches returned true. This is still cooking, and I don't expect to finish this very soon. I had the same problems with SpanNot, and it was too hard for me to debug them with the current tests. Since LUCENE-6411 , we have tons more assertions used by TestSpanSearchEquivalence. It wraps all parts of the span-query-tree in AssertingSpanQuery which means we have consistent checks on every operation and it makes bugs easier to track down since the stacktrace usually points to the offending code. It also wraps the two-phase iterator and ensures you don't call matches() twice: @Override public boolean matches() throws IOException { if (approximation.docID() == -1 || approximation.docID() == DocIdSetIterator.NO_MORE_DOCS) { throw new AssertionError( "matches() should not be called on doc ID " + approximation.docID()); } if (lastDoc == approximation.docID()) { throw new AssertionError( "matches() has been called twice on doc ID " + approximation.docID()); } // ... more checks that you don't call nextStartPosition() if you didn't call matches(), or if it returned false, and so on: @Override public int nextStartPosition() throws IOException { assert state != State.DOC_START : "invalid position access, state=" + state + ": " + in; assert state != State.DOC_FINISHED : "invalid position access, state=" + state + ": " + in; assert state != State.DOC_UNVERIFIED : "invalid position access, state=" + state + ": " + in; // ... more checks
        Hide
        Adrien Grand added a comment -

        The patch looks like a good start, it's good to see SpanOr and disjunctions looking similar.

        //final long cost; //FIXME: needed?

        This is needed by the min-should-match scorer in order to keep track of scorers which are behind the current document and advance the least-costly instance first.

        Also a minor nitpick but I think we should move SpanPositionQueue to either its own java file or as an inner class of SpanOrQuery in order not to confuse incremental compilation.

        Maybe "Specialize SpanPositionQueue similar to DisiPriorityQueue, inline the position comparison function." could be delayed to another issue in order to keep this one small?

        Show
        Adrien Grand added a comment - The patch looks like a good start, it's good to see SpanOr and disjunctions looking similar. //final long cost; //FIXME: needed? This is needed by the min-should-match scorer in order to keep track of scorers which are behind the current document and advance the least-costly instance first. Also a minor nitpick but I think we should move SpanPositionQueue to either its own java file or as an inner class of SpanOrQuery in order not to confuse incremental compilation. Maybe "Specialize SpanPositionQueue similar to DisiPriorityQueue, inline the position comparison function." could be delayed to another issue in order to keep this one small?
        Hide
        Paul Elschot added a comment -

        Robert,

        And if the long term goal is to merge the functionality of Spans into Scorer anyway,

        I think that we need some scoring mechanism added to Spans, and that completely merging Spans with Scorer will not happen. But we'll see.

        then i'd prefer the instanceof check for now, rather than adding it to DISI, then having to take it away from DISI and put back into Scorer later.

        I don't mind the instanceof checks, but I do mind having the same code for it twice.
        Shall I put it in an in between class TwoPhaseDISI (superclass of Scorer and Spans) so other DISI subclasses are not involved?
        That would continue the merging

        Show
        Paul Elschot added a comment - Robert, And if the long term goal is to merge the functionality of Spans into Scorer anyway, I think that we need some scoring mechanism added to Spans, and that completely merging Spans with Scorer will not happen. But we'll see. then i'd prefer the instanceof check for now, rather than adding it to DISI, then having to take it away from DISI and put back into Scorer later. I don't mind the instanceof checks, but I do mind having the same code for it twice. Shall I put it in an in between class TwoPhaseDISI (superclass of Scorer and Spans) so other DISI subclasses are not involved? That would continue the merging
        Hide
        Paul Elschot added a comment -

        Adrien,

         //final long cost; //FIXME: needed?

        This is needed by the min-should-match scorer ...

        I ran into that when I deleted ScorerPriorityQueue/ScorerWrapper and replaced them DisiPriorityQueue<Scorer> / DisiWrapper<Scorer>.
        I'll introduce a subclass of DisiWrapper<Scorer> for the min should match scorer, and a subclass of DisiWrapper<Spans> for SpanOr.

        I'll put SpanPositionQueue in a separate java file.

        Show
        Paul Elschot added a comment - Adrien, // final long cost; //FIXME: needed? This is needed by the min-should-match scorer ... I ran into that when I deleted ScorerPriorityQueue/ScorerWrapper and replaced them DisiPriorityQueue<Scorer> / DisiWrapper<Scorer>. I'll introduce a subclass of DisiWrapper<Scorer> for the min should match scorer, and a subclass of DisiWrapper<Spans> for SpanOr. I'll put SpanPositionQueue in a separate java file.
        Hide
        Paul Elschot added a comment -

        Patch of 14 April 2015, one test method fails:

        ant test -Dtestcase=TestBasics -Dtests.method=testSpanComplex1 -Dtests.seed=BC9E11D8F8BBB280 -Dtests.slow=true

        And sometimes the TestBasics test case passes when no tests.seed is given.

        Show
        Paul Elschot added a comment - Patch of 14 April 2015, one test method fails: ant test -Dtestcase=TestBasics -Dtests.method=testSpanComplex1 -Dtests.seed=BC9E11D8F8BBB280 -Dtests.slow=true And sometimes the TestBasics test case passes when no tests.seed is given.
        Hide
        Paul Elschot added a comment -

        I'm posting this because it seems to be almost there, I'll have very little time in the upcoming weeks, and to show that the asserting spans of LUCENE-6411 nicely catch this.

        The patch includes the use of DisiPriorityQueue<Scorer> instead of ScorerPriorityQueue, otherwise there are no changes in the class hierarchy (DocIdSetIterator, DisiWrapper) compared to the previous patch.

        The stack trace for the failing test is:

           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBasics -Dtests.method=testSpanComplex1 -Dtests.seed=BC9E11D8F8BBB280 -Dtests.slow=true -Dtests.locale=in_ID -Dtests.timezone=Australia/Sydney -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
           [junit4] FAILURE 0.09s | TestBasics.testSpanComplex1 <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: invalid position access, state=DOC_UNVERIFIED: NearSpansOrdered(spanNear([AssertingSpanQuery(field:seven), AssertingSpanQuery(field:hundred)], 0, true))@0: -1 - -1
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([BC9E11D8F8BBB280:9C85EFEAFD8D21EE]:0)
           [junit4]    > 	at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:80)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$1.fillPositionQueue(SpanOrQuery.java:285)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$1.nextStartPosition(SpanOrQuery.java:300)
           [junit4]    > 	at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:86)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpansOrdered.subSpansToFirstStartPosition(NearSpansOrdered.java:133)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpansOrdered.toMatchDoc(NearSpansOrdered.java:65)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpans.nextDoc(NearSpans.java:65)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpansOrdered.nextDoc(NearSpansOrdered.java:48)
           [junit4]    > 	at org.apache.lucene.search.spans.AssertingSpans.nextDoc(AssertingSpans.java:151)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanScorer.nextDoc(SpanScorer.java:51)
           [junit4]    > 	at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192)
           [junit4]    > 	at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:164)
           [junit4]    > 	at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
           [junit4]    > 	at org.apache.lucene.search.AssertingBulkScorer.score(AssertingBulkScorer.java:69)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:563)
           [junit4]    > 	at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:102)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:367)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:344)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:355)
           [junit4]    > 	at org.apache.lucene.search.CheckHits.checkHits(CheckHits.java:170)
           [junit4]    > 	at org.apache.lucene.search.spans.TestBasics.checkHits(TestBasics.java:408)
           [junit4]    > 	at org.apache.lucene.search.spans.TestBasics.testSpanComplex1(TestBasics.java:398)
        
        Show
        Paul Elschot added a comment - I'm posting this because it seems to be almost there, I'll have very little time in the upcoming weeks, and to show that the asserting spans of LUCENE-6411 nicely catch this. The patch includes the use of DisiPriorityQueue<Scorer> instead of ScorerPriorityQueue, otherwise there are no changes in the class hierarchy (DocIdSetIterator, DisiWrapper) compared to the previous patch. The stack trace for the failing test is: [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBasics -Dtests.method=testSpanComplex1 -Dtests.seed=BC9E11D8F8BBB280 -Dtests.slow= true -Dtests.locale=in_ID -Dtests.timezone=Australia/Sydney -Dtests.asserts= true -Dtests.file.encoding=US-ASCII [junit4] FAILURE 0.09s | TestBasics.testSpanComplex1 <<< [junit4] > Throwable #1: java.lang.AssertionError: invalid position access, state=DOC_UNVERIFIED: NearSpansOrdered(spanNear([AssertingSpanQuery(field:seven), AssertingSpanQuery(field:hundred)], 0, true ))@0: -1 - -1 [junit4] > at __randomizedtesting.SeedInfo.seed([BC9E11D8F8BBB280:9C85EFEAFD8D21EE]:0) [junit4] > at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:80) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$1.fillPositionQueue(SpanOrQuery.java:285) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$1.nextStartPosition(SpanOrQuery.java:300) [junit4] > at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:86) [junit4] > at org.apache.lucene.search.spans.NearSpansOrdered.subSpansToFirstStartPosition(NearSpansOrdered.java:133) [junit4] > at org.apache.lucene.search.spans.NearSpansOrdered.toMatchDoc(NearSpansOrdered.java:65) [junit4] > at org.apache.lucene.search.spans.NearSpans.nextDoc(NearSpans.java:65) [junit4] > at org.apache.lucene.search.spans.NearSpansOrdered.nextDoc(NearSpansOrdered.java:48) [junit4] > at org.apache.lucene.search.spans.AssertingSpans.nextDoc(AssertingSpans.java:151) [junit4] > at org.apache.lucene.search.spans.SpanScorer.nextDoc(SpanScorer.java:51) [junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:192) [junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:164) [junit4] > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35) [junit4] > at org.apache.lucene.search.AssertingBulkScorer.score(AssertingBulkScorer.java:69) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:563) [junit4] > at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:102) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:367) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:486) [junit4] > at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:344) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:355) [junit4] > at org.apache.lucene.search.CheckHits.checkHits(CheckHits.java:170) [junit4] > at org.apache.lucene.search.spans.TestBasics.checkHits(TestBasics.java:408) [junit4] > at org.apache.lucene.search.spans.TestBasics.testSpanComplex1(TestBasics.java:398)
        Hide
        Paul Elschot added a comment -

        To be complete, the patch is against trunk with the last commit less than a day ago from SOLR-6692 ... "only count match snippets",
        git sha1 id: 110eccd07f5d2c95b0f6c49f7d9224ed2e8ccfe1

        Show
        Paul Elschot added a comment - To be complete, the patch is against trunk with the last commit less than a day ago from SOLR-6692 ... "only count match snippets", git sha1 id: 110eccd07f5d2c95b0f6c49f7d9224ed2e8ccfe1
        Hide
        Robert Muir added a comment -

        Paul, I'll apply the patch and dig into it a bit. Thanks for the work here.

        Show
        Robert Muir added a comment - Paul, I'll apply the patch and dig into it a bit. Thanks for the work here.
        Hide
        Robert Muir added a comment -

        Here is another seed from another test, find the same failure:

           [junit4] Suite: org.apache.lucene.search.spans.TestSpanSearchEquivalence
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestSpanSearchEquivalence -Dtests.method=testSpanOrVersusBooleanNear -Dtests.seed=2C8D07E17862417A -Dtests.locale=en_MT -Dtests.timezone=Pacific/Rarotonga -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] FAILURE 0.48s | TestSpanSearchEquivalence.testSpanOrVersusBooleanNear <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: invalid position access, state=DOC_UNVERIFIED: NearSpansOrdered(spanNear([AssertingSpanQuery(field:f), AssertingSpanQuery(field:q)], 10, true))@0: -1 - -1
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([2C8D07E17862417A:E42EBF56AD5D2367]:0)
           [junit4]    > 	at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:80)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$1.fillPositionQueue(SpanOrQuery.java:285)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$1.nextStartPosition(SpanOrQuery.java:300)
        

        I will dig in a little bit more. I ran into a similar problem before with SpanNot, maybe this is a similar one, that happened like this:

        1. sub-approximation is positioned on docid 1
        2. we call sub.advance(2) and it returns 3
        3. <we do other processing and then move forward to doc 3, but we never called matches()>

        So in that case we needed some state to be able to lazily call matches() once and only once when actually needed. SpanNot uses this state for its sub, which records the last document we called matches() for and the return value that matches() had.

              int lastApproxDoc = -1;
              boolean lastApproxResult = false;
        

        Maybe this one is a similar thing, i will look more.

        Show
        Robert Muir added a comment - Here is another seed from another test, find the same failure: [junit4] Suite: org.apache.lucene.search.spans.TestSpanSearchEquivalence [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestSpanSearchEquivalence -Dtests.method=testSpanOrVersusBooleanNear -Dtests.seed=2C8D07E17862417A -Dtests.locale=en_MT -Dtests.timezone=Pacific/Rarotonga -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] FAILURE 0.48s | TestSpanSearchEquivalence.testSpanOrVersusBooleanNear <<< [junit4] > Throwable #1: java.lang.AssertionError: invalid position access, state=DOC_UNVERIFIED: NearSpansOrdered(spanNear([AssertingSpanQuery(field:f), AssertingSpanQuery(field:q)], 10, true))@0: -1 - -1 [junit4] > at __randomizedtesting.SeedInfo.seed([2C8D07E17862417A:E42EBF56AD5D2367]:0) [junit4] > at org.apache.lucene.search.spans.AssertingSpans.nextStartPosition(AssertingSpans.java:80) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$1.fillPositionQueue(SpanOrQuery.java:285) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$1.nextStartPosition(SpanOrQuery.java:300) I will dig in a little bit more. I ran into a similar problem before with SpanNot, maybe this is a similar one, that happened like this: sub-approximation is positioned on docid 1 we call sub.advance(2) and it returns 3 <we do other processing and then move forward to doc 3, but we never called matches()> So in that case we needed some state to be able to lazily call matches() once and only once when actually needed. SpanNot uses this state for its sub, which records the last document we called matches() for and the return value that matches() had. int lastApproxDoc = -1; boolean lastApproxResult = false ; Maybe this one is a similar thing, i will look more.
        Hide
        Robert Muir added a comment -

        OK i found the bug. sneaky one. in DISIWrapper this is my fix:

          public int lastApproxMatchDoc = -2; // last doc of approximation that did match
          public int lastApproxNonMatchDoc = -2; // last doc of approximation that did not match
        

        Previously only one of these was initialized, the other was initialized to zero. So this could cause problems for docid 0.

        Show
        Robert Muir added a comment - OK i found the bug. sneaky one. in DISIWrapper this is my fix: public int lastApproxMatchDoc = -2; // last doc of approximation that did match public int lastApproxNonMatchDoc = -2; // last doc of approximation that did not match Previously only one of these was initialized, the other was initialized to zero. So this could cause problems for docid 0.
        Hide
        Robert Muir added a comment -

        Paul's patch just with the above changes to fix the test failure.

        I will look at benchmarking, API changes, and beasting tests to move this forward. Thanks Paul!

        Show
        Robert Muir added a comment - Paul's patch just with the above changes to fix the test failure. I will look at benchmarking, API changes, and beasting tests to move this forward. Thanks Paul!
        Hide
        Paul Elschot added a comment -

        Great, it was indeed almost there.
        Before benchmarking there is another (small) optimization possible: topPositionSpans can be put at the object level and maintained together with byPositionQueue, and then it can also be tested for null in start/endPosition.

        Show
        Paul Elschot added a comment - Great, it was indeed almost there. Before benchmarking there is another (small) optimization possible: topPositionSpans can be put at the object level and maintained together with byPositionQueue, and then it can also be tested for null in start/endPosition.
        Hide
        Paul Elschot added a comment -

        Patch of 20 April 2015. This includes the fix by Robert in SpanOr, and:
        is based on todays trunk,
        has topPositionSpans is at object level,
        renames top to topDocSpans,
        adds a check for topPositionSpan==null in the payload methods,
        has SpanPositionQueue in its own java file,
        adds method public static TwoPhaseIterator.asTwoPhaseIterator, instead of DocIdSeterator.asTwoPhaseIterator.

        Show
        Paul Elschot added a comment - Patch of 20 April 2015. This includes the fix by Robert in SpanOr, and: is based on todays trunk, has topPositionSpans is at object level, renames top to topDocSpans, adds a check for topPositionSpan==null in the payload methods, has SpanPositionQueue in its own java file, adds method public static TwoPhaseIterator.asTwoPhaseIterator, instead of DocIdSeterator.asTwoPhaseIterator.
        Hide
        Paul Elschot added a comment -

        Specialized version of SpanPositionQueue. This has a private static boolean lessThan(Spans,Spans) which might be good enough in combination with the JIT, and unused methods from PriorityQueue are removed.

        Show
        Paul Elschot added a comment - Specialized version of SpanPositionQueue. This has a private static boolean lessThan(Spans,Spans) which might be good enough in combination with the JIT, and unused methods from PriorityQueue are removed.
        Hide
        Robert Muir added a comment -

        +1 for this patch, it is very nice. I love that now spans share both conjunction and disjunction logic and support two-phase execution everywhere.

        I benchmarked and found no performance regressions with a variety of tasks, like filter intersection with disjunctions, minshouldmatch queries, etc.

        I created a contrived benchmark for SpanOr, "spanMultiPhrase" which just reuses the same tasks file from LUCENE-6421, but instead executes as SpanNear(SpanOr, ...). It takes 15 minutes per iteration, and isn't the greatest benchmark but so far i see up to 10% improvement in some cases for that benchmark.

        I will do some final tests and plan to commit this soon.

        Show
        Robert Muir added a comment - +1 for this patch, it is very nice. I love that now spans share both conjunction and disjunction logic and support two-phase execution everywhere. I benchmarked and found no performance regressions with a variety of tasks, like filter intersection with disjunctions, minshouldmatch queries, etc. I created a contrived benchmark for SpanOr, "spanMultiPhrase" which just reuses the same tasks file from LUCENE-6421 , but instead executes as SpanNear(SpanOr, ...). It takes 15 minutes per iteration, and isn't the greatest benchmark but so far i see up to 10% improvement in some cases for that benchmark. I will do some final tests and plan to commit this soon.
        Hide
        ASF subversion and git services added a comment -

        Commit 1675776 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1675776 ]

        LUCENE-6373: complete two phase doc id iteration support for Spans

        Show
        ASF subversion and git services added a comment - Commit 1675776 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1675776 ] LUCENE-6373 : complete two phase doc id iteration support for Spans
        Hide
        ASF subversion and git services added a comment -

        Commit 1675780 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1675780 ]

        LUCENE-6373: complete two phase doc id iteration support for Spans

        Show
        ASF subversion and git services added a comment - Commit 1675780 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1675780 ] LUCENE-6373 : complete two phase doc id iteration support for Spans
        Hide
        Robert Muir added a comment -

        Paul, thank you!

        Show
        Robert Muir added a comment - Paul, thank you!
        Hide
        Paul Elschot added a comment -

        My pleasure, thanks for benchmarking.
        A last review found a no more used variable in nextStartPosition(), sorry:

        diff --git a/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java b/lucene/core/src/java/org/apache/l
        index eca3635..9d0d09a 100644
        --- a/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java
        +++ b/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java
        @@ -288,8 +288,6 @@ public class SpanOrQuery extends SpanQuery implements Cloneable {
                 
               @Override
               public int nextStartPosition() throws IOException {
        -        DisiWrapper<Spans> topDocSpans = byDocQueue.top();
        -        assert topDocSpans.doc != NO_MORE_DOCS;
                 if (topPositionSpans == null) {
                   byPositionQueue.clear();
                   fillPositionQueue(); // fills byPositionQueue at first position
        
        Show
        Paul Elschot added a comment - My pleasure, thanks for benchmarking. A last review found a no more used variable in nextStartPosition(), sorry: diff --git a/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java b/lucene/core/src/java/org/apache/l index eca3635..9d0d09a 100644 --- a/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java +++ b/lucene/core/src/java/org/apache/lucene/search/spans/SpanOrQuery.java @@ -288,8 +288,6 @@ public class SpanOrQuery extends SpanQuery implements Cloneable { @Override public int nextStartPosition() throws IOException { - DisiWrapper<Spans> topDocSpans = byDocQueue.top(); - assert topDocSpans.doc != NO_MORE_DOCS; if (topPositionSpans == null ) { byPositionQueue.clear(); fillPositionQueue(); // fills byPositionQueue at first position
        Hide
        Paul Elschot added a comment -

        Patch of 26 April 2015. Includes the removal of the unused variable in the Spans.nextStartPosition() method of SpanOr, and two lines to suppress compilation warnings (unchecked, rawtypes) for the generics used in DisiPriorityQueue and MinShouldMatchSumScorer.

        Show
        Paul Elschot added a comment - Patch of 26 April 2015. Includes the removal of the unused variable in the Spans.nextStartPosition() method of SpanOr, and two lines to suppress compilation warnings (unchecked, rawtypes) for the generics used in DisiPriorityQueue and MinShouldMatchSumScorer.
        Hide
        Paul Elschot added a comment -

        The specialization of SpanPositionQueue is at LUCENE-6453, I just removed the earlier version that I posted here.

        Show
        Paul Elschot added a comment - The specialization of SpanPositionQueue is at LUCENE-6453 , I just removed the earlier version that I posted here.
        Hide
        Paul Elschot added a comment -

        Reopened because of the patch of 26 April 2015.

        Show
        Paul Elschot added a comment - Reopened because of the patch of 26 April 2015.
        Hide
        ASF subversion and git services added a comment -

        Commit 1676464 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1676464 ]

        LUCENE-6373: cleanup unused variable and generic array warnings

        Show
        ASF subversion and git services added a comment - Commit 1676464 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1676464 ] LUCENE-6373 : cleanup unused variable and generic array warnings
        Hide
        ASF subversion and git services added a comment -

        Commit 1676466 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1676466 ]

        LUCENE-6373: cleanup unused variable and generic array warnings

        Show
        ASF subversion and git services added a comment - Commit 1676466 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1676466 ] LUCENE-6373 : cleanup unused variable and generic array warnings
        Hide
        Robert Muir added a comment -

        Thanks for the cleanup patch Paul, I just committed it.

        Show
        Robert Muir added a comment - Thanks for the cleanup patch Paul, I just committed it.

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development