Lucene - Core
  1. Lucene - Core
  2. LUCENE-6393

Add two-phase support to SpanPositionCheckQuery and subclasses

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This hits an assert in SpanScorer because it breaks the javadocs contract of Spans.nextStartPosition():

      • Returns the next start position for the current doc.
      • There is always at least one start/end position per doc.
      • After the last start/end position at the current doc this returns NO_MORE_POSITIONS.
      1. LUCENE-6393.patch
        6 kB
        Robert Muir
      2. LUCENE-6393.patch
        4 kB
        Paul Elschot
      3. LUCENE-6393.patch
        3 kB
        Robert Muir
      4. LUCENE-6393-alternative.patch
        4 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        I think SpanPayloadCheckQuery, SpanPositionRangeQuery, or any other subclasses of SpanPositionCheckQuery are likely impacted.

        Show
        Robert Muir added a comment - I think SpanPayloadCheckQuery, SpanPositionRangeQuery, or any other subclasses of SpanPositionCheckQuery are likely impacted.
        Hide
        Robert Muir added a comment -

        Oops. i resolved the wrong issue on accident.

        Show
        Robert Muir added a comment - Oops. i resolved the wrong issue on accident.
        Hide
        Robert Muir added a comment -

        This is pretty easy to find by just doing a SpanFirstQuery(someNearQuery, 0).

        It only happens when it wraps a span-near query, when it wraps a simple termquery it never does the wrong thing.

        Show
        Robert Muir added a comment - This is pretty easy to find by just doing a SpanFirstQuery(someNearQuery, 0). It only happens when it wraps a span-near query, when it wraps a simple termquery it never does the wrong thing.
        Hide
        ASF subversion and git services added a comment -

        Commit 1671137 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1671137 ]

        LUCENE-6393: add equivalence tests for SpanFirstQuery.

        Show
        ASF subversion and git services added a comment - Commit 1671137 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1671137 ] LUCENE-6393 : add equivalence tests for SpanFirstQuery.
        Hide
        Robert Muir added a comment -

        I committed tests marked with @AwaitsFix to TestSpanSearchEquivalence for now.
        Was working to get those beefed up a little bit and cover more of the spans so we could have more confidence.

        Show
        Robert Muir added a comment - I committed tests marked with @AwaitsFix to TestSpanSearchEquivalence for now. Was working to get those beefed up a little bit and cover more of the spans so we could have more confidence.
        Hide
        ASF subversion and git services added a comment -

        Commit 1671139 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1671139 ]

        LUCENE-6393: add equivalence tests for SpanFirstQuery.

        Show
        ASF subversion and git services added a comment - Commit 1671139 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1671139 ] LUCENE-6393 : add equivalence tests for SpanFirstQuery.
        Hide
        Paul Elschot added a comment -

        The problem is that with two phase iteration neither nextDoc() or advance() is called on SpanPositionCheckQuery.PositionCheckSpans in the testSpanFirstNear test case, leaving the atFirstInCurrentDoc flag in the wrong state.

        Overriding PositionCheckSpans.asTwoPhaseIterator to return null makes the testSpanFirstNear test case pass.

        At the moment I have no idea what the good fix would be.
        Perhaps FilterSpans should return null for its two phase iterator just to be on the safe side.

        Show
        Paul Elschot added a comment - The problem is that with two phase iteration neither nextDoc() or advance() is called on SpanPositionCheckQuery.PositionCheckSpans in the testSpanFirstNear test case, leaving the atFirstInCurrentDoc flag in the wrong state. Overriding PositionCheckSpans.asTwoPhaseIterator to return null makes the testSpanFirstNear test case pass. At the moment I have no idea what the good fix would be. Perhaps FilterSpans should return null for its two phase iterator just to be on the safe side.
        Hide
        Paul Elschot added a comment -

        Because of the Spans contract that there is always at least one start/end position per doc, I considered adding a firstStartPosition method that should be called before the first nextStartPosition call.
        That would reduce the need for the atFirstInCurrentDoc flag that is used in quite a few Spans, and that might help to avoid problems like this one.

        Show
        Paul Elschot added a comment - Because of the Spans contract that there is always at least one start/end position per doc, I considered adding a firstStartPosition method that should be called before the first nextStartPosition call. That would reduce the need for the atFirstInCurrentDoc flag that is used in quite a few Spans, and that might help to avoid problems like this one.
        Hide
        Robert Muir added a comment -

        We can also give SpanPositionCheck two-phase support? I wrote these tests with that in mind. I just noticed the failed before i made any code changes...

        Show
        Robert Muir added a comment - We can also give SpanPositionCheck two-phase support? I wrote these tests with that in mind. I just noticed the failed before i made any code changes...
        Hide
        Robert Muir added a comment -

        Here is a very quickly-done patch, adding two-phase support to SpanPositionCheck. Paul Elschot, mind taking a look?

        Show
        Robert Muir added a comment - Here is a very quickly-done patch, adding two-phase support to SpanPositionCheck. Paul Elschot , mind taking a look?
        Hide
        Robert Muir added a comment -

        The problem is that with two phase iteration neither nextDoc() or advance() is called on SpanPositionCheckQuery.PositionCheckSpans in the testSpanFirstNear test case, leaving the atFirstInCurrentDoc flag in the wrong state.

        Overriding PositionCheckSpans.asTwoPhaseIterator to return null makes the testSpanFirstNear test case pass.

        At the moment I have no idea what the good fix would be.
        Perhaps FilterSpans should return null for its two phase iterator just to be on the safe side.

        Ok, I understand the issue. Another option to consider is to take the implementation I added here and move it into filterspans, adding 'abstract boolean twoPhaseCurrentDocMatches()' to FilterSpans. Then it would have built in approximation support. I just have to look at NearSpansUnordered.SpansCell and figure out what the implementation there should be (maybe just 'return true', i am just not familiar with the code).

        Show
        Robert Muir added a comment - The problem is that with two phase iteration neither nextDoc() or advance() is called on SpanPositionCheckQuery.PositionCheckSpans in the testSpanFirstNear test case, leaving the atFirstInCurrentDoc flag in the wrong state. Overriding PositionCheckSpans.asTwoPhaseIterator to return null makes the testSpanFirstNear test case pass. At the moment I have no idea what the good fix would be. Perhaps FilterSpans should return null for its two phase iterator just to be on the safe side. Ok, I understand the issue. Another option to consider is to take the implementation I added here and move it into filterspans, adding 'abstract boolean twoPhaseCurrentDocMatches()' to FilterSpans. Then it would have built in approximation support. I just have to look at NearSpansUnordered.SpansCell and figure out what the implementation there should be (maybe just 'return true', i am just not familiar with the code).
        Hide
        Robert Muir added a comment -

        Here is that second approach (LUCENE-6393-alternative.patch). I like this because it forces a subclass of FilterSpans to deal with it.

        Show
        Robert Muir added a comment - Here is that second approach ( LUCENE-6393 -alternative.patch). I like this because it forces a subclass of FilterSpans to deal with it.
        Hide
        Robert Muir added a comment -

        Because of the Spans contract that there is always at least one start/end position per doc, I considered adding a firstStartPosition method that should be called before the first nextStartPosition call.
        That would reduce the need for the atFirstInCurrentDoc flag that is used in quite a few Spans, and that might help to avoid problems like this one.

        Well, I'm not sure it would avoid all cases of this problem, which IMO is a trap in FilterSpans. If we return null in FilterSpans by default, thats also a trap (just a performance trap instead).

        Separately, it might be a good idea, but maybe we should add the remaining two-phase support first (SpanOrQuery, SpanNotQuery) and then try to look at refactoring after have a clear picture of all the stuff involved? I started with this query because it seemed easiest. The next step would be to give SpanOr and SpanNot logic similar to DisjunctionScorer and ReqExclScorer. Another step after that is to see if they can share two-phase code (like SpanNear does with ConjunctionDISI).

        Show
        Robert Muir added a comment - Because of the Spans contract that there is always at least one start/end position per doc, I considered adding a firstStartPosition method that should be called before the first nextStartPosition call. That would reduce the need for the atFirstInCurrentDoc flag that is used in quite a few Spans, and that might help to avoid problems like this one. Well, I'm not sure it would avoid all cases of this problem, which IMO is a trap in FilterSpans. If we return null in FilterSpans by default, thats also a trap (just a performance trap instead). Separately, it might be a good idea, but maybe we should add the remaining two-phase support first (SpanOrQuery, SpanNotQuery) and then try to look at refactoring after have a clear picture of all the stuff involved? I started with this query because it seemed easiest. The next step would be to give SpanOr and SpanNot logic similar to DisjunctionScorer and ReqExclScorer. Another step after that is to see if they can share two-phase code (like SpanNear does with ConjunctionDISI).
        Hide
        Paul Elschot added a comment -

        With code shared between normal and two phase the code is actually easier to write, see the patch at LUCENE-6083.
        For two phase return a boolean indicating a match, for normal iterate to a matching docID as necessary and return it, and in both cases prepare the first position if there is a match.

        Show
        Paul Elschot added a comment - With code shared between normal and two phase the code is actually easier to write, see the patch at LUCENE-6083 . For two phase return a boolean indicating a match, for normal iterate to a matching docID as necessary and return it, and in both cases prepare the first position if there is a match.
        Hide
        Paul Elschot added a comment - - edited

        Patch of 4 April 2015.
        This removes the asTwoPhaseIterator method in FilterSpans.java and the @AwaitsFix lines in /TestSpanSearchEquivalence.
        The patch also deletes TestFilterSpans, which only repeats checks already done by the compiler.

        Show
        Paul Elschot added a comment - - edited Patch of 4 April 2015. This removes the asTwoPhaseIterator method in FilterSpans.java and the @AwaitsFix lines in /TestSpanSearchEquivalence. The patch also deletes TestFilterSpans, which only repeats checks already done by the compiler.
        Hide
        Paul Elschot added a comment -

        I missed the earlier two patches, I'll take a look at them later.

        Show
        Paul Elschot added a comment - I missed the earlier two patches, I'll take a look at them later.
        Hide
        Paul Elschot added a comment -

        The current patches partially overlap, and I see no conflicts between them.

        Show
        Paul Elschot added a comment - The current patches partially overlap, and I see no conflicts between them.
        Hide
        Robert Muir added a comment -

        Sorry the patch I think is the best solution is LUCENE-6393-alternative.patch

        The other one was just my first stab attempt, it does not actually fix FilterSpans.

        Show
        Robert Muir added a comment - Sorry the patch I think is the best solution is LUCENE-6393 -alternative.patch The other one was just my first stab attempt, it does not actually fix FilterSpans.
        Hide
        Paul Elschot added a comment -

        A first impression the alternative patch.
        In case for Spans it makes sense to enforce two phase iteration everywhere except in TermSpans, it could be done in for example in an abstract class TwoPhaseSpans (a Spans subclass) that should then be used for all other Spans than TermSpans.
        Doing that in FilterSpans as in the alternative patch could be a first step for that.

        About this TODO:

        // TODO: can/should we combine this with the logic above?
        +    public boolean twoPhaseCurrentDocMatches
        

        Such sharing can be done by calling this twoPhaseCurrentDocMatches from the toMatchDoc method that is normally used at the end of nextDoc() and advance(). The SpanContain... queries at LUCENE-6083 have this in ContainSpans.java:

        +  @Override
        +  int toMatchDoc() throws IOException {
        +    oneExhaustedInCurrentDoc = false;
        +    while (true) {
        +      if (twoPhaseCurrentDocMatches()) {
        +        return docID();
        +      }
        +      if (nextDoc() == NO_MORE_DOCS) {
        +        return NO_MORE_DOCS;
        +      }
        +    }
        +  }
        

        I tried to share the same method also with the NearSpansUn/Ordered classes, but when I did that some tests failed (maybe because the atFirstInCurrentDoc flag is not yet handled correctly.)

        Show
        Paul Elschot added a comment - A first impression the alternative patch. In case for Spans it makes sense to enforce two phase iteration everywhere except in TermSpans, it could be done in for example in an abstract class TwoPhaseSpans (a Spans subclass) that should then be used for all other Spans than TermSpans. Doing that in FilterSpans as in the alternative patch could be a first step for that. About this TODO: // TODO: can/should we combine this with the logic above? + public boolean twoPhaseCurrentDocMatches Such sharing can be done by calling this twoPhaseCurrentDocMatches from the toMatchDoc method that is normally used at the end of nextDoc() and advance(). The SpanContain... queries at LUCENE-6083 have this in ContainSpans.java: + @Override + int toMatchDoc() throws IOException { + oneExhaustedInCurrentDoc = false ; + while ( true ) { + if (twoPhaseCurrentDocMatches()) { + return docID(); + } + if (nextDoc() == NO_MORE_DOCS) { + return NO_MORE_DOCS; + } + } + } I tried to share the same method also with the NearSpansUn/Ordered classes, but when I did that some tests failed (maybe because the atFirstInCurrentDoc flag is not yet handled correctly.)
        Hide
        Paul Elschot added a comment - - edited

        Looking again at the code above, I think it recurses with nextDoc() and toMatchDoc() calling each other, and that was not the intention.
        This can be fixed by using only the first part of the implementation of nextDoc() in toMatchDoc().

        Show
        Paul Elschot added a comment - - edited Looking again at the code above, I think it recurses with nextDoc() and toMatchDoc() calling each other, and that was not the intention. This can be fixed by using only the first part of the implementation of nextDoc() in toMatchDoc().
        Hide
        Robert Muir added a comment -

        Updated patch with Paul's idea.

        I like it, maybe we should consider this logic being migrated to FilterSpans... but we could do this as a second step?

        Show
        Robert Muir added a comment - Updated patch with Paul's idea. I like it, maybe we should consider this logic being migrated to FilterSpans... but we could do this as a second step?
        Hide
        Paul Elschot added a comment - - edited

        Lots of deleted lines, good sign.
        For migrating this logic I think one step at a time should work. Let's see what the other Span queries need for two phase iteration.
        The others are SpanOr/MultiTerm and PayloadTermQuery, or did I miss one?

        Show
        Paul Elschot added a comment - - edited Lots of deleted lines, good sign. For migrating this logic I think one step at a time should work. Let's see what the other Span queries need for two phase iteration. The others are SpanOr/MultiTerm and PayloadTermQuery, or did I miss one?
        Hide
        Robert Muir added a comment -

        I think we can do SpanNot as well, similar to what ReqExclScorer does.

        Show
        Robert Muir added a comment - I think we can do SpanNot as well, similar to what ReqExclScorer does.
        Hide
        ASF subversion and git services added a comment -

        Commit 1671420 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1671420 ]

        LUCENE-6393: Add two-phase support to SpanPositionCheckQuery and subclasses

        Show
        ASF subversion and git services added a comment - Commit 1671420 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1671420 ] LUCENE-6393 : Add two-phase support to SpanPositionCheckQuery and subclasses
        Hide
        Robert Muir added a comment -

        I used one final trick in the two-phase for FilterSpans:

          @Override
          public TwoPhaseIterator asTwoPhaseIterator() {
            final TwoPhaseIterator inner = in.asTwoPhaseIterator();
            if (inner != null) {
              // wrapped instance has an approximation
              return new TwoPhaseIterator(inner.approximation()) {
                @Override
                public boolean matches() throws IOException {
                  return inner.matches() && twoPhaseCurrentDocMatches();
                }
              };
            } else {
              // wrapped instance has no approximation, but 
              // we can still defer matching until absolutely needed.
              return new TwoPhaseIterator(in) {
                @Override
                public boolean matches() throws IOException {
                  return twoPhaseCurrentDocMatches();
                }
              };
            }
          }
        

        This gives two-phase support for stuff like SpanFirst(SpanTerm). We may want to make this easier later (e.g. mandate approximations in Spans and give SpanTerm the obvious sorta-stupid implementation so code is easier). But for now lets get support in all the spans so we see what everything looks like.

        Show
        Robert Muir added a comment - I used one final trick in the two-phase for FilterSpans: @Override public TwoPhaseIterator asTwoPhaseIterator() { final TwoPhaseIterator inner = in.asTwoPhaseIterator(); if ( inner != null ) { // wrapped instance has an approximation return new TwoPhaseIterator( inner .approximation()) { @Override public boolean matches() throws IOException { return inner .matches() && twoPhaseCurrentDocMatches(); } }; } else { // wrapped instance has no approximation, but // we can still defer matching until absolutely needed. return new TwoPhaseIterator(in) { @Override public boolean matches() throws IOException { return twoPhaseCurrentDocMatches(); } }; } } This gives two-phase support for stuff like SpanFirst(SpanTerm). We may want to make this easier later (e.g. mandate approximations in Spans and give SpanTerm the obvious sorta-stupid implementation so code is easier). But for now lets get support in all the spans so we see what everything looks like.
        Hide
        ASF subversion and git services added a comment -

        Commit 1671421 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1671421 ]

        LUCENE-6393: Add two-phase support to SpanPositionCheckQuery and subclasses

        Show
        ASF subversion and git services added a comment - Commit 1671421 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1671421 ] LUCENE-6393 : Add two-phase support to SpanPositionCheckQuery and subclasses
        Hide
        Robert Muir added a comment -

        Thanks for the help here Paul.

        Show
        Robert Muir added a comment - Thanks for the help here Paul.
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development