Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 5.5, 6.x
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Example for a nested SpanQuery that is not working:

      Document: Human Genome Organization , HUGO , is trying to coordinate gene mapping research worldwide.

      Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 0, true), body:gene]), body:research], 0, true)

      The query should match "coordinate gene mapping research" as well as "coordinate gene research". It does not match "coordinate gene mapping research" with Lucene 5.5 or 6.1; it did, however, match with Lucene 4.10.4. It probably stopped working with the changes to SpanQueries in 5.3. I will attach a unit test that shows the problem.
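The mismatch can be illustrated outside Lucene with a small sketch (hypothetical code, not Lucene's implementation): each clause yields candidate (start, end) spans over the token positions of the example document, and an ordered near match with slop 0 requires each clause to start where the previous one ends. A matcher that commits to the first alternative a spanOr offers misses the match that exhaustive backtracking finds:

```python
# Hypothetical sketch, not Lucene code: clauses are lists of candidate
# (start, end) spans over token positions of the example document, where
# "coordinate"=9, "gene"=10, "mapping"=11, "research"=12.
def ordered_match(clauses, slop=0, greedy=False):
    def search(i, prev_end, used_slop):
        if i == len(clauses):
            return used_slop <= slop
        cands = [s for s in clauses[i] if s[0] >= prev_end]
        if greedy:
            cands = cands[:1]  # commit to the first candidate, no backtracking
        return any(search(i + 1, end, used_slop + start - prev_end)
                   for start, end in cands)
    return any(search(1, end, 0) for _, end in clauses[0])

coordinate = [(9, 10)]
gene_or = [(10, 11), (10, 12)]   # spanOr: bare "gene" first, then "gene mapping"
research = [(12, 13)]

# Backtracking finds "coordinate gene mapping research"; the greedy variant
# commits to the bare "gene" span, leaves a gap before "research", and misses it.
print(ordered_match([coordinate, gene_or, research], 0, greedy=False))  # True
print(ordered_match([coordinate, gene_or, research], 0, greedy=True))   # False
```

The greedy flag is a stand-in for the lazy iteration the discussion below attributes to the post-5.3 implementation; the real code does not literally enumerate candidate lists like this.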

      1. LUCENE-7398.patch
        30 kB
        Paul Elschot
      2. LUCENE-7398.patch
        5 kB
        Alan Woodward
      3. LUCENE-7398.patch
        5 kB
        Alan Woodward
      4. LUCENE-7398-20160814.patch
        11 kB
        Paul Elschot
      5. LUCENE-7398-20160924.patch
        32 kB
        Paul Elschot
      6. LUCENE-7398-20160925.patch
        37 kB
        Paul Elschot
      7. TestSpanCollection.java
        9 kB
        Christoph Goller

        Issue Links

          Activity

          tallison@mitre.org Tim Allison added a comment -

          First question:
          Is there any utility in a patch that would throw an exception if it spotted a sibling that shares a term with a niece (or descendant) as in the original problem? At least let the user know that the returned docs might not be correct? My little app relies on SpanQueries for retrieval, and this bug is potentially a real problem.

          Second question: would it be of any use if I supplied a PR for @Ignore'd test cases that currently fail?

          Third question:
          For the actual solution, do we have to back off to full dynamic programming (Earley or more modern equivalents)? Or will Paul Elschot's recommendations above work?

          paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

          One way to view the problem is that when span end positions are used to determine the slop, it becomes impossible to determine an order for moving the subspans to a next position.

          So one direction out of this could be: use NearSpans that determines the slop only by the start positions of the subspans. That leaves only the cases in which the subspans can start (and maybe also end) at the same position.
          To make sure that all the subspans move forward after a match we could move them all forward until after the current match, and while doing that also count/collect them for scoring/highlighting as long as they are within the match. That should solve the bug reported here, which is about scoring a missed matching occurrence.

          This limits the required slop to using only the starting positions of the subspans. Could this work?
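As a rough sketch of that idea (assumed semantics for illustration, not an implementation proposal): if slop is computed from start positions only, an ordered match is charged just for the positions skipped between consecutive subspan starts, independent of how long each subspan is:

```python
# Assumed start-position-only semantics (a sketch, not Lucene's definition):
# the slop of an ordered match is the number of positions skipped between
# consecutive subspan start positions, ignoring subspan lengths entirely.
def start_only_slop(starts):
    return sum(max(0, nxt - cur - 1) for cur, nxt in zip(starts, starts[1:]))

# Subspans starting at 9, 10 and 12: only position 11 is skipped.
print(start_only_slop([9, 10, 12]))  # 1
```

Because each sub-Spans iterator advances by start position, this quantity is monotone under iteration, which is what would make a deterministic advance order possible.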

          alukanin Artem Lukanin added a comment -

          The patch has a bug. The following sentence is not found, because the look-ahead is too greedy: "the system of claim 16 further comprising a user location unit adapted to determine user location based on location information received from the user's device"

            @Test
            public void testNestedOrQueryLookAhead() throws IOException {
              SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                  .addClause(new SpanOrQuery(
                      new SpanTermQuery(new Term(FIELD, "user")),
                      new SpanTermQuery(new Term(FIELD, "ue"))
                  ))
                  .addClause(new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                      .setSlop(3)
                      .addClause(new SpanTermQuery(new Term(FIELD, "location")))
                      .addClause(new SpanTermQuery(new Term(FIELD, "information")))
                      .build()
                  )
                  .build();
          
              Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
              assertEquals(6, spans.advance(0));
              assertEquals(Spans.NO_MORE_DOCS, spans.nextDoc());
            }
          

          The fix is simple: add an additional check inside shrinkToDecreaseSlop():

            /** The subSpans are ordered in the same doc and matchSlop is too big.
             * Try and decrease the slop by calling nextStartPosition() on all subSpans except the last one in reverse order.
             * Return true iff an ordered match was found with small enough slop.
             */
            private boolean shrinkToDecreaseSlop() throws IOException {
              int lastStart = subSpans[subSpans.length - 1].startPosition();
          
    for (int i = subSpans.length - 2; i >= 1; i--) { // intermediate spans for subSpans.length >= 3
      Spans prevSpans = subSpans[i];
      int prevStart = prevSpans.startPosition();
      int prevEnd = prevSpans.endPosition();
      while (true) { // Advance prevSpans until it is after (lastStart, lastEnd) or the slop increases.
        if (prevSpans.nextStartPosition() == NO_MORE_POSITIONS) {
          oneExhaustedInCurrentDoc = true;
          break; // Check remaining subSpans for final match in current doc
        } else {
          int ppEnd = prevSpans.endPosition();
          if (ppEnd > lastStart) { // no more ordered
            break; // Check remaining subSpans.
          } else { // prevSpans still before lastStart
            int ppStart = prevSpans.startPosition();
            int slopIncrease = (prevEnd - prevStart) - (ppEnd - ppStart); // span length decrease is slop increase
            if (slopIncrease > 0) {
              break; // Check remaining subSpans.
            } else { // slop did not increase
              prevStart = ppStart;
              prevEnd = ppEnd;
              matchSlop += slopIncrease;
            }
          }
        }
      }
      lastStart = prevStart;
    }
          
              while (true) { // for subSpans[0] only the end position influences the match slop.
                int prevEnd = subSpans[0].endPosition();
                if (subSpans[0].nextStartPosition() == NO_MORE_POSITIONS) {
                  oneExhaustedInCurrentDoc = true;
                  break;
                }
                int ppEnd = subSpans[0].endPosition();
                if (ppEnd > lastStart) { // no more ordered
                  break;
                }
                int slopIncrease = prevEnd - ppEnd;
                if (slopIncrease > 0) {
                  break;
                }
                // slop did not increase:
                matchStart = subSpans[0].startPosition();
                matchSlop += slopIncrease;
          
                // FIX STARTS
                if (matchSlop <= allowedSlop) {
                  break;
                }
                // FIX ENDS
              }
          
              firstSubSpansAfterMatch = true;
              boolean match = matchSlop <= allowedSlop;
              return match; // ordered and allowed slop
            }
          

          Sorry for not providing a new patch. I'm on a previous version of Lucene.

          alukanin Artem Lukanin added a comment -

          Actually testNestedOrQuery4 works if I setSlop(3). I forgot to take into account one more span when transferring binary-nested clauses into a 3-clause SpanNearQuery.

          alukanin Artem Lukanin added a comment -

          This issue describes only part of the problem, the case where 2 SpanTerms are at the same position inside a SpanOr. But there is a more general problem when the SpanTerms have different positions. For example, if I want to find the text "aa bb fineness cc ee colority dd", these queries will not find it, because only the first of the two SpanTerms is taken into account:

            @Test
            public void testNestedOrQuery3() throws IOException {
              SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                  .addClause(
                      new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                          .addClause(new SpanTermQuery(new Term(FIELD, "aa")))
                          .addClause(new SpanOrQuery(
                              new SpanTermQuery(new Term(FIELD, "bb")),
                              new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                                  .addClause(new SpanTermQuery(new Term(FIELD, "cc")))
                                  .addClause(new SpanTermQuery(new Term(FIELD, "ee")))
                                  .setSlop(2)
                                  .build()
                          ))
                          .setSlop(2)
                          .build()
                  )
                  .addClause(new SpanTermQuery(new Term(FIELD, "dd")))
                  .setSlop(2)
                  .build();
          
              Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
              assertEquals(8, spans.advance(8));
            }
          
            @Test
            public void testNestedOrQuery4() throws IOException {
              SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                  .addClause(new SpanTermQuery(new Term(FIELD, "aa")))
                  .addClause(new SpanOrQuery(
                      new SpanTermQuery(new Term(FIELD, "bb")),
                      SpanNearQuery.newOrderedNearQuery(FIELD)
                          .addClause(new SpanTermQuery(new Term(FIELD, "cc")))
                          .addClause(new SpanTermQuery(new Term(FIELD, "ee")))
                          .setSlop(2)
                          .build()
                  ))
                  .addClause(new SpanTermQuery(new Term(FIELD, "dd")))
                  .setSlop(2)
                  .build();
          
              Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
              assertEquals(8, spans.advance(8));
            }
          

          Also, the patch only works for a SpanNear with more than 2 subclauses; the same test with binary-nested clauses does not work either:

            @Test
            public void testNestedOrQuery2() throws IOException {
              SpanNearQuery snq = new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                  .addClause(
                      new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                          .addClause(new SpanTermQuery(new Term(FIELD, "coordinate")))
                          .addClause(new SpanOrQuery(
                              new SpanTermQuery(new Term(FIELD, "gene")),
                              new SpanNearQuery.Builder(FIELD, SpanNearQuery.MatchNear.ORDERED_LOOKAHEAD)
                                  .addClause(new SpanTermQuery(new Term(FIELD, "gene")))
                                  .addClause(new SpanTermQuery(new Term(FIELD, "mapping")))
                                  .build()
                          ))
                          .build()
                  )
                  .addClause(new SpanTermQuery(new Term(FIELD, "research")))
                  .build();
          
              Spans spans = snq.createWeight(searcher, false).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
              assertEquals(4, spans.advance(4));
              assertEquals(5, spans.nextDoc());
            }
          
          paul.elschot@xs4all.nl Paul Elschot added a comment -

          Michael McCandless, the patch is as you stated, and having MatchNear as an enum to choose the matching method is easy to extend.
          I would not mind having some more opinions on whether the progress is enough to actually add the code.

          I know this MG4J paper, and it could well be that Theorem 11 there proves that no lazy algorithm is possible for the general case with more than 2 subqueries, but for now I cannot really follow their terminology. In particular, I'd like to know whether or not these efficient algorithms correspond to the current lazy implementations in Lucene. I'm hoping that they do not, because then there might be some room for improvement in Lucene without losing speed.

          As Christoph Goller stated above:

          I want a higher score if the user-query matches for more than one variant

          I don't think the ORDERED_LOOKAHEAD of the patch does that, because it only matches one variant.
          I hope that there is a non-backtracking implementation that can do this, but I'm not sure.

          mikemccand Michael McCandless added a comment -

          Paul Elschot I was able to apply the patch to current master (there was one minor conflict) and tests passed.

          I looked at the patch, and it looks like it generalizes the previous boolean ordered into a trilean, adding a new ORDERED_LOOKAHEAD option that resolves some (not all) of the issues raised here, right? And the other two options match the ordered=true and ordered=false cases today, so we have back compat? So net/net, while this is a public API change, it is a step forward ... progress not perfection.

          I wonder if (separately!) we could explore adding minimal interval semantics to Lucene, like MG4J: http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf ... I don't know much about it, but it sounds compelling and efficient.
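For reference, the core of minimal-interval AND semantics can be sketched in a few lines (a toy illustration of the idea behind the linked paper, not the paper's actual algorithm): repeatedly emit the interval spanned by the current position of every term, advance the leftmost term, and finally keep only intervals that contain no other candidate:

```python
# Toy sketch of minimal-interval AND semantics, after the MG4J paper linked
# above (illustrative only, not the paper's exact algorithm).
# postings: one sorted position list per term.
def minimal_intervals(postings):
    idx = [0] * len(postings)
    cands = []
    while all(i < len(p) for i, p in zip(idx, postings)):
        cur = [p[i] for i, p in zip(idx, postings)]
        cands.append((min(cur), max(cur)))      # interval covering all terms
        idx[cur.index(min(cur))] += 1           # advance the leftmost term
    # an interval is minimal iff it contains no other candidate interval
    return [c for c in cands
            if not any(d != c and c[0] <= d[0] and d[1] <= c[1] for d in cands)]

# Terms at positions [1, 5] and [3]: two minimal intervals cover both terms.
print(minimal_intervals([[1, 5], [3]]))  # [(1, 3), (3, 5)]
# Terms at [1, 2] and [3]: (1, 3) contains (2, 3), so only (2, 3) is minimal.
print(minimal_intervals([[1, 2], [3]]))  # [(2, 3)]
```

The quadratic containment filter is for clarity; the paper's point is that the minimal intervals can be produced lazily in linear time per posting.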

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          "is this latest patch ready to be committed, or are there still known problems?"

          Both actually, assuming that master has not had a conflicting update since.
          To completely solve this, backtracking is needed, and the patch does not provide that.

          To allow collecting/payloads easily, I'd rather accept the limitations/bugs of the current lazy implementation.
          As a minimum, a reference to this issue could be added to the javadocs of the (un)ordered near spans.

          AFAIK:

          • a complete solution that can be made with lazy iteration is a span near query that has two subqueries
            and that only checks the span starting positions,
          • for subqueries that are terms or that do not vary in length, completeness for two subqueries is already there.

          In case there is interest in span near queries that only use starting positions, well, that should be easy.

          mikemccand Michael McCandless added a comment -

          I can't quite tell from the comments/iterations here: is this latest patch ready to be committed, or are there still known problems?

          Alternatively, should we maybe revert the lazy iteration change (LUCENE-6537) if it is the root cause that broke previous cases?

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          Patch of 4 Oct 2016.

          This is the patch of 25 Sep 2016, but without the UNORDERED_STARTPOS case.

          In a nutshell this:

          • adds ORDERED_LOOKAHEAD,
          • is backward compatible,
          • tries to document the limitations of the matching methods for SpanNearQuery.
          paul.elschot@xs4all.nl Paul Elschot added a comment -

          Patch of 25 Sep 2016.
          Compared to the previous patch, this removes the ORDERED_STARTPOS case, because I don't know whether that is needed.
          Also this restores backward compatibility.

          Compared to master, this has:
          Four MatchNear methods: two are the current ones, called ORDERED_LAZY and UNORDERED_LAZY, and these are used when the current builder and constructors use a boolean ordered argument.

          The third case is ORDERED_LOOKAHEAD, which is from the patch of 18 August.

          The last case is UNORDERED_STARTPOS, which is simpler than UNORDERED_LAZY, hopefully a little faster, and with better completeness of the result.

          Javadocs for all four cases have been added.

          All test cases from here have been added, and where necessary they have been modified to use ORDERED_LOOKAHEAD and to not do span collection. These tests pass.

          For the last case, UNORDERED_STARTPOS, no test cases have been added yet. This is still to be done. Does anyone have more difficult cases?

          Minor point: the collect() method was moved to the superclass ConjunctionSpans.

          Feedback welcome, especially on the javadocs of SpanNearQuery.MatchNear.

          Instead of adding backtracking methods, it might be better to do counting of input spans in a matching window. I'm hoping that the UNORDERED_STARTPOS case can be extended for that. Any ideas there?

          paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

          Patch of 24 Sep 2016, work in progress. Edit: superseded on 25 Sep, this can be ignored.

          This introduces SpanNearQuery.MatchNear to choose the matching method.

          The ORDERED_LAZY case is still the patch of 14 August, this should be changed back to the current implementation, and be used to implement ORDERED_LOOKAHEAD.

          This implements MatchNear.UNORDERED_STARTPOS and uses that as the default implementation for the unordered case.
          The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, which is simpler than the current NearSpansUnordered, there is no SpansCell.
          I'd expect this StartPos implementation to be a little faster, so I also implemented it as default for the unordered case. In only one test case the UNORDERED_LAZY method is needed to pass the test.

          The question is whether it is ok to change the default unordered implementation to only use the span start positions.

          The collect() method is moved to the superclass ConjunctionSpans, this simplification might be done at another issue.

          paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

          The idea is to allow full backward compatibility, as well as more matching methods:

          UNORDERED_LAZY is the current unordered,
          UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it should be complete.
          ORDERED_LAZY is the current ordered,
          ORDERED_LOOKAHEAD is in the patch of 14 August 2016,
          ORDERED_STARTPOS also only uses start positions, so it should be complete.

          The complete ORDERED and UNORDERED cases that use start and end positions and need backtracking are left for later.

          Comments?

          paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

          I have started working on a SpanNearQuery that contains this:

            /** Specifies how clauses are to occur near each other in matching documents. */
            public static enum MatchNear {
          
              /** Use this method for clauses that match when they are not ordered,
               * and the slop should be determined between the end and start positions of all clauses.
               * When the subspans vary in length, some matches may not be found.
               */
              UNORDERED_LAZY,
          
              /** Use this method for clauses that match when they are not ordered,
               * and the slop should be determined between the start positions of the first and last matching clauses.
               */
              UNORDERED_STARTPOS,
          
              /** Use this method for clauses that can match when they are ordered and span collection is needed,
               * and the slop should be determined between the end and start positions of the clauses.
               * When the subspans vary in length, some matches may not be found.
               */
              ORDERED_LAZY,
          
              /** Use this method for clauses that can match when they are ordered and span collection is needed,
               * and the slop should be determined between the end and start positions of the clauses.
               * When the subspans vary in length, some matches may not be found,
             * however this method finds more matches than {@link #ORDERED_LAZY}.
               */
              ORDERED_LOOKAHEAD,
          
              /** Use this method for clauses that match when they are ordered,
               * and the slop should be determined between the start positions of the first and last matching clauses.
               */
              ORDERED_STARTPOS
            }
          
          
          paul.elschot@xs4all.nl Paul Elschot added a comment -

          As to missing matches due to lazy iteration, I'd prefer to add an option allowing a choice between the current behaviour, the above patch (which I think is slightly better than the earlier 4.10 behaviour), one that misses no matches, and perhaps more.
          For example, would anyone like a SpanWindowQuery that only uses span start positions? That would at least allow an easy complete implementation.
          And we need to document the current behaviour: ordered means no overlap, unordered allows overlap.

          To improve scoring consistency, we could start by requiring that span near queries score the same as phrases.
          There is a problem for nested span queries in that current similarities have a tf component over a complete document field, and this tf does not play well with the sloppy frequency for SpanNear over SpanOr. I'd like each term occurrence of a SpanTerm to contribute the same (idf-like) weight to a SpanNear, but that currently cannot be done because the spans of a SpanOr do not have a weight. So when mixing terms with SpanOr it will be hard to get the same scoring as a boolean Or over PhraseQueries. I don't know how to resolve this; we may have to add something to the similarities for this.
          SpanBoostQuery would only make sense when the individual Spans occurrences can carry a weight.
          I'd prefer span scoring consistency to have its own jira issue(s).

          goller@detego-software.de Christoph Goller added a comment -

          I just found that the LUCENE-2878 work/branch may contain some interesting ideas about scoring and proximity search / Span*Queries.

          goller@detego-software.de Christoph Goller added a comment - - edited

          After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to reach a consensus about the desired behavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one.

          1) Missing Matches caused by lazy iteration:

          I think lazy iteration is not a new thing in Lucene's SpanNearQuery. As far as I know there never was a Lucene implementation that compared all possible combinations of subspan matches for SpanNearQuery, so SpanNearQuery has always missed some matches.

          *) This ticket demonstrates missing matches for ordered SpanQuery: documents that should match don't match. This is caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation, which tries to minimize the number of comparisons of subspan matches, is not sufficient.

          *) Tim tried these examples with unordered SpanQuery and got the same behavior. I think this is caused by a similar kind of lazy iteration in the unordered case.

          *) In the unordered case lazy iteration also causes problems even if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In those examples all clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScorer this seems to be a known problem that has been solved by a special, complex treatment of repetitions that I currently don't understand in detail.

          My current opinion: we should give up lazy iteration in both the unordered and the ordered case to solve these problems. I think it can be done, and the performance penalty should not be too big. We already iterate over all positions of all subspans, so we have already done the expensive work of reading them. Should a few more comparisons of int values (positions) really matter that much? At least for the ordered case I am optimistic that I could implement it efficiently.

          2) Inconsistent Scoring of SpanNearQuery

          *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how "redundant" was defined exactly for the idea of lazy iteration; it somehow referred to matches with the same start position. Different matches for the first clause are found, but not all matches for intermediate subclauses are considered. Skipping matches, however, reduces the frequency that is computed and consequently the score. See the Javadoc of phraseFreq() in SloppyPhraseScorer, which mentions the same phenomenon. This is quite important for my use case of SpanQueries: I have different versions/variants of the same term at the same position, e.g. one with case-normalization and one without, and I want a higher score if the user query matches more than one variant; I use this approach for clauses of SpanNearQuery.

          *) In NearSpansOrdered the method width() (used to compute the sloppy frequency in SpanScorer) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it returns the length of the match, not the number of gaps; see atMatch() for the difference. The reason is probably that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength).
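          As an illustrative sketch of the suggested clamping (the names mirror the NearSpansUnordered computation, but this is not the actual class):

```java
// Illustrative only: clamp the unordered match width at zero, so that
// overlapping sub-matches (where maxEnd - minStart - totalSpanLength
// goes negative) cannot produce a negative width and distort the score.
class UnorderedWidth {
  static int width(int minStart, int maxEnd, int totalSpanLength) {
    return Math.max(0, maxEnd - minStart - totalSpanLength);
  }
}
```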

          *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, does it? In my opinion the existence of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of its clauses to compute its own score.

          goller@detego-software.de Christoph Goller added a comment -

          Good idea to try the nested tests from TestSpanCollection for the unordered case. The example from LUCENE-5331 shows the problems of incomplete backtracking (not comparing all combinations of span matches of all subspans) for the unordered case. In the ordered case we only have a problem with spans whose matches have different lengths; in the unordered case we also see a problem with overlapping span matches, even if they all have length 1.

          goller@detego-software.de Christoph Goller added a comment - - edited

          Paul's 20160814 patch almost convinced me. Unfortunately, it does not fix the case where an intermediate span has a longer match that reduces the overall sloppiness but overlaps with a match of a subsequent span, and consequently requires advancing that subsequent span. Here is an example:

          Document: w1 w2 w3 w4 w5
          near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

          Add the following code to the end of TestSpanCollection.testNestedNearQuery():

          SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
          SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
          SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
          SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
          SpanOrQuery q455 = new SpanOrQuery(q45, p5);
                  
          SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 0, true);
          spans = q1q2234q445.createWeight(searcher, false, 1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
          assertEquals(0, spans.advance(0));
          

          I think we can only fix this if we give up lazy iteration. I don't think this is so bad for performance: if we implement clever caching for positions in spans, a complete backtracking would only cost a few additional int comparisons. The expensive operation is iterating over all span positions (IO), and we already do this in advancePosition(Spans, int), don't we?
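          A brute-force sketch of that idea (hypothetical code, not part of any patch here): cache all (start, end) positions per clause, then test every ordered, non-overlapping combination against the slop instead of iterating lazily.

```java
// Hypothetical brute force over cached positions, for illustration only.
class BruteForceOrdered {
  // clausePositions[i] holds the cached {start, end} pairs of clause i,
  // for one document. Returns true iff some ordered, non-overlapping
  // combination of one match per clause stays within the slop.
  static boolean hasOrderedMatch(int[][][] clausePositions, int slop) {
    return search(clausePositions, 0, -1, 0, slop);
  }

  private static boolean search(int[][][] clauses, int i, int prevEnd,
                                int gaps, int slop) {
    if (i == clauses.length) {
      return gaps <= slop;
    }
    for (int[] pos : clauses[i]) {
      int start = pos[0], end = pos[1];
      if (start < prevEnd) {
        continue; // keep the clauses ordered and non-overlapping
      }
      int newGaps = gaps + (prevEnd < 0 ? 0 : start - prevEnd);
      if (newGaps <= slop && search(clauses, i + 1, end, newGaps, slop)) {
        return true;
      }
    }
    return false;
  }
}
```

          For the w1..w5 document above (w1 at position 0), the combination (0,1), (1,4), (4,5) satisfies slop 0; lazy iteration misses it after committing to the shorter match of the middle clause.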

          tallison@mitre.org Tim Allison added a comment -

          The cause of this is different, I think (ordered vs unordered). However, LUCENE-5331 offers another test case that is still failing for nested SpanQueries.

          Based on this, I found that if you switch Paul Elschot's unit tests in the 20160814 patch to unordered SpanNears, the unit tests still fail.

          This is not surprising given that Paul's patch focuses on NearSpansOrdered. In short, please don't take this as a complaint.

          Thank you, all, for your work on this!

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          Patch of 14 August 2016.
          This uses the span position queue from master, and adds the above w1..w5 test case with slop 1.

          This reintroduces a shrink in shrinkToDecreaseSlop:

            /** The subSpans are ordered in the same doc and matchSlop is too big.
             * Try and decrease the slop by calling nextStartPosition() on all subSpans except the last one in reverse order.
             * Return true iff an ordered match was found with small enough slop.
             */
            private boolean shrinkToDecreaseSlop() throws IOException {
            ...
            }
          

          This also adds a FIXME to collect():

            /** FIXME: the subspans may be after the current match. */
          

          Payload collection can still be added, I left that to be done.

          It only fails the test case TestSpanCollection.testNestedNearQuery, reporting that a term was not collected properly; the other search tests pass.

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          To complete the picture here for the ordered case, shrinkToAfterShortestMatch() was replaced by lazy iteration at LUCENE-6537. Some points from there:

          • Lazy iteration should return the same document matches, but it will return some extra Span hits within each document, so scores might be different.
          • Repeated matches from a non-nested ordered span near occur only when the first term repeats and there is enough slop; for query t1 t2 with slop 1:
            t1 t1 t2 matches twice,
            t1 t2 t2 matches once.

          Nevertheless, from the gene research example above one can see that the current lazy iteration misses a document that used to match.

          So, is it possible to change the current implementation so that it matches more documents correctly, while still being lazy?
          Here lazy means that all subspans are only moved forward, and a test for a match is only done after at least one of the subspans has moved forward.

          The current implementation is based on the first subspans moving forward, followed by a stretchToOrder().
          After that, as long as there is no match (i.e. too much slop), we could additionally move each of the intermediate subspans forward until the order is lost.
          (This would be somewhat similar to shrinkToAfterShortestMatch(), but based on the actual slop, and not on the length of the match.)

          Would that help?
          If so, in which order should the intermediate spans be moved forward? shrinkToAfterShortestMatch() used to work backwards, but forwards could also be done.

          goller@detego-software.de Christoph Goller added a comment -

          The whole idea of the patch is to change the order of the matches returned by SpanOrQuery.

          SpanTermQuery q2 = new SpanTermQuery(new Term(FIELD, "w2"));
          SpanTermQuery q3 = new SpanTermQuery(new Term(FIELD, "w3"));
          SpanNearQuery q23 = new SpanNearQuery(new SpanQuery[]{q2, q3}, 0, true);
          SpanOrQuery q223 = new SpanOrQuery(q2, q23);
          

          For a document containing "w1 w2 w3 w4", query q223 now returns "w2 w3" (the longer match) first and then "w2", while formerly it was the other way round. Both matches have the same start position but different end positions, and the contract for spans says that when start positions are equal we first get the match with the lower end position (see the Javadoc of Spans).
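          The contract referred to here amounts to the following ordering (a sketch, not the actual Spans implementation):

```java
import java.util.Comparator;

// Sketch of the documented Spans ordering contract (not Lucene code):
// matches are ordered by start position, and for equal start positions
// the smaller end position comes first.
class SpanOrder {
  // each match is represented here as {startPosition, endPosition}
  static final Comparator<int[]> BY_START_THEN_END =
      Comparator.<int[]>comparingInt(m -> m[0]).thenComparingInt(m -> m[1]);
}
```

          Under this ordering "w2" (1,2) must be returned before "w2 w3" (1,3), which is exactly what the patch changes.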

          dsmiley David Smiley added a comment -

          I consider the patch as potentially dangerous, since SpanOrQuery with the patch provides an ordering of span matches that differs from the general contract that holds for all spans.

          Could you elaborate on what ordering could now happen with Alan's patch that differs from the general contract?

          goller@detego-software.de Christoph Goller added a comment - - edited

          After thoroughly looking into SpanQueries my conclusion is that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new; it probably already existed in the first version of SpanQueries, which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions never completely convinced me.

          My best guess: NearSpansOrdered and NearSpansUnordered are currently only correct if, for each clause of the SpanQuery, we can guarantee that all its matches have the same length. In this case it is clear that (for the ordered case) if a match is too long (sloppy) we can skip to the first clause and call nextPosition(); no alternative matches of intermediate clauses could improve the overall match. If we have clauses with varying match length (SpanOr, or SpanNear with sloppiness) we would have to backtrack to intermediate clauses and check whether there are e.g. longer matches that could reduce the overall match length. Paul's last test case shows that even a match of the second clause that advances its position can reduce the overall length if it is itself longer. A match of an intermediate clause at an advanced position could be considerably shorter than its first match, requiring a reset of the spans of the following clauses. In my opinion this bug can only be fixed by implementing a backtracking search on the subspans, which also requires a limited possibility to reposition Spans to previous positions.

          By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 provided a kind of backtracking, which is the reason why my queries worked in elasticsearch 1.7.x. However, I think that implementation also did not solve all cases:

            /** The subSpans are ordered in the same doc, so there is a possible match.
             * Compute the slop while making the match as short as possible by advancing
             * all subSpans except the last one in reverse order.
             */
            private boolean shrinkToAfterShortestMatch() throws IOException {
              matchStart = subSpans[subSpans.length - 1].start();
              matchEnd = subSpans[subSpans.length - 1].end();
              Set<byte[]> possibleMatchPayloads = new HashSet<>();
              if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
                possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
              }
          
              Collection<byte[]> possiblePayload = null;
              
              int matchSlop = 0;
              int lastStart = matchStart;
              int lastEnd = matchEnd;
              for (int i = subSpans.length - 2; i >= 0; i--) {
                Spans prevSpans = subSpans[i];
                if (collectPayloads && prevSpans.isPayloadAvailable()) {
                  Collection<byte[]> payload = prevSpans.getPayload();
                  possiblePayload = new ArrayList<>(payload.size());
                  possiblePayload.addAll(payload);
                }
                
                int prevStart = prevSpans.start();
                int prevEnd = prevSpans.end();
                while (true) { // Advance prevSpans until after (lastStart, lastEnd)
                  if (! prevSpans.next()) {
                    inSameDoc = false;
                    more = false;
                    break; // Check remaining subSpans for final match.
                  } else if (matchDoc != prevSpans.doc()) {
                    inSameDoc = false; // The last subSpans is not advanced here.
                    break; // Check remaining subSpans for last match in this document.
                  } else {
                    int ppStart = prevSpans.start();
                    int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
                    if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
                      break; // Check remaining subSpans.
                    } else { // prevSpans still before (lastStart, lastEnd)
                      prevStart = ppStart;
                      prevEnd = ppEnd;
                      if (collectPayloads && prevSpans.isPayloadAvailable()) {
                        Collection<byte[]> payload = prevSpans.getPayload();
                        possiblePayload = new ArrayList<>(payload.size());
                        possiblePayload.addAll(payload);
                      }
                    }
                  }
                }
          
                if (collectPayloads && possiblePayload != null) {
                  possibleMatchPayloads.addAll(possiblePayload);
                }
                
                assert prevStart <= matchStart;
                if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
                  matchSlop += (matchStart - prevEnd);
                }
          
                /* Do not break on (matchSlop > allowedSlop) here to make sure
                 * that subSpans[0] is advanced after the match, if any.
                 */
                matchStart = prevStart;
                lastStart = prevStart;
                lastEnd = prevEnd;
              }
              
              boolean match = matchSlop <= allowedSlop;
              
              if(collectPayloads && match && possibleMatchPayloads.size() > 0) {
                matchPayload.addAll(possibleMatchPayloads);
              }
          
              return match; // ordered and allowed slop
            }
            
          

          Unfortunately the patch provided by Alan does not solve the problem. It only reorders span matches of a SpanOrQuery in a special case, since it cannot control the order of span matches of its subspans. I consider the patch potentially dangerous, since with the patch SpanOrQuery provides an ordering of span matches that differs from the general contract that holds for all spans.

          goller@detego-software.de Christoph Goller added a comment - edited
          paul.elschot@xs4all.nl Paul Elschot added a comment -

          The patch of 22:53 hrs works as expected.

          This additional test case fails on doc 0, "w1 w2 w3 w4 w5":

            @Test
            public void testNestedOrQuerySlop() throws IOException {
              SpanNearQuery snq = SpanNearQuery.newOrderedNearQuery(FIELD)
                  .setSlop(1)
                  .addClause(new SpanTermQuery(new Term(FIELD, "w1")))
                  .addClause(new SpanOrQuery(
                      new SpanTermQuery(new Term(FIELD, "w2")),
                      SpanNearQuery.newOrderedNearQuery(FIELD)
                          .addClause(new SpanTermQuery(new Term(FIELD, "w3")))
                          .addClause(new SpanTermQuery(new Term(FIELD, "w4")))
                          .build()
                  ))
                  .addClause(new SpanTermQuery(new Term(FIELD, "w5")))
                  .build();
          
              Spans spans = snq.createWeight(searcher, false, 1).getSpans(searcher.getIndexReader().leaves().get(0), SpanWeight.Postings.POSITIONS);
              assertEquals(0, spans.advance(0));
              assertEquals(Spans.NO_MORE_DOCS, spans.nextDoc());
            }
          

          and it passes when using setSlop(2) instead of setSlop(1).

          Perhaps we should also add the above test case with setSlop(2) and a comment that it actually should pass with setSlop(1).
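The slop arithmetic behind this test case can be made explicit with a small, hypothetical helper (plain Java, not Lucene code). For the document "w1 w2 w3 w4 w5" (positions 0..4), the OR clause can match either w2 = [1,2) or the nested near (w3 w4) = [2,4). Choosing w2 costs a gap of 2 before w5, while choosing (w3 w4) costs only a gap of 1 after w1, so setSlop(1) should match via the second alternative:

```java
/** Hypothetical slop arithmetic for the doc "w1 w2 w3 w4 w5" (positions 0..4). */
public class NestedOrSlop {
  /** Total slop of an ordered match = sum of gaps between consecutive
   *  half-open [start, end) spans. */
  static int slop(int[]... spans) {
    int total = 0;
    for (int i = 1; i < spans.length; i++) {
      total += spans[i][0] - spans[i - 1][1];
    }
    return total;
  }

  public static void main(String[] args) {
    // Alternative 1: w1, w2, w5 -> gaps 0 + 2 = 2 (fails setSlop(1))
    System.out.println(slop(new int[]{0, 1}, new int[]{1, 2}, new int[]{4, 5}));
    // Alternative 2: w1, (w3 w4), w5 -> gaps 1 + 0 = 1 (within setSlop(1))
    System.out.println(slop(new int[]{0, 1}, new int[]{2, 4}, new int[]{4, 5}));
  }
}
```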

          romseygeek Alan Woodward added a comment -

          Patch against master.

          romseygeek Alan Woodward added a comment -

          Ah, looks like I made the patch against 6x, will re-up against master.

          The specific problem here isn't to do with look-aheads, it's with overlapping Spans within a SpanOr when it's not the leading Span for a near query. If two clauses start at the same position, but one is longer than the other, then we need to check the longer Span first, as it will cover both cases; if the shorter Span is checked first and fails, then the leading Span will be incremented, rather than the Or.
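The ordering this describes can be sketched as a comparator (an assumed illustration, not the actual SpanPositionQueue code): spans are still ordered by start position, but at equal start positions the longer span comes first, so a failing shorter span cannot cause the leading clause to be advanced prematurely:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of the proposed ordering: start ascending, and at the same
 *  start position, longer spans (greater end) before shorter ones. */
public class SpanOrdering {
  record Span(int start, int end) {}

  static final Comparator<Span> LONGEST_FIRST_AT_SAME_START =
      Comparator.comparingInt(Span::start)
                .thenComparing(Comparator.comparingInt(Span::end).reversed());

  public static void main(String[] args) {
    // "gene mapping" [10,12) and "gene" [10,11) start at the same position.
    List<Span> spans = new ArrayList<>(List.of(new Span(10, 11), new Span(10, 12)));
    spans.sort(LONGEST_FIRST_AT_SAME_START);
    System.out.println(spans); // the longer span [10,12) is visited first
  }
}
```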

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          I applied the patch to master commit b3505298a5bef76ff83b269bf87a179d027da849 .
          With the patch applied, TestSpanCollection does not compile, there is no applicable createWeight() method a few times.
          Probably there was a recent conflict there, so I could not actually verify the patch.

          paul.elschot@xs4all.nl Paul Elschot added a comment - - edited

          The patch changes the order in SpanPositionQueue which is used by SpanOr.
          This works to pass the test case, and it does not increase complexity.

          I think the problem is in NearSpansOrdered.stretchToOrder() which only does this:

          matchEnd = subSpans[subSpans.length - 1].endPosition();

          What it should also do is look ahead to see whether there is an ordered match with a smaller slop.

          It could be that there is still a failing case with a nested SpanOr, possibly containing another nested SpanNear, but I'm not sure; this is tricky.

          Since looking ahead increases the complexity (normal case runtime) I'd prefer to have the patch applied now, and see what the future brings.

          romseygeek Alan Woodward added a comment -

          Patch with fix, incorporating Christoph's test - will commit tomorrow.

          romseygeek Alan Woodward added a comment -

          I think this is a bug in SpanPositionQueue, where longer Spans should be sorted before shorter ones at the same position - will dig and see if I can get a fix in shortly.

          goller@detego-software.de Christoph Goller added a comment - - edited

          Please find attached an extended TestSpanCollection.java for Lucene 6.1 that shows the problem.


            People

            • Assignee:
              romseygeek Alan Woodward
              Reporter:
              goller@detego-software.de Christoph Goller
            • Votes:
              5
              Watchers:
              16
