Lucene - Core
  1. Lucene - Core
  2. LUCENE-6308

Spans to extend DocIdSetIterator; was: SpansEnum, deprecate Spans

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 5.2, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      An alternative for Spans that looks more like PositionsEnum and adds two phase doc id iteration

      1. LUCENE-6308.patch
        408 kB
        Paul Elschot
      2. LUCENE-6308.patch
        349 kB
        Paul Elschot
      3. LUCENE-6308.patch
        341 kB
        Paul Elschot
      4. LUCENE-6308.patch
        97 kB
        Paul Elschot
      5. LUCENE-6308.patch
        84 kB
        Paul Elschot
      6. LUCENE-6308.patch
        31 kB
        Paul Elschot
      7. LUCENE-6308-changeapi.patch
        180 kB
        Paul Elschot
      8. LUCENE-6308-changeapi.patch
        176 kB
        Paul Elschot
      9. LUCENE-6308-changeapi.patch
        176 kB
        Paul Elschot
      10. LUCENE-6308-changeapi.patch
        170 kB
        Paul Elschot
      11. LUCENE-6308-changeapi.patch
        170 kB
        Paul Elschot
      12. LUCENE-6308-changeapi.patch
        165 kB
        Robert Muir

        Activity

        Hide
        Paul Elschot added a comment -

        This avoids the problem of Spans going to the next document while at the end of scoring the current doc, for LUCENE-6250.

        I tried to do this by adding the method Spans.nextAtCurrentDoc, but that was too invasive.

        The current patch is incomplete, it adds SpansEnum as a subclass of DocIdSetIterator with methods nextStartPosition and endPosition.
        It reimplements TermSpans as TermSpansEnum, and also adds SpansEnumScorer, SpansEnumWeight etc.
        The real work is in NearSpans, still to be done.

        The patch also deprecates Spans, SpanScorer, SpanWeight and SpanTermQuery, all in favour of added ...SpansEnum... counterparts.

        Since SpansEnum inherits from DocIdSetIterator, just like Scorer, the SpansEnumScorer is actually simpler than SpanScorer.

        Show
        Paul Elschot added a comment - This avoids the problem of Spans going to the next document while at the end of scoring the current doc, for LUCENE-6250 . I tried to do this by adding the method Spans.nextAtCurrentDoc, but that was too invasive. The current patch is incomplete, it adds SpansEnum as a subclass of DocIdSetIterator with methods nextStartPosition and endPosition. It reimplements TermSpans as TermSpansEnum, and also adds SpansEnumScorer, SpansEnumWeight etc. The real work is in NearSpans, still to be done. The patch also deprecates Spans, SpanScorer, SpanWeight and SpanTermQuery, all in favour of added ...SpansEnum... counterparts. Since SpansEnum inherits from DocIdSetIterator, just like Scorer, the SpansEnumScorer is actually simpler than SpanScorer.
        Hide
        Paul Elschot added a comment -

        TermSpansEnum only.

        Show
        Paul Elschot added a comment - TermSpansEnum only.
        Hide
        Paul Elschot added a comment -

        This could also be an alternative for LUCENE-2878

        Show
        Paul Elschot added a comment - This could also be an alternative for LUCENE-2878
        Hide
        Robert Muir added a comment -

        I really like the idea of bringing spans "up to speed" to support approximations and other features.

        I think this approach could fix performance issues with spans, because e.g. then they are able to intersect with nextDoc()/advance() without implicitly reading position(s).

        This could be a really nice step. if we get it in shape, we could consider just "fixing" the spans classes to use it, with api breaks. These wouldnt really impact most users, only people writing custom queries. And those people probably want to do it "the fast way"...

        Show
        Robert Muir added a comment - I really like the idea of bringing spans "up to speed" to support approximations and other features. I think this approach could fix performance issues with spans, because e.g. then they are able to intersect with nextDoc()/advance() without implicitly reading position(s). This could be a really nice step. if we get it in shape, we could consider just "fixing" the spans classes to use it, with api breaks. These wouldnt really impact most users, only people writing custom queries. And those people probably want to do it "the fast way"...
        Hide
        Paul Elschot added a comment -

        if we get it in shape

        The idea was to have this ready, including two phase iteration, for 6.0, but I did not yet find the possibility to indicate that in jira.

        with api breaks

        Developing it this way has the advantage that nothing breaks, and that testing is easy by duplicating the test code and then renaming the tested queries.
        The disadvantage is that to see the actual code changes one needs a diff over a rename.

        I have positional join (custom) queries at LUCENE-5627 and these would also benefit. I would prefer to have a deprecation time window to use SpansEnum there.
        Also renaming is simple enough to take the advantage.

        Anyway, once this is ready we can still decide to use the existing names and break the api.

        Show
        Paul Elschot added a comment - if we get it in shape The idea was to have this ready, including two phase iteration, for 6.0, but I did not yet find the possibility to indicate that in jira. with api breaks Developing it this way has the advantage that nothing breaks, and that testing is easy by duplicating the test code and then renaming the tested queries. The disadvantage is that to see the actual code changes one needs a diff over a rename. I have positional join (custom) queries at LUCENE-5627 and these would also benefit. I would prefer to have a deprecation time window to use SpansEnum there. Also renaming is simple enough to take the advantage. Anyway, once this is ready we can still decide to use the existing names and break the api.
        Hide
        Paul Elschot added a comment -

        I just had a quick look at the latest patch at LUCENE-2878, "aka. nuke spans".
        There is a lot more going on there so this maybe a small step, but this is not an alternative as I said earlier.

        Show
        Paul Elschot added a comment - I just had a quick look at the latest patch at LUCENE-2878 , "aka. nuke spans". There is a lot more going on there so this maybe a small step, but this is not an alternative as I said earlier.
        Hide
        Paul Elschot added a comment -

        Patch of 20150304.

        Adds SpansEnumNotQuery, SpansEnumOrQuery, SpansEnumMultiTermQueryWrapper and test cases. Still fails a test in TestBasics for SpansEnumNot.

        Includes LUCENE-6338.

        Show
        Paul Elschot added a comment - Patch of 20150304. Adds SpansEnumNotQuery, SpansEnumOrQuery, SpansEnumMultiTermQueryWrapper and test cases. Still fails a test in TestBasics for SpansEnumNot. Includes LUCENE-6338 .
        Hide
        Paul Elschot added a comment -

        Still to do: FieldMaskingSpanQuery, SpanNearQuery with NearSpansOrdered/Unordered, SpanFirstQuery, SpanPositionRangeQuery, and SpanPayloadCheckQuery.

        Show
        Paul Elschot added a comment - Still to do: FieldMaskingSpanQuery, SpanNearQuery with NearSpansOrdered/Unordered, SpanFirstQuery, SpanPositionRangeQuery, and SpanPayloadCheckQuery.
        Hide
        Paul Elschot added a comment -

        Even though this has another goal (avoiding use of positions as much as possible in spans), the SpansEnum here really is very similar to the PositionInterval iterator at LUCENE-2878. The main difference is that there is some state (begin/end positions) at LUCENE-2878, which is only available in iterating/access methods in SpansEnum here:

          /**
           * Returns the next start position for the current doc.
           * There is always at least one start/end position per doc.
           * After the last start/end position at the current doc this returns {@link NO_MORE_POSITIONS}.
           */
          public abstract int nextStartPosition() throws IOException;
        
          /** Returns the end position for the current begin position. */
          public abstract int endPosition();
         
        

        Since the code here is still very young and has this similarity, I would prefer the code here to somehow converge to LUCENE-2878.
        However, the state of the begin/end positions in LUCENE-2878 does not seem to allow convergence. Any suggestions?

        Show
        Paul Elschot added a comment - Even though this has another goal (avoiding use of positions as much as possible in spans), the SpansEnum here really is very similar to the PositionInterval iterator at LUCENE-2878 . The main difference is that there is some state (begin/end positions) at LUCENE-2878 , which is only available in iterating/access methods in SpansEnum here: /** * Returns the next start position for the current doc. * There is always at least one start/end position per doc. * After the last start/end position at the current doc this returns {@link NO_MORE_POSITIONS}. */ public abstract int nextStartPosition() throws IOException; /** Returns the end position for the current begin position. */ public abstract int endPosition(); Since the code here is still very young and has this similarity, I would prefer the code here to somehow converge to LUCENE-2878 . However, the state of the begin/end positions in LUCENE-2878 does not seem to allow convergence. Any suggestions?
        Hide
        Paul Elschot added a comment -

        Patch of 10 March 2015. Same as previous one, but this one passes ant test -Dtestcase=TestBasics and Test*Spans*.

        Show
        Paul Elschot added a comment - Patch of 10 March 2015. Same as previous one, but this one passes ant test -Dtestcase=TestBasics and Test*Spans*.
        Hide
        Paul Elschot added a comment -

        I'm slowly progressing with this.
        There is one change in the test code in TestSpans that nicely shows in a few lines what this is all about.
        It is the code for counting Spans and for counting SpansEnum (which inherits from DocIdSetIterator).
        Both are produced by a Span Not query:

          private int spanCount(String include, String exclude, int pre, int post) throws IOException {
            SpanTermQuery iq = new SpanTermQuery(new Term(field, include));
            SpanTermQuery eq = new SpanTermQuery(new Term(field, exclude));
            SpanNotQuery snq = new SpanNotQuery(iq, eq, pre, post);
            Spans spans = getSpansFromQuery(snq);
            
            int i = 0;
            while (spans.next()){
              i++;
            }
            return i;
          }
          
          private int spansEnumCount(String include, String exclude, int pre, int post) throws IOException {
            SpansEnumTermQuery iq = new SpansEnumTermQuery(new Term(field, include));
            SpansEnumTermQuery eq = new SpansEnumTermQuery(new Term(field, exclude));
            SpansEnumNotQuery senq = new SpansEnumNotQuery(iq, eq, pre, post);
            SpansEnum spans = getSpansEnumFromQuery(senq);
            
            int i = 0;
            if (spans != null) {
              while (spans.nextDoc() != SpansEnum.NO_MORE_DOCS){
                while (spans.nextStartPosition() != SpansEnum.NO_MORE_POSITIONS) {
                  i++;
                }
              }
            }
            return i;
          }
        
        Show
        Paul Elschot added a comment - I'm slowly progressing with this. There is one change in the test code in TestSpans that nicely shows in a few lines what this is all about. It is the code for counting Spans and for counting SpansEnum (which inherits from DocIdSetIterator). Both are produced by a Span Not query: private int spanCount( String include, String exclude, int pre, int post) throws IOException { SpanTermQuery iq = new SpanTermQuery( new Term(field, include)); SpanTermQuery eq = new SpanTermQuery( new Term(field, exclude)); SpanNotQuery snq = new SpanNotQuery(iq, eq, pre, post); Spans spans = getSpansFromQuery(snq); int i = 0; while (spans.next()){ i++; } return i; } private int spansEnumCount( String include, String exclude, int pre, int post) throws IOException { SpansEnumTermQuery iq = new SpansEnumTermQuery( new Term(field, include)); SpansEnumTermQuery eq = new SpansEnumTermQuery( new Term(field, exclude)); SpansEnumNotQuery senq = new SpansEnumNotQuery(iq, eq, pre, post); SpansEnum spans = getSpansEnumFromQuery(senq); int i = 0; if (spans != null ) { while (spans.nextDoc() != SpansEnum.NO_MORE_DOCS){ while (spans.nextStartPosition() != SpansEnum.NO_MORE_POSITIONS) { i++; } } } return i; }
        Hide
        Paul Elschot added a comment - - edited

        Patch of 19 March 2015, passes all tests.

        There is a lot of almost duplicated code, especially the new tests for SpansEnum that were basically copied and converted from the Spans tests. However, this was worthwhile, because these tests caught a few subtle bugs in the SpansEnum implementations.

        This uses ConjunctionDISI for SpansEnum near queries, so two phase should be doable.

        Renamed TestBasics to TestSpanBasics for easier testing.

        Also deprecates MultiSpansWrapper in favour of TestSpans.getSpansEnumForQuery which has a smaller implementation.

        Show
        Paul Elschot added a comment - - edited Patch of 19 March 2015, passes all tests. There is a lot of almost duplicated code, especially the new tests for SpansEnum that were basically copied and converted from the Spans tests. However, this was worthwhile, because these tests caught a few subtle bugs in the SpansEnum implementations. This uses ConjunctionDISI for SpansEnum near queries, so two phase should be doable. Renamed TestBasics to TestSpanBasics for easier testing. Also deprecates MultiSpansWrapper in favour of TestSpans.getSpansEnumForQuery which has a smaller implementation.
        Hide
        Paul Elschot added a comment -

        Patch of 21 March 2015, adds two phase doc id set iteration.
        Also add SpansEnum test in TestPositionIncrement.

        Show
        Paul Elschot added a comment - Patch of 21 March 2015, adds two phase doc id set iteration. Also add SpansEnum test in TestPositionIncrement.
        Hide
        Paul Elschot added a comment -

        The patch of 21 March 2015 does not have a SpansEnum version of PayloadNearQuery, it does contain PayloadSpansEnumUtil.

        Show
        Paul Elschot added a comment - The patch of 21 March 2015 does not have a SpansEnum version of PayloadNearQuery, it does contain PayloadSpansEnumUtil.
        Hide
        Paul Elschot added a comment - - edited

        Patch of 22 March 2015. Compared to previous patch, this also covers the search.payloads package and has better javadocs.

        This passes ant precommit.

        The two phase doc id iteration in here could be good but I'm not sure.
        It is in the NearSpansEnum and ConjunctionDISI classes.
        Could someone take a look at these to check?

        The payload loading really needs improvement, especially for the ordered near case, but that is another issue.

        Show
        Paul Elschot added a comment - - edited Patch of 22 March 2015. Compared to previous patch, this also covers the search.payloads package and has better javadocs. This passes ant precommit. The two phase doc id iteration in here could be good but I'm not sure. It is in the NearSpansEnum and ConjunctionDISI classes. Could someone take a look at these to check? The payload loading really needs improvement, especially for the ordered near case, but that is another issue.
        Hide
        Robert Muir added a comment -

        I really like this patch, it means spans have real cost() support, two-phase iteration, etc.

        I created an alternative version of it, where we just break the low level Spans api with your changes and we don't have to deprecate all the spans (since you have replacements for every query).

        So this version is less invasive and just "fixes" spans with your new code.

        I think this is ok? The classes are documented expert, and for most users, just using the queries we have, there are no changes that will impact them (its all internals). Queries will just get faster. Otherwise they have to change parsers and so on to generate the new queries.

        Show
        Robert Muir added a comment - I really like this patch, it means spans have real cost() support, two-phase iteration, etc. I created an alternative version of it, where we just break the low level Spans api with your changes and we don't have to deprecate all the spans (since you have replacements for every query). So this version is less invasive and just "fixes" spans with your new code. I think this is ok? The classes are documented expert, and for most users, just using the queries we have, there are no changes that will impact them (its all internals). Queries will just get faster. Otherwise they have to change parsers and so on to generate the new queries.
        Hide
        Robert Muir added a comment -

        By the way, at least i think this exercise may have found a bug. Now that new span logic is used everywhere, I hit some test failures in lucene/queryparser that look like this:

           [junit4] Suite: org.apache.lucene.queryparser.surround.query.Test03Distance
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=Test03Distance -Dtests.method=test3Example01 -Dtests.seed=87854C52BC77D8E3 -Dtests.locale=mt -Dtests.timezone=PRC -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   0.09s J2 | Test03Distance.test3Example01 <<<
           [junit4]    > Throwable #1: java.lang.NullPointerException
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([87854C52BC77D8E3:7DCE4F0323A56FA5]:0)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpansUnordered.startPosition(NearSpansUnordered.java:213)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$SpanQueue.lessThan(SpanOrQuery.java:155)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$SpanQueue.lessThan(SpanOrQuery.java:147)
           [junit4]    > 	at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:258)
           [junit4]    > 	at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:135)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery.getSpans(SpanOrQuery.java:187)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanNearQuery.getSpans(SpanNearQuery.java:128)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanNearQuery.getSpans(SpanNearQuery.java:128)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:92)
           [junit4]    > 	at org.apache.lucene.search.Weight.bulkScorer(Weight.java:127)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:540)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:347)
           [junit4]    > 	at org.apache.lucene.queryparser.surround.query.BooleanQueryTst.doTest(BooleanQueryTst.java:127)
           [junit4]    > 	at org.apache.lucene.queryparser.surround.query.Test03Distance.distanceTst(Test03Distance.java:76)
           [junit4]    > 	at org.apache.lucene.queryparser.surround.query.Test03Distance.distanceTest3(Test03Distance.java:242)
           [junit4]    > 	at org.apache.lucene.queryparser.surround.query.Test03Distance.test3Example01(Test03Distance.java:247)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
        
        Show
        Robert Muir added a comment - By the way, at least i think this exercise may have found a bug. Now that new span logic is used everywhere, I hit some test failures in lucene/queryparser that look like this: [junit4] Suite: org.apache.lucene.queryparser.surround.query.Test03Distance [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=Test03Distance -Dtests.method=test3Example01 -Dtests.seed=87854C52BC77D8E3 -Dtests.locale=mt -Dtests.timezone=PRC -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.09s J2 | Test03Distance.test3Example01 <<< [junit4] > Throwable #1: java.lang.NullPointerException [junit4] > at __randomizedtesting.SeedInfo.seed([87854C52BC77D8E3:7DCE4F0323A56FA5]:0) [junit4] > at org.apache.lucene.search.spans.NearSpansUnordered.startPosition(NearSpansUnordered.java:213) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$SpanQueue.lessThan(SpanOrQuery.java:155) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery$SpanQueue.lessThan(SpanOrQuery.java:147) [junit4] > at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:258) [junit4] > at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:135) [junit4] > at org.apache.lucene.search.spans.SpanOrQuery.getSpans(SpanOrQuery.java:187) [junit4] > at org.apache.lucene.search.spans.SpanNearQuery.getSpans(SpanNearQuery.java:128) [junit4] > at org.apache.lucene.search.spans.SpanNearQuery.getSpans(SpanNearQuery.java:128) [junit4] > at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:92) [junit4] > at org.apache.lucene.search.Weight.bulkScorer(Weight.java:127) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:540) [junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:347) [junit4] > at org.apache.lucene.queryparser.surround.query.BooleanQueryTst.doTest(BooleanQueryTst.java:127) [junit4] > at org.apache.lucene.queryparser.surround.query.Test03Distance.distanceTst(Test03Distance.java:76) [junit4] > at org.apache.lucene.queryparser.surround.query.Test03Distance.distanceTest3(Test03Distance.java:242) [junit4] > at org.apache.lucene.queryparser.surround.query.Test03Distance.test3Example01(Test03Distance.java:247) [junit4] > at java.lang.Thread.run(Thread.java:745)
        Hide
        Alan Woodward added a comment -

        Should NO_MORE_POSITIONS be something other than MAX_INTEGER here? On LUCENE-2878 we tried using -1 instead because of possible clashes with documents that genuinely have a position of that value, but that's being used for 'unpositioned' in this patch.

        Show
        Alan Woodward added a comment - Should NO_MORE_POSITIONS be something other than MAX_INTEGER here? On LUCENE-2878 we tried using -1 instead because of possible clashes with documents that genuinely have a position of that value, but that's being used for 'unpositioned' in this patch.
        Hide
        Paul Elschot added a comment -

        I took a quick look at the changeapi patch. It looks good to me, and just comparing the size with the size of the patch of 22 March is enough to see that it is ok to change the Spans api for this.
        But the -1 value for unpositioned needs to be CHANGES.txt for this, as there are users that use -1 now for other reasons. Another value for unpositioned might still be possible.

        I tested the patch of patch of 22 March also with Integer.MAX_VALUE - 1 for NO_MORE_POSITIONS to verify that it is not mixed up with NO_MORE_DOCS.

        One minor point: there are still a few SpansEnum tests left in the patch.

        As to the failing test in the queryparser: probably this is caused by an empty spans being null and/or the changed initialization of the SpanOr queue.
        An empty spans used to be a special object, but the iterator here needs a bit of its own state even when it is empty, so I changed to using null for an empty Spans.
        With the sentinel values for docs (-1 no doc yet) and positions (-1 unpositioned) the SpanOr queue initialization is also done earlier in the patch of 22 March.

        I'll try the changeapi patch of 23 March to check my guess about the SpanOr queue.

        Show
        Paul Elschot added a comment - I took a quick look at the changeapi patch. It looks good to me, and just comparing the size with the size of the patch of 22 March is enough to see that it is ok to change the Spans api for this. But the -1 value for unpositioned needs to be CHANGES.txt for this, as there are users that use -1 now for other reasons. Another value for unpositioned might still be possible. I tested the patch of patch of 22 March also with Integer.MAX_VALUE - 1 for NO_MORE_POSITIONS to verify that it is not mixed up with NO_MORE_DOCS. One minor point: there are still a few SpansEnum tests left in the patch. As to the failing test in the queryparser: probably this is caused by an empty spans being null and/or the changed initialization of the SpanOr queue. An empty spans used to be a special object, but the iterator here needs a bit of its own state even when it is empty, so I changed to using null for an empty Spans. With the sentinel values for docs (-1 no doc yet) and positions (-1 unpositioned) the SpanOr queue initialization is also done earlier in the patch of 22 March. I'll try the changeapi patch of 23 March to check my guess about the SpanOr queue.
        Hide
        Robert Muir added a comment -

        Also, I ran very quick and dirty benchmarks, there are no performance regressions here. I saw small (5-10%) improvements when i benchmarked. I also tried benchmarking when using filters, but saw the same 5-10% in that case.

        I can try to look into it more, maybe something is not yet optimal about the two-phase iteration (improvements were much higher for phrase/sloppy when a filter was involved). It could also easily be the fact that our benchmarking support for spans is very limited.

        Show
        Robert Muir added a comment - Also, I ran very quick and dirty benchmarks, there are no performance regressions here. I saw small (5-10%) improvements when i benchmarked. I also tried benchmarking when using filters, but saw the same 5-10% in that case. I can try to look into it more, maybe something is not yet optimal about the two-phase iteration (improvements were much higher for phrase/sloppy when a filter was involved). It could also easily be the fact that our benchmarking support for spans is very limited.
        Hide
        Paul Elschot added a comment -

        This 2nd changeapi patch fixes a bug in NearSpansUnordered that is in the 1st changeapi patch.

        It also removes a superfluous SpansEnum test method.
        The TestSpansEnum class is still there, it can be renamed to for example TestSpans2.

        Show
        Paul Elschot added a comment - This 2nd changeapi patch fixes a bug in NearSpansUnordered that is in the 1st changeapi patch. It also removes a superfluous SpansEnum test method. The TestSpansEnum class is still there, it can be renamed to for example TestSpans2.
        Hide
        Paul Elschot added a comment - - edited

        Nice to see that Test03Distance.test3Example01 test case again

        On the performance: I'd expect 10-20% on larger docs, so 5-10% is ok.
        Spans.skipTo() did work nicely with filters already it seems.

        For two-phase iteration I'd expect a bigger impact on nested SpanNear queries with a mix of frequent and infrequent terms over the subqueries.
        I don't know wether there are test cases for that.

        Show
        Paul Elschot added a comment - - edited Nice to see that Test03Distance.test3Example01 test case again On the performance: I'd expect 10-20% on larger docs, so 5-10% is ok. Spans.skipTo() did work nicely with filters already it seems. For two-phase iteration I'd expect a bigger impact on nested SpanNear queries with a mix of frequent and infrequent terms over the subqueries. I don't know wether there are test cases for that.
        Hide
        David Smiley added a comment -

        Can you please use ReviewBoard?

        Show
        David Smiley added a comment - Can you please use ReviewBoard?
        Hide
        Paul Elschot added a comment -

        changeapi patch of 24 March, compared to previous changeapi patch:
        More efficient queue init in NearSpansUnordered.
        Improve constructor for NearSpansOrdered.
        Both are for nested span near queries.

        Show
        Paul Elschot added a comment - changeapi patch of 24 March, compared to previous changeapi patch: More efficient queue init in NearSpansUnordered. Improve constructor for NearSpansOrdered. Both are for nested span near queries.
        Hide
        Paul Elschot added a comment -

        Where is the ReviewBoard?

        Show
        Paul Elschot added a comment - Where is the ReviewBoard?
        Hide
        David Smiley added a comment -

        https://reviews.apache.org/dashboard/
        It's ideal for cases where there are going to be multiple patch updates with peer review.

        Show
        David Smiley added a comment - https://reviews.apache.org/dashboard/ It's ideal for cases where there are going to be multiple patch updates with peer review.
        Hide
        Robert Muir added a comment -

        reviewboard sucks, ive used it at a previous job.

        please, lets not bikeshed this here, here we are working on spans.

        this can be discussed somewhere else.

        Show
        Robert Muir added a comment - reviewboard sucks, ive used it at a previous job. please, lets not bikeshed this here, here we are working on spans. this can be discussed somewhere else.
        Hide
        Alan Woodward added a comment -

        Sorry, I wasn't clear - you can't use Integer.MAX_VALUE as a marker for 'positions are exhausted', because it's a perfectly valid (if unlikely) value for a position to hold. That's why I ended up returning a boolean from nextInterval() on LUCENE-6226. Does that work here?

        Show
        Alan Woodward added a comment - Sorry, I wasn't clear - you can't use Integer.MAX_VALUE as a marker for 'positions are exhausted', because it's a perfectly valid (if unlikely) value for a position to hold. That's why I ended up returning a boolean from nextInterval() on LUCENE-6226 . Does that work here?
        Hide
        David Smiley added a comment -

        If we can't use Integer.MAX_VALUE as NO_MORE_POSITIONS then I propose we at least pave the way for this to become the case for a future Lucene version. It would be nice to have this for API consistency. In some of my work I have abused positions to hold data but in doing so I didn't assume negative values or Integer.MAX_VALUE would work. If I did and if it did work (I guess it would?), I wouldn't expect that kind of usage abuse to necessarily be compatible between minor Lucene releases as I already know I'm abusing something outside of its intended use.

        Show
        David Smiley added a comment - If we can't use Integer.MAX_VALUE as NO_MORE_POSITIONS then I propose we at least pave the way for this to become the case for a future Lucene version. It would be nice to have this for API consistency. In some of my work I have abused positions to hold data but in doing so I didn't assume negative values or Integer.MAX_VALUE would work. If I did and if it did work (I guess it would?), I wouldn't expect that kind of usage abuse to necessarily be compatible between minor Lucene releases as I already know I'm abusing something outside of its intended use.
        Hide
        David Smiley added a comment -

        Without the benefit of a code review tool; I'll put my comments here:

        NearSpans:

        • Why 3x in cost()? SpanNotQuery has 2x cost. Hmmm.

        NearSpansOrdered:

        • The payload collection in shrinkToAfterShortestMatch seems suspect. I think the loop over prevSpans should only assign a new ArrayList to possiblePayload if it was previously null, and otherwise should call clear. And this should happen wether or not prevSpans.isPayloadAvailable(). possiblePayload should be pluralized to more clearly indicate it’s a collection of payloads, and a comment could clarify we declare it out of the loop to re-use the instance. Heck; maybe it should be declared as a field of this spans along with possibleMatchPayloads set for re-use, and then always initialize them so long as collectPayloads==true. And, related, I think matchPayload should be an ArrayList not LinkedList — I’ve seen payload iteration code that optimized for RandomAccess so lets return one.
        • Thanks to ConjunctionDISI, it’s nice to see a net reduction in code in NearSpans*
        • Nice use of FilterSpans for NearSpansUnordered.SpansCell and elsewhere

        SpanPositionCheckQuery

        • hashCode & equals should be defined. Then it’s subclasses (some of which you have modified) can call super.equals & super.hashCode. Furthermore, as of March 4th, we needn’t add getClass().hashCode as Query.hashCode will do that.
        • PositionCheckSpans does not implement asTwoPhaseIterator. Might this be a problem? If the caller gets the TPI, then it's positions won't be filtered by acceptPosition(). Perhaps FilterSpans shouldn't delegate asTwoPhaseIterator because of subtle bugs like this.

        SpanNearQuery

        • getSpans: why the removal of the 1-clause and 0-clause optimized cases? And I noticed you didn’t propagate collectPayloads to NearSpansOrdered. (The Unordered case doesn’t need a boolean).

        SpanOrQuery

        • Spans.cost(): why not pre-compute as it was? cost() could get called a bunch during sorting of queries.

        SpanNotQuery

        • is asTwoPhraseIterator a TODO? I think there's value in it here.

        SpanScorer:

        • shouldn’t this be implementing asTwoPhaseIterator to delegate to the spans?

        Overall, this is clearly going to bring some great speed-ups to complex span queries. Nice job Paul! Longer term, I hope LUCENE-2878 (nuke Spans) comes to fruition so that we still don't have the Query dichotomy.

        Show
        David Smiley added a comment - Without the benefit of a code review tool; I'll put my comments here: NearSpans: Why 3x in cost()? SpanNotQuery has 2x cost. Hmmm. NearSpansOrdered: The payload collection in shrinkToAfterShortestMatch seems suspect. I think the loop over prevSpans should only assign a new ArrayList to possiblePayload if it was previously null, and otherwise should call clear. And this should happen wether or not prevSpans.isPayloadAvailable(). possiblePayload should be pluralized to more clearly indicate it’s a collection of payloads, and a comment could clarify we declare it out of the loop to re-use the instance. Heck; maybe it should be declared as a field of this spans along with possibleMatchPayloads set for re-use, and then always initialize them so long as collectPayloads==true. And, related, I think matchPayload should be an ArrayList not LinkedList — I’ve seen payload iteration code that optimized for RandomAccess so lets return one. Thanks to ConjunctionDISI, it’s nice to see a net reduction in code in NearSpans* Nice use of FilterSpans for NearSpansUnordered.SpansCell and elsewhere SpanPositionCheckQuery hashCode & equals should be defined. Then it’s subclasses (some of which you have modified) can call super.equals & super.hashCode. Furthermore, as of March 4th, we needn’t add getClass().hashCode as Query.hashCode will do that. PositionCheckSpans does not implement asTwoPhaseIterator. Might this be a problem? If the caller gets the TPI, then it's positions won't be filtered by acceptPosition(). Perhaps FilterSpans shouldn't delegate asTwoPhaseIterator because of subtle bugs like this. SpanNearQuery getSpans: why the removal of the 1-clause and 0-clause optimized cases? And I noticed you didn’t propagate collectPayloads to NearSpansOrdered. (The Unordered case doesn’t need a boolean). SpanOrQuery Spans.cost(): why not pre-compute as it was? cost() could get called a bunch during sorting of queries. SpanNotQuery is asTwoPhraseIterator a TODO? I think there's value in it here. SpanScorer: shouldn’t this be implementing asTwoPhaseIterator to delegate to the spans? Overall, this is clearly going to bring some great speed-ups to complex span queries. Nice job Paul! Longer term, I hope LUCENE-2878 (nuke Spans) comes to fruition so that we still don't have the Query dichotomy.
        Hide
        Paul Elschot added a comment -

        For the NO_MORE_POSITIONS: personally I actually slightly prefer to have a boolean method, however for consistency with the doc id iteration I implemented it as Integer.MAX_VALUE.
        So I'm fine with both solutions.

        About cost(): we need a good overall view for that, I did not try that for now.
        I changed OrSpans cost() because I suspected an initalization problem, I'll change that to precomputed again.

        The payload collection in NearSpansOrdered needs work. I'd prefer to leave that to a subclass for the ordered case.
        For now I'll check the collectPayloads passing in SpanNearQuery.

        SpanPositionCheckQuery hashCode and equals are indeed needed, iirc these were already lacking.

        SpanNear: when the subSpans are initialized and one of them is null, null is returned for the NearSpans... .
        I think SpanNear should have at least two subqueries, I did not expect that to break backward compatibility.
        The single subquery case could be restored, but is it really needed? When this happens the caller should be aware that the slop is going to be ignored, and in the ordered case there should be no match at all because the subSpans should not overlap then.

        SpanNotQuery and SpanScorer: as I understood there is no need for these two implement TwoPhaseIterator because that is only needed when there is some form of conjunction. SpanScorer only takes a single Spans, and for SpanNot advance() on the excluded spans only is good enough.

        Removing Spans/intervals completely is not an option: with (nested) proximity queries there is no way to avoid iterating over intervals/spans. We can make the scoring consistent between Spans and PhraseQuery, and PhraseQuery with getSpans() would also be good for performance of common cases.

        Show
        Paul Elschot added a comment - For the NO_MORE_POSITIONS: personally I actually slightly prefer to have a boolean method, however for consistency with the doc id iteration I implemented it as Integer.MAX_VALUE. So I'm fine with both solutions. About cost(): we need a good overall view for that, I did not try that for now. I changed OrSpans cost() because I suspected an initalization problem, I'll change that to precomputed again. The payload collection in NearSpansOrdered needs work. I'd prefer to leave that to a subclass for the ordered case. For now I'll check the collectPayloads passing in SpanNearQuery. SpanPositionCheckQuery hashCode and equals are indeed needed, iirc these were already lacking. SpanNear: when the subSpans are initialized and one of them is null, null is returned for the NearSpans... . I think SpanNear should have at least two subqueries, I did not expect that to break backward compatibility. The single subquery case could be restored, but is it really needed? When this happens the caller should be aware that the slop is going to be ignored, and in the ordered case there should be no match at all because the subSpans should not overlap then. SpanNotQuery and SpanScorer: as I understood there is no need for these two implement TwoPhaseIterator because that is only needed when there is some form of conjunction. SpanScorer only takes a single Spans, and for SpanNot advance() on the excluded spans only is good enough. Removing Spans/intervals completely is not an option: with (nested) proximity queries there is no way to avoid iterating over intervals/spans. We can make the scoring consistent between Spans and PhraseQuery, and PhraseQuery with getSpans() would also be good for performance of common cases.
        Hide
        Paul Elschot added a comment -

        2nd changeapi patch of 24 March 2015. Compared to the earlier one:

        Use cost() only once on subSpans in SpanOr.

        Add equals/hashCode to SpanPositionCheckQuery.

        Split NearSpansPayloadOrdered from NearSpansOrdered, use collectPayloads in SpanNearQuery.
        NearSpansOrdered now throws UOE for span collecting methods.

        Tests for core and queryparser modules pass.

        Show
        Paul Elschot added a comment - 2nd changeapi patch of 24 March 2015. Compared to the earlier one: Use cost() only once on subSpans in SpanOr. Add equals/hashCode to SpanPositionCheckQuery. Split NearSpansPayloadOrdered from NearSpansOrdered, use collectPayloads in SpanNearQuery. NearSpansOrdered now throws UOE for span collecting methods. Tests for core and queryparser modules pass.
        Hide
        Paul Elschot added a comment -

        we needn’t add getClass().hashCode as Query.hashCode will do that.

        Sorry, I missed that, and I also did not add equals/hashCode to the subclasses of SpanPositionCheckQuery.

        Show
        Paul Elschot added a comment - we needn’t add getClass().hashCode as Query.hashCode will do that. Sorry, I missed that, and I also did not add equals/hashCode to the subclasses of SpanPositionCheckQuery.
        Hide
        David Smiley added a comment -

        The pain of not using any review tool (no matter how imperfect any might be) is quite apparent to me with these patch updates and multi-user interaction. Let the pain continue? Rob; you could ignore the RB if so long as if the patches are here for you to do the perf testing. I don't mean to suggest we should use a review tool for every JIRA issue.

        SpanNear: when the subSpans are initialized and one of them is null, null is returned for the NearSpans... . I think SpanNear should have at least two subqueries, I did not expect that to break backward compatibility. The single subquery case could be restored, but is it really needed? When this happens the caller should be aware that the slop is going to be ignored, and in the ordered case there should be no match at all because the subSpans should not overlap then.

        I simply recognized it as a difference and wondered why. If some thing "should" not happen (you claim SpanNear should have at least 2 sub-queries) then an assertion or comment to that effect would be good. As to the particulars of this case, I'm not sure at the moment without further inspection.

        SpanNotQuery and SpanScorer: as I understood there is no need for these two implement TwoPhaseIterator because that is only needed when there is some form of conjunction. SpanScorer only takes a single Spans, and for SpanNot advance() on the excluded spans only is good enough.

        I believe it's useful for any aggregation, even over one so that the TPI benefits percolate across parent/child queries in aggregate. Adrien Grand please check me on this. Having said this I don't know why TPIs aren't propagated more generally; maybe a TODO?.

        On to my example: If any Query that wraps another Query (technically a Scorer wrapping another Scorer, or Span wrapping another Span here) fails to propagate the TwoPhaseIterator, then the benefit of the TPI is only localized to the nodes that directly support it (e.g. BooleanQuery & SpanNearQuery). Consider the case of a top level BooleanQuery wrapping two MUST clauses, one clause with a medium-frequency term, and the other clause being a SpanNearQuery between some other SpanQuery derivatives which include a low frequency term. With your patch's SpanScorer not supporting TPI, the top level BQ won't see a TPI so it will drive the query by the low-cost SpanQuery which will force it to visit positions for documents that could have been filtered by the top level term query's document list.

        Removing Spans/intervals completely is not an option: with (nested) proximity queries there is no way to avoid iterating over intervals/spans. We can make the scoring consistent between Spans and PhraseQuery, and PhraseQuery with getSpans() would also be good for performance of common cases.

        I could have been clearer. I simply mean I look forward to the dichotomy being gone or made easier. That is, today, a SpanNearQuery only accepts SpanQueries, which forces us to convert other queries to them (even TermQuery), which is a pain. Any way, that discussion should be left to LUCENE-2878 so I shouldn't have brought it up here.

        Show
        David Smiley added a comment - The pain of not using any review tool (no matter how imperfect any might be) is quite apparent to me with these patch updates and multi-user interaction. Let the pain continue? Rob; you could ignore the RB if so long as if the patches are here for you to do the perf testing. I don't mean to suggest we should use a review tool for every JIRA issue. SpanNear: when the subSpans are initialized and one of them is null, null is returned for the NearSpans... . I think SpanNear should have at least two subqueries, I did not expect that to break backward compatibility. The single subquery case could be restored, but is it really needed? When this happens the caller should be aware that the slop is going to be ignored, and in the ordered case there should be no match at all because the subSpans should not overlap then. I simply recognized it as a difference and wondered why. If some thing "should" not happen (you claim SpanNear should have at least 2 sub-queries) then an assertion or comment to that effect would be good. As to the particulars of this case, I'm not sure at the moment without further inspection. SpanNotQuery and SpanScorer: as I understood there is no need for these two implement TwoPhaseIterator because that is only needed when there is some form of conjunction. SpanScorer only takes a single Spans, and for SpanNot advance() on the excluded spans only is good enough. I believe it's useful for any aggregation, even over one so that the TPI benefits percolate across parent/child queries in aggregate. Adrien Grand please check me on this. Having said this I don't know why TPIs aren't propagated more generally; maybe a TODO?. On to my example: If any Query that wraps another Query (technically a Scorer wrapping another Scorer, or Span wrapping another Span here) fails to propagate the TwoPhaseIterator, then the benefit of the TPI is only localized to the nodes that directly support it (e.g. BooleanQuery & SpanNearQuery). Consider the case of a top level BooleanQuery wrapping two MUST clauses, one clause with a medium-frequency term, and the other clause being a SpanNearQuery between some other SpanQuery derivatives which include a low frequency term. With your patch's SpanScorer not supporting TPI, the top level BQ won't see a TPI so it will drive the query by the low-cost SpanQuery which will force it to visit positions for documents that could have been filtered by the top level term query's document list. Removing Spans/intervals completely is not an option: with (nested) proximity queries there is no way to avoid iterating over intervals/spans. We can make the scoring consistent between Spans and PhraseQuery, and PhraseQuery with getSpans() would also be good for performance of common cases. I could have been clearer. I simply mean I look forward to the dichotomy being gone or made easier. That is, today, a SpanNearQuery only accepts SpanQueries, which forces us to convert other queries to them (even TermQuery), which is a pain. Any way, that discussion should be left to LUCENE-2878 so I shouldn't have brought it up here.
        Hide
        Paul Elschot added a comment -

        I created some spinoff issues from this to allow easier progress here: LUCENE-6371, LUCENE-6372 and LUCENE-6373.
        That leaves Spans.cost(), the minimum number of subqueries for SpanNear, and NO_MORE_POSITIONS to be resolved here,

        Show
        Paul Elschot added a comment - I created some spinoff issues from this to allow easier progress here: LUCENE-6371 , LUCENE-6372 and LUCENE-6373 . That leaves Spans.cost(), the minimum number of subqueries for SpanNear, and NO_MORE_POSITIONS to be resolved here,
        Hide
        Paul Elschot added a comment -

        changeapi patch of 26 March 2015. Compared to previous changeapi patch:

        cost() functions return expected number of matching docs.

        Documents that SpanNearQuery should have at least 2 clauses.

        Leaves NO_MORE_POSITIONS at Integer.MAX_VALUE.

        Show
        Paul Elschot added a comment - changeapi patch of 26 March 2015. Compared to previous changeapi patch: cost() functions return expected number of matching docs. Documents that SpanNearQuery should have at least 2 clauses. Leaves NO_MORE_POSITIONS at Integer.MAX_VALUE.
        Hide
        Michael McCandless added a comment -

        +1 to the latest patch: it's a great step forward for spans. We can
        iterate more in the follow-on issues... I think we should commit what
        we have here now?

        The patch just needs a minor fix to NearSpans' TwoPhaseIterator now
        that the approximation is passed to the ctor.

        It's wonderful that Spans now extends DISI, just adding the position
        iteration API, and I like adopting the same -1 / Integer.MAX_VALUE
        sentinels we use for docs iteration.

        Show
        Michael McCandless added a comment - +1 to the latest patch: it's a great step forward for spans. We can iterate more in the follow-on issues... I think we should commit what we have here now? The patch just needs a minor fix to NearSpans' TwoPhaseIterator now that the approximation is passed to the ctor. It's wonderful that Spans now extends DISI, just adding the position iteration API, and I like adopting the same -1 / Integer.MAX_VALUE sentinels we use for docs iteration.
        Hide
        Alan Woodward added a comment -

        I don't think Integer.MAX_VALUE works though? What if you have a term that's indexed at that position? SpanTermQuery won't find it. I think we have to use a boolean here (or a different value like -2, but that makes comparisons between positions more complicated).

        Show
        Alan Woodward added a comment - I don't think Integer.MAX_VALUE works though? What if you have a term that's indexed at that position? SpanTermQuery won't find it. I think we have to use a boolean here (or a different value like -2, but that makes comparisons between positions more complicated).
        Hide
        Michael McCandless added a comment -

        What if you have a term that's indexed at that position?

        You cannot: it's a reserved value. Just like no doc id can be Integer.MAX_VALUE.

        And if you have 2.1B tokens in one document you are going to have other problems...

        We should fix indexing to detect if this happens and reject the document, but that should be a separate issue and I don't think it should block this one...

        Show
        Michael McCandless added a comment - What if you have a term that's indexed at that position? You cannot: it's a reserved value. Just like no doc id can be Integer.MAX_VALUE. And if you have 2.1B tokens in one document you are going to have other problems... We should fix indexing to detect if this happens and reject the document, but that should be a separate issue and I don't think it should block this one...
        Hide
        Robert Muir added a comment -

        Alan is correct. Mike, the issue is you don't need 2.1B tokens to hit this. And I don't know where "its a reserved value" comes from. At least nothing is reserving it, in indexwriter or elsewhere. Instead we just check for negatives/overflow.

        I've seen people use large posInc gaps between fields. This can make huge position numbers. Also if someone forgets clearAttributes the positions grow exponentially. Sure its bad, but for small docs i bet plenty of people have HUGE positions and don't realize it.

        Can we just change the constant to -2? Then the problem is solved.

        Show
        Robert Muir added a comment - Alan is correct. Mike, the issue is you don't need 2.1B tokens to hit this. And I don't know where "its a reserved value" comes from. At least nothing is reserving it, in indexwriter or elsewhere. Instead we just check for negatives/overflow. I've seen people use large posInc gaps between fields. This can make huge position numbers. Also if someone forgets clearAttributes the positions grow exponentially. Sure its bad, but for small docs i bet plenty of people have HUGE positions and don't realize it. Can we just change the constant to -2? Then the problem is solved.
        Hide
        Paul Elschot added a comment -

        From the current TermSpans.java:

          public int end() {
            return position + 1;
          }
        

        For position == Integer.MAX_VALUE that has never behaved well.

        Show
        Paul Elschot added a comment - From the current TermSpans.java: public int end() { return position + 1; } For position == Integer.MAX_VALUE that has never behaved well.
        Hide
        Paul Elschot added a comment - - edited

        The patch just needs a minor fix to NearSpans' TwoPhaseIterator now that the approximation is passed to the ctor.

        The git mirrors are a little behind again, see INFRA-9182.

        Show
        Paul Elschot added a comment - - edited The patch just needs a minor fix to NearSpans' TwoPhaseIterator now that the approximation is passed to the ctor. The git mirrors are a little behind again, see INFRA-9182 .
        Hide
        Michael McCandless added a comment -

        I've seen people use large posInc gaps between fields. This can make huge position numbers. Also if someone forgets clearAttributes the positions grow exponentially. Sure its bad, but for small docs i bet plenty of people have HUGE positions and don't realize it.

        I think such examples are really abuse cases? We shouldn't design
        for abuse cases...

        Also such users (jumping by enormous position increments each time)
        are unlikely to precisely hit Integer.MAX_VALUE ... they are more
        likely to overflow it.

        What I find compelling about Integer.MAX_VALUE is it makes priority
        queues that are merge-sorting N position iterators work "naturally",
        so they can simply compare by position, and only once all iterators
        are "on" a position must they check whether that position is
        Integer.MAX_VALUE. But if we use -2, then every time we .nextPosition
        each iterator we must check if it's ended.

        I do agree we should fix IW to detect this during indexing, and
        CheckIndex to detect it.

        I also like the consistency with NO_MORE_DOCS.

        Show
        Michael McCandless added a comment - I've seen people use large posInc gaps between fields. This can make huge position numbers. Also if someone forgets clearAttributes the positions grow exponentially. Sure its bad, but for small docs i bet plenty of people have HUGE positions and don't realize it. I think such examples are really abuse cases? We shouldn't design for abuse cases... Also such users (jumping by enormous position increments each time) are unlikely to precisely hit Integer.MAX_VALUE ... they are more likely to overflow it. What I find compelling about Integer.MAX_VALUE is it makes priority queues that are merge-sorting N position iterators work "naturally", so they can simply compare by position, and only once all iterators are "on" a position must they check whether that position is Integer.MAX_VALUE. But if we use -2, then every time we .nextPosition each iterator we must check if it's ended. I do agree we should fix IW to detect this during indexing, and CheckIndex to detect it. I also like the consistency with NO_MORE_DOCS.
        Hide
        Michael McCandless added a comment -

        From the current TermSpans.java:

        Hmm so if your index has a position = Integer.MAX_VALUE, things already won't work for SpanTermQuery... I wonder whether PhraseQuery works today when position is Integer.MAX_VALUE.

        Show
        Michael McCandless added a comment - From the current TermSpans.java: Hmm so if your index has a position = Integer.MAX_VALUE, things already won't work for SpanTermQuery... I wonder whether PhraseQuery works today when position is Integer.MAX_VALUE.
        Hide
        Paul Elschot added a comment -

        changeapi patch of 28 March 2015. Compared to the previous patch:

        Fix the javadocs to mention that Spans should not be used for terms indexed at position Integer.MAX_VALUE.

        Use the TwoPhaseIterator constructor from LUCENE-6198.

        Replace some asserts in constructors by Objects.requireNonNull, and simplify some hashCode/equals methods to no more test for null.

        Remove unused variable possibleMatchPayloads in NearSpansOrdered.

        Show
        Paul Elschot added a comment - changeapi patch of 28 March 2015. Compared to the previous patch: Fix the javadocs to mention that Spans should not be used for terms indexed at position Integer.MAX_VALUE. Use the TwoPhaseIterator constructor from LUCENE-6198 . Replace some asserts in constructors by Objects.requireNonNull, and simplify some hashCode/equals methods to no more test for null. Remove unused variable possibleMatchPayloads in NearSpansOrdered.
        Hide
        Michael McCandless added a comment -

        Thanks Paul Elschot patch looks great, +1 to commit...

        Show
        Michael McCandless added a comment - Thanks Paul Elschot patch looks great, +1 to commit...
        Hide
        Michael McCandless added a comment -

        If there are no objections, I'll commit this in a day or two ... thanks Paul Elschot!

        Show
        Michael McCandless added a comment - If there are no objections, I'll commit this in a day or two ... thanks Paul Elschot !
        Hide
        Adrien Grand added a comment -

        +1 to commit

        Show
        Adrien Grand added a comment - +1 to commit
        Hide
        David Smiley added a comment -

        +1 to commit, thanks Paul!

        Adrien Grand would you mind commenting on my example above – right below where I mentioned you to get your input?

        Show
        David Smiley added a comment - +1 to commit, thanks Paul! Adrien Grand would you mind commenting on my example above – right below where I mentioned you to get your input?
        Hide
        Adrien Grand added a comment -

        David Smiley You are right that spans should propagate two-phase approximations so that they can be used when intersected with another query (eg. a boolean FILTER clause). I believe LUCENE-6373 has been open for that reason, maybe we should be more explicit about the remaining work that we need to do. As far as I am concerned, I'm very happy to see spans becoming lazy, which was imo the greatest challenge to get two-phase iteration working on spans.

        Show
        Adrien Grand added a comment - David Smiley You are right that spans should propagate two-phase approximations so that they can be used when intersected with another query (eg. a boolean FILTER clause). I believe LUCENE-6373 has been open for that reason, maybe we should be more explicit about the remaining work that we need to do. As far as I am concerned, I'm very happy to see spans becoming lazy, which was imo the greatest challenge to get two-phase iteration working on spans.
        Hide
        ASF subversion and git services added a comment -

        Commit 1670272 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1670272 ]

        LUCENE-6308: cutover Spans to DISI, reuse ConjunctionDISI, use two-phased iteration

        Show
        ASF subversion and git services added a comment - Commit 1670272 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1670272 ] LUCENE-6308 : cutover Spans to DISI, reuse ConjunctionDISI, use two-phased iteration
        Hide
        ASF subversion and git services added a comment -

        Commit 1670273 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670273 ]

        LUCENE-6308: cutover Spans to DISI, reuse ConjunctionDISI, use two-phased iteration

        Show
        ASF subversion and git services added a comment - Commit 1670273 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1670273 ] LUCENE-6308 : cutover Spans to DISI, reuse ConjunctionDISI, use two-phased iteration
        Hide
        Michael McCandless added a comment -

        Thanks Paul Elschot.

        I'll open a follow-on issue for IW/CheckIndex to detect the now illegal position=Int.MAX_VALUE going forward...

        Show
        Michael McCandless added a comment - Thanks Paul Elschot . I'll open a follow-on issue for IW/CheckIndex to detect the now illegal position=Int.MAX_VALUE going forward...
        Hide
        ASF subversion and git services added a comment -

        Commit 1670278 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670278 ]

        LUCENE-6308: fix test bug

        Show
        ASF subversion and git services added a comment - Commit 1670278 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1670278 ] LUCENE-6308 : fix test bug
        Hide
        ASF subversion and git services added a comment -

        Commit 1670279 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670279 ]

        LUCENE-6308: woops: revert

        Show
        ASF subversion and git services added a comment - Commit 1670279 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1670279 ] LUCENE-6308 : woops: revert
        Hide
        ASF subversion and git services added a comment -

        Commit 1670280 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1670280 ]

        LUCENE-6308: fix test bug

        Show
        ASF subversion and git services added a comment - Commit 1670280 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1670280 ] LUCENE-6308 : fix test bug
        Hide
        ASF subversion and git services added a comment -

        Commit 1670281 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1670281 ]

        LUCENE-6308: fix test bug

        Show
        ASF subversion and git services added a comment - Commit 1670281 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1670281 ] LUCENE-6308 : fix test bug
        Hide
        Paul Elschot added a comment -

        My pleasure.
        I'll share the thanks with Robert Muir, who provided the first changeapi patch.

        Show
        Paul Elschot added a comment - My pleasure. I'll share the thanks with Robert Muir , who provided the first changeapi patch.
        Hide
        Anshum Gupta added a comment -

        Bulk close for 5.2.0.

        Show
        Anshum Gupta added a comment - Bulk close for 5.2.0.

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            1 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development