Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spin off from LUCENE-6308, see the comments there from around 23 March 2015.

      1. LUCENE-6371.patch
        73 kB
        Alan Woodward
      2. LUCENE-6371.patch
        86 kB
        Alan Woodward
      3. LUCENE-6371.patch
        68 kB
        Alan Woodward
      4. LUCENE-6371.patch
        64 kB
        Alan Woodward
      5. LUCENE-6371.patch
        91 kB
        Alan Woodward
      6. LUCENE-6371.patch
        113 kB
        Alan Woodward
      7. LUCENE-6371.patch
        109 kB
        Alan Woodward
      8. LUCENE-6371.patch
        109 kB
        Alan Woodward
      9. LUCENE-6371-5x.patch
        98 kB
        Alan Woodward
      10. LUCENE-6371-5x.patch
        181 kB
        Alan Woodward

        Activity

        Robert Muir added a comment -

        I propose we pull the current payload api (getPayload, hasPayloadAvailable) out of Spans and move it to a PayloadSpans and make the .payloads package just require that.
        .payloads already has duplicates of almost all functionality anyway (including term and near).

        This would let us simplify spans and make them efficient for the majority case. APIs with Collection<byte[]> per position are not efficient.

        Separately, fixing the payload API to be more efficient could then be an independent issue: hopefully proposing a new api that requires the user specify how payloads should be combined or something like that.

        Robert Muir added a comment -

        I also think the current payload support should be moved out of core lucene into sandbox/

        Michael McCandless added a comment -

        +1

        Paul Elschot added a comment -

        I just investigated this a little by collecting these:

          public class IndexedPayload {
            public int position;      // position of the term in the document
            public Term term;         // the matched term
            public BytesRef bytesRef; // the payload bytes for this position
          }
        

        But now I'm realizing that the problem here is that we don't know in advance which indexed information is available and should be provided for each term: the term itself, its position, payload and/or offsets.
        This is similar to a TokenStream and its attributes, so why not provide a TokenStream instead?
        This would boil down to providing a subset (matching the query) of the TokenStream used to index the document.
        The PrefillTokenStream of LUCENE-5687 could maybe also be used for this.

        The question is: could that work well?
        Would such a TokenStream be good for highlighting, too?

        Paul Elschot added a comment -

        See also LUCENE-6473

        Alan Woodward added a comment -

        I've been playing around with various APIs for this, and I think this one works reasonably well.

        Spans.isPayloadAvailable() and getPayload() are replaced with a collect() method that takes a SpanCollector. If you want to get payloads from a Spans, you do the following:

        PayloadSpanCollector collector = new PayloadSpanCollector();
        while (spans.nextStartPosition() != NO_MORE_POSITIONS) {
          collector.reset();
          spans.collect(collector);
          doSomethingWith(collector.getPayloads());
        }
        

        The actual job of collecting information from postings lists is devolved to the collector itself (via SpanCollector.collectLeaf(), called from TermSpans.collect()).

        The API is made slightly complicated by the need to buffer collected information in NearOrderedSpans, because the algorithm there moves child spans on eagerly when finding the smallest possible match, so by the time collect() is called we're out of position. This is dealt with using a BufferedSpanCollector, with collectCandidate(Spans) and accept() methods. The default (No-op) collector has a no-op implementation of this, which should get optimized away by HotSpot, meaning that we don't need to have separate implementations for collecting and non-collecting algorithms, and can do away with PayloadNearOrderedSpans.

        This patch also moves the PayloadCheck queries to the .payloads package, which tidies things up a bit.

        All tests pass.
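
        As a self-contained sketch of the loop above, the mock below mirrors the shapes described in this comment. MockSpans and the simplified collectLeaf(byte[]) signature are illustrative stand-ins, not the real Lucene types; in the patch the leaf callback is SpanCollector.collectLeaf(), fed from TermSpans.collect() with postings data.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified mirror of the proposed API; the real Lucene signatures differ.
interface SpanCollector {
    void reset();
    void collectLeaf(byte[] payload); // stand-in for the real leaf callback
}

// Gathers payloads for the current match, like the patch's PayloadSpanCollector.
class PayloadSpanCollector implements SpanCollector {
    private final List<byte[]> payloads = new ArrayList<>();
    @Override public void reset() { payloads.clear(); }
    @Override public void collectLeaf(byte[] payload) { payloads.add(payload); }
    public List<byte[]> getPayloads() { return payloads; }
}

// Toy Spans: each position carries exactly one payload.
class MockSpans {
    static final int NO_MORE_POSITIONS = Integer.MAX_VALUE;
    private final byte[][] payloads;
    private int pos = -1;
    MockSpans(byte[][] payloads) { this.payloads = payloads; }
    int nextStartPosition() {
        pos++;
        return pos < payloads.length ? pos : NO_MORE_POSITIONS;
    }
    // The Spans pushes its data into whatever collector it is handed.
    void collect(SpanCollector collector) { collector.collectLeaf(payloads[pos]); }
}

public class CollectDemo {
    public static void main(String[] args) {
        MockSpans spans = new MockSpans(new byte[][] { {1}, {2} });
        PayloadSpanCollector collector = new PayloadSpanCollector();
        while (spans.nextStartPosition() != MockSpans.NO_MORE_POSITIONS) {
            collector.reset();        // per-match state, cleared each position
            spans.collect(collector);
            System.out.println(collector.getPayloads().size());
        }
    }
}
```

        The point of the inversion is that a non-collecting search passes a no-op collector and pays nothing per position.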

        David Smiley added a comment -

        I really like this design, because it enables one to build a highlighter that is accurate (so-called “query debugging”). That’s a huge bonus I wasn’t expecting from this patch (based on the issue title/description). But I think something is missing — SpanCollector.collectLeaf doesn’t provide access to the SpanQuery or perhaps the Term that is being collected.

        Might SpanCollector.DEFAULT be renamed to NO_OP? Same for BufferedSpanCollector.NO_OP. I think NO_OP is more clear as to what this implementation does.

        PayloadSpanCollector should use BytesRefArray instead of an ArrayList<byte[]>; and it can return this from getPayloads()

        What is the purpose of the start & end position params to collectLeaf()? No implementation uses them (on consumer or implementer side) and I'm not sure how they might be used.

        Alan Woodward added a comment -

        Updated patch:

        • collectLeaf() now takes PostingsEnum and Term
        • the default impls are renamed to NO_OP

        Changing from Collection<byte[]> to BytesRefArray is a great idea, but I'd like to do that in a separate issue, as that affects the external SpanQuery API a fair amount. This patch currently only changes internals.

        Paul Elschot added a comment -

        The patch removes the duplication of NearSpans(Payload)Ordered, which was only introduced to allow better performance when not collecting spans. Removing this duplication while keeping the performance is very nice.
        I also like the approach of leaving payload collection to a SpanCollector, and I would not mind moving from Collection<byte[]> to BytesRefArray right away.

        Alan Woodward added a comment -

        Updated with a highlighter fix as well.

        There's one change in behaviour here with respect to payload collection, in that previously a TermSpans would only read a payload once. This seems a bit odd to me, and it meant that, for example, UnorderedNearQueries with overlapping matches would sometimes read incomplete payloads. I've had to change PositionIncrementTest to take this change into account.

        I'll look at benchmarking this properly with luceneutil.

        Alan Woodward added a comment - edited
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
                        Wildcard     1407.35     (12.4%)     1334.61     (15.1%)   -5.2% ( -29% -   25%)
                         Prefix3     1987.02     (16.6%)     1894.70     (14.2%)   -4.6% ( -30% -   31%)
                          IntNRQ      566.03     (11.0%)      544.01     (14.8%)   -3.9% ( -26% -   24%)
                      AndHighMed     1158.92      (8.9%)     1118.93     (11.1%)   -3.5% ( -21% -   18%)
                          Fuzzy2       55.93     (12.5%)       54.23     (10.4%)   -3.0% ( -23% -   22%)
                      OrHighHigh      615.32      (8.2%)      598.05      (7.5%)   -2.8% ( -17% -   14%)
                       OrHighLow      751.47      (7.1%)      736.36      (7.6%)   -2.0% ( -15% -   13%)
                        HighTerm     2735.23     (11.3%)     2683.84      (8.4%)   -1.9% ( -19% -   20%)
                HighSloppyPhrase      450.25     (17.9%)      442.55     (14.0%)   -1.7% ( -28% -   36%)
                          Fuzzy1      217.03      (4.7%)      213.65      (4.6%)   -1.6% ( -10% -    8%)
                         MedTerm     2938.48      (6.9%)     2898.85     (10.1%)   -1.3% ( -17% -   16%)
                        PKLookup      223.36     (13.1%)      220.68     (15.6%)   -1.2% ( -26% -   31%)
                         LowTerm     3151.97      (7.9%)     3121.46      (9.4%)   -1.0% ( -16% -   17%)
                         Respell      270.39      (6.3%)      268.33      (7.7%)   -0.8% ( -13% -   14%)
                 LowSloppyPhrase      926.43      (8.3%)      920.97      (5.9%)   -0.6% ( -13% -   14%)
                       MedPhrase     1011.40      (6.5%)     1006.45      (5.5%)   -0.5% ( -11% -   12%)
                     MedSpanNear      694.00     (16.2%)      691.67     (10.4%)   -0.3% ( -23% -   31%)
                       OrHighMed      646.39      (7.9%)      644.37      (6.5%)   -0.3% ( -13% -   15%)
                      HighPhrase      709.21      (5.9%)      710.47      (7.2%)    0.2% ( -12% -   14%)
                      AndHighLow     1303.45      (6.3%)     1308.80     (12.4%)    0.4% ( -17% -   20%)
                       LowPhrase      643.77      (8.6%)      646.81     (12.9%)    0.5% ( -19% -   24%)
                     AndHighHigh      949.19      (9.0%)      956.77      (8.2%)    0.8% ( -14% -   19%)
                    HighSpanNear      624.54     (16.6%)      636.03     (13.2%)    1.8% ( -24% -   38%)
                 MedSloppyPhrase     1006.97      (7.0%)     1035.01      (9.1%)    2.8% ( -12% -   20%)
                     LowSpanNear      986.22      (9.4%)     1050.32      (8.7%)    6.5% ( -10% -   27%)
        

        Any variation here seems to be just noise. I think this is ready?

        David Smiley added a comment -

        Seems like noise.

        Can you please answer this question?:

        What is the purpose of the start & end position params to collectLeaf()? No implementation uses them (on consumer or implementer side) and I'm not sure how they might be used.

        Alan Woodward added a comment -

        They were left over from a different API design that didn't work out. They're gone in the latest patch.

        David Smiley added a comment -

        Ok; I'm +1 to this patch.

        I guess it's "progress, not perfection" on not moving to BytesRefArray yet.

        Paul Elschot added a comment -

        NearSpansOrdered has this comment in shrinkToAfterShortestMatch():

              /* Do not break on (matchSlop > allowedSlop) here to make sure
               * that on return the first subSpans has nextStartPosition called.
               */
        

        That also implies that no payloads need to be collected any more once matchSlop > allowedSlop.
        I don't know whether that will actually help a lot in practice, but if someone wants to give it a try...

        ASF subversion and git services added a comment -

        Commit 1680205 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1680205 ]

        LUCENE-6371: Add collection API to Spans, remove payload methods

        ASF subversion and git services added a comment -

        Commit 1680206 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1680206 ]

        LUCENE-6371: Update CHANGES.txt

        ASF subversion and git services added a comment -

        Commit 1680208 from Alan Woodward in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1680208 ]

        LUCENE-6371: Add collection API to Spans, remove payload methods

        Alan Woodward added a comment -

        Thanks for the reviews, everyone!

        I've marked the new classes and methods as lucene.experimental, rather than moving to the sandbox - if anyone feels strongly about that, maybe it could be done in a follow up issue.

        ASF subversion and git services added a comment -

        Commit 1680220 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1680220 ]

        LUCENE-6371: Fix SpanPayloadCheckQuery import.

        Adrien Grand added a comment -

        I was just trying to update some Spans code and had some compile errors because of this new span collection API. Nothing bad so far but this made me read more about the new APIs and I'm a bit worried about the introduced complexity (SpanCollector, buffer(), bufferedCollector(), etc.). Maybe that's just me not being familiar enough with spans but is the introduced complexity worth the improvements? I have not looked into details yet but if we need more complex handling for only some queries maybe this should move to sandbox as suggested above to keep at least core simple?

        Robert Muir added a comment -

        Adrien, I have similar concerns. Actually Alan cleaned things up quite a bit on LUCENE-6494, so I left my comments there.

        I think it should either be a blocker, or we back out this change from just the 5.x branch and do it for 5.3 so it has time to iterate and get better.

        Anshum Gupta added a comment -

        I'll revert r1680208 and r1680220 from branch_5x and change the fix version to 5.3 for this.

        Robert Muir added a comment -

        Reopening, see LUCENE-6494

        ASF subversion and git services added a comment -

        Commit 1680978 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1680978 ]

        LUCENE-6371, LUCENE-6466: back out from 5.2, see https://issues.apache.org/jira/browse/LUCENE-6494

        Alan Woodward added a comment -

        Here's a patch taking into account all the comments here and on LUCENE-6494.

        • SpanCollector becomes an interface again, so payload collection is entirely defined in the .payloads package
        • BufferedSpanCollector is removed, replaced by a simple array of SpanCollectors in NearSpansOrdered. SpanCollector has two methods to deal with this, newSubCollectors() and collectedComposite(), to create and then replay.
        • SpanCollectors are passed through in getSpans(). A null passed here means no collection, and there's a default getSpans() call on SpanWeight that always passes a null collector.
        • I've removed SpanSimilarity, in favour of passing a map of Terms to TermContexts to the SpanWeight constructor. If this is null, then scoring isn't required; if not, then SpanWeight builds a SimScorer and passes that to its scorer.
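
        The sub-collector scheme in the second bullet might be sketched as below. Only the method names newSubCollectors() and collectedComposite() come from this comment; the signatures, the byte[] leaf payload, and the PayloadCollector class are assumptions made for a self-contained mock, not the real Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the interface shape described above.
interface SpanCollector {
    void collectLeaf(byte[] payload);
    // One collector per child of a NearSpansOrdered, so each subspan can be
    // collected eagerly while the algorithm is still shrinking the match.
    SpanCollector[] newSubCollectors(int count);
    // Replay the children's buffered state into this collector once the
    // final match window is known.
    void collectedComposite(SpanCollector[] subs);
}

class PayloadCollector implements SpanCollector {
    final List<byte[]> payloads = new ArrayList<>();
    @Override public void collectLeaf(byte[] payload) { payloads.add(payload); }
    @Override public SpanCollector[] newSubCollectors(int count) {
        SpanCollector[] subs = new SpanCollector[count];
        for (int i = 0; i < count; i++) subs[i] = new PayloadCollector();
        return subs;
    }
    @Override public void collectedComposite(SpanCollector[] subs) {
        for (SpanCollector s : subs) payloads.addAll(((PayloadCollector) s).payloads);
    }
}

public class SubCollectorDemo {
    public static void main(String[] args) {
        PayloadCollector top = new PayloadCollector();
        // NearSpansOrdered would hold one sub-collector per child span:
        SpanCollector[] subs = top.newSubCollectors(2);
        subs[0].collectLeaf(new byte[]{1});
        subs[1].collectLeaf(new byte[]{2});
        // ...after the smallest match is found, replay in order:
        top.collectedComposite(subs);
        System.out.println(top.payloads.size());
    }
}
```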
        Robert Muir added a comment -

        I've removed SpanSimilarity, in favour of passing a map of Terms to TermContexts to the SpanWeight constructor. If this is null, then scoring isn't required; if not, then SpanWeight builds a SimScorer and passes that to its scorer.

        I still don't understand what the purpose here was/is. We already have a parameter needsScores for Query.createWeight. Why must there be another way to do the same thing? Even a null seems like duplication of what already exists.

        Alan Woodward added a comment -

        It's because of the way scoring works on Spans. We build a span tree, calling .getSpans() on the various weights as we go down the hierarchy, but scoring is only done by the top level, so we only want to build a Similarity in the top-level weight. And because the Similarity is built in the constructor, we need to collect all the terms and termcontexts of the various leaves before the Weight itself is built, so we can't just pass needsScores.

        Ideally scoring would be done on the Spans themselves (making them even more like just a specialized Scorer), but that's a bigger change.

        Robert Muir added a comment -

        But what does it buy? I don't like having two apis to accomplish the same thing. In this case i don't think we really save anything.

        Alan Woodward added a comment -

        I'm not sure I'm following what you mean by having two APIs. SpanQuery.createWeight() has exactly the same signature as Query.createWeight() - replacing needsScores with the termcontexts map in the SpanWeight constructor just allows us to pass in the needed information to build the Similarity at the same time as indicating whether or not it should be built at all.

        Robert Muir added a comment -

        Sorry, I am not very clear. I don't understand why SpanWeight needs to take nulls, or TermContext stuff at all.

        Why can't it work like today? If some Span* impl needs to know when scoring is not needed, add a boolean needsScores to SpanWeight's ctor, rather than some magic null value. The TermContexts map is an impl detail, not an API. Let's not let it become one; it's too messy for that.

        Alan Woodward added a comment -

        Ah, ok. This is more to do with LUCENE-6466. Moving .getSpans() and .extractTerms() to SpanWeight means that we can't collect the termcontexts in the constructor any more, because we can't call them on a partially constructed object. And the TermContexts map was already in the API, but it was in getSpans() rather than in the SpanWeight constructor. You're right that it's messy though. I'm just not sure there's a cleaner way of doing it that doesn't involve completely changing how the SpanScorer works.

        Robert Muir added a comment -

        But that does not explain the goal. I want to know the reasoning, what is the tradeoff, what are we gaining by having LUCENE-6466 changes? Thats the big thing I am missing.

        Are there performance improvements in the benchmark? Otherwise I would rather not have stuff like SpanSimilarity or null TermContexts maps in ctors, unless its really needed for some reason.

        Alan Woodward added a comment -

        It's the same logic as https://issues.apache.org/jira/browse/LUCENE-6425 really. Spans should only be pulled from a SpanQuery after it has been rewritten against a searcher, so it makes sense that they should be on SpanWeight rather than SpanQuery. And it simplifies pulling Spans for use in things like highlighting (or payload collection, as here) - before you had to rewrite, extract terms, build the termcontexts map and then call getSpans(); now you just call IndexSearcher.getNormalizedWeight(), cast to a SpanWeight and call getSpans.

        Robert Muir added a comment -

        Well, the only span query it impacts is SpanMultiTermQueryWrapper, right? I'm to blame for that monstrosity; I just wanted to clean up the crazy slow custom regex queries at the time, as they were hard to maintain.

        But maybe we should move it to sandbox, where it's just documented to be iffy with highlighting. I dunno, I see the spans as being for the expert proximity use case, and I don't think it's a great idea to have that thing in core.

        Alan Woodward added a comment -

        I think it's still useful though - I use it all the time! It would be nice if you could restrict the number of SpanOr clauses it rewrites to, but that's a separate issue.

        If you really think that moving .getSpans() and .extractTerms() to SpanWeight doesn't gain anything, then I can back it out. But I think it does simplify the API and brings it more into line with our other standard queries. And I really don't see that exposing the termcontexts map on the SpanWeight constructor is any worse than exposing it directly in .getSpans(). In fact, I'd say that it's hiding it better - very few users of lucene are going to be looking at SpanWeights, as they're an implementation detail, but anyone using an IDE is going to be shown SpanQuery.getSpans() when they try and autocomplete on a SpanQuery object, and it's not something that most users need to worry about.

        Robert Muir added a comment -

        I think it's still useful though - I use it all the time!

        Yeah, but it's slow with no easy chance of ever being faster. There is no simple bitset rewrite here like there is for other multiterm queries. Additionally, it has all the downsides of an enormous boolean query, but with proximity to boot: and this is very real, even simple stuff like 1-2 KB of RAM consumption per term due to additional decompression buffers for prox. Maybe in the future you could optionally index prefix terms, but I can't imagine merging proximity etc. into a prefix field for fully-indexed fields as a default; it seems complicated, slow and space-consuming.

        It would be nice if you could restrict the number of SpanOr clauses it rewrites to, but that's a separate issue.

        +1, that is a great idea. We should really both do that and also add warnings to the javadocs about inefficiency. It has none today!

        If you really think that moving .getSpans() and .extractTerms() to SpanWeight doesn't gain anything, then I can back it out. But I think it does simplify the API and brings it more into line with our other standard queries.

        I totally agree it has the value of consistency with other queries. But some of the APIs trying to do this are fairly complicated, yet at the same time still not really working: see below for more explanation.

        And I really don't see that exposing the termcontexts map on the SpanWeight constructor is any worse than exposing it directly in .getSpans(). In fact, I'd say that it's hiding it better - very few users of lucene are going to be looking at SpanWeights, as they're an implementation detail, but anyone using an IDE is going to be shown SpanQuery.getSpans() when they try and autocomplete on a SpanQuery object, and it's not something that most users need to worry about.

        It's actually terrible already: the motivation for this stuff was to try to speed up the turtle in question, SpanMultiTermQuery. The reason this stuff was exposed is that it could bring some relief to such crazy queries, by visiting each term in the term dictionary fewer than 3 times (rewrite, weight/idf, postings). But this was never quite right, for two reasons:

        • Leniency: We can't enforce we are doing the performant thing because creation of weight/idf uses extractTerms(). So the SpanTermWeight inside the exclude portion of a SpanNot suddenly sees an unexpected term it has no termstate for. Maybe patches here removed this problem, but forgot to fix the leniency in SpanTermWeight, as I see at least the code comment is gone.
        • Incomplete: SpanMultiTermQueryWrapper still isn't reusing the termcontext from rewrite(), somehow passing it down to the rewritten spans. So the whole ugly thing isn't even totally working; it's just reducing the number of visits to the term dictionary from 3 down to 2, but it is stupid that it is not 1.
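[Editor's note: the "3 visits down to 1" point above can be illustrated with a toy sketch in plain Java. This is not Lucene's API; all names here (seek, TermStateReuse) are hypothetical, and the "term dictionary" is just a map with a seek counter. The idea is simply that rewrite caches the looked-up state and later consumers reuse it instead of seeking again.]

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of term-state reuse: each dictionary lookup counts as a "seek".
public class TermStateReuse {
    static int seeks = 0;

    // Stands in for a term dictionary lookup; every call is one seek.
    static Integer seek(Map<String, Integer> dict, String term) {
        seeks++;
        return dict.get(term);
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("lucene", 42);

        // Without reuse: rewrite() seeks, then weight/postings seek again.
        seeks = 0;
        seek(dict, "lucene");                 // rewrite
        seek(dict, "lucene");                 // weight/postings
        System.out.println("without reuse: " + seeks + " seeks");

        // With reuse: rewrite() seeks once and passes the state along.
        seeks = 0;
        Integer state = seek(dict, "lucene"); // rewrite caches the state
        Integer reused = state;               // weight/postings reuse it
        System.out.println("with reuse: " + seeks + " seeks, state=" + reused);
        // prints "with reuse: 1 seeks, state=42"
    }
}
```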
        Robert Muir added a comment -

        By the way, those two go together. It's hard to fix that slow query to do the right thing and pass along contexts from rewrite() while keeping the leniency; it's too scary and undertested.

        At the same time, we don't want this scary SpanMultiTermQuery to corrupt all of our span APIs. The optimization in question won't fix its fundamental performance problems, and it's basically the only query with a problem.

        These are the reasons why the original termcontext stuff for spans was a half-ass implementation: all those problems, and very ugly as well, but relatively simple and contained. The half-ass implementation worked completely (seeks reduced from 2 to 1) for all normal span queries, just not the crazy SpanMultiTermQuery. So it was a tradeoff for simplicity, and it works well for all the regular spans use cases, so they were more in line with e.g. sloppyphrasequery and so on.

        Alan Woodward added a comment -

        I think leniency should be fixed now, because the TermContext for each leaf is built by SpanTermQuery.createWeight(), and then collected for IDF via a new SpanWeight.extractTermContexts() method, rather than being built by the parent Weight via extractTerms(). So there should only be one visit to the terms dictionary per term in normal use.

        There's still an extra visit in SpanMTQWrapper, but I think we can fix that by adding a SpanTermQuery(Term, TermContext) constructor, like we have with the standard TermQuery. Maybe we should carry this over to LUCENE-6466, as here it's getting mixed up with the span collection API, which is a separate thing. I'll put up a patch.

        Robert Muir added a comment -

        There's still an extra visit in SpanMTQWrapper, but I think we can fix that by adding a SpanTermQuery(Term, TermContext) constructor, like we have with the standard TermQuery. Maybe we should carry this over to LUCENE-6466, as here it's getting mixed up with the span collection API, which is a separate thing. I'll put up a patch.

        OK, this sounds good to me, because it would at least be consistent with TermQuery.

        Maybe we should carry this over to LUCENE-6466, as here it's getting mixed up with the span collection API, which is a separate thing. I'll put up a patch.

        OK I agree, lets not try to tackle it all at once in one patch. Lets just fix trunk until we are happy on whatever issues we need. Then we backport everything to 5.x for 5.3 here.

        I have the feeling it's really not that far away (your cleanups here already addressed a lot of my concerns), but it would be good to make sure the API changes support what we need. Fixing this MTQ termcontext stuff would be a great improvement.

        Just as a reminder I am still concerned about LUCENE-6495, which is test failures introduced from LUCENE-6466 (somehow scoring is changed, and only when using java 7!!!). This makes life tricky because trunk requires java 8 so we can't easily dig in. But maybe we can just do a lot of beasting before backporting the whole thing and try to figure that one out at that time.

        Alan Woodward added a comment -

        Could it be because SpanWeight was previously using a TreeMap to collect terms, which was enforcing an ordering? I'm a bit confused by how it would affect things, though, because the test that failed was running the exact same query in different ways, which would suggest that Java 7 was iterating over the exact same map non-deterministically.

        Robert Muir added a comment -

        Definitely sounds like a possibility, though usually those hashmap ordering bugs never reproduce for me, because we can't tell the JVM the "seed". In this case they reproduce at least. The problem in those tests is that scores get inconsistent with explains, and that is a strange way for it to show up.
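[Editor's note: the ordering difference the two comments above hinge on is easy to demonstrate with the JDK alone; this is a generic illustration, not Lucene code. TreeMap guarantees sorted-key iteration on every JVM, while HashMap's iteration order is unspecified, which is one way identical queries summing floats in map order can score differently between JVM versions.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// HashMap vs TreeMap iteration order.
public class MapOrdering {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList("span", "near", "ordered", "query");

        Map<String, Integer> hashed = new HashMap<>();
        Map<String, Integer> sorted = new TreeMap<>();
        for (String t : terms) {
            hashed.put(t, t.length());
            sorted.put(t, t.length());
        }

        // TreeMap: guaranteed sorted order, identical on every JVM.
        System.out.println("treemap: " + sorted.keySet());
        // prints "treemap: [near, ordered, query, span]"

        // HashMap: order is unspecified and may differ between JVM versions,
        // so order-dependent accumulation over it is non-deterministic.
        System.out.println("hashmap: " + hashed.keySet());
    }
}
```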

        Alan Woodward added a comment -

        Patch updated to trunk.

        Alan Woodward added a comment -

        I'd like to commit this today, and backport this and LUCENE-6466 to 5.x (with some beasting first to see if changing to an ordered map has fixed LUCENE-6495)

        Adrien Grand added a comment -

        It seems like the latest patch only contains changes for lucene/core (which makes e.g. lucene/test-framework fail to compile). Also, NearSpansOrdered still mentions BufferedSpanCollector in the javadocs.

        Given that NearSpansOrdered is to blame for the API complexity here, since it is the only Spans impl that consumes sub spans eagerly, should we consider removing it entirely?

        Alan Woodward added a comment -

        The patch again, this time taken from the correct point in the source tree :-/ I've fixed the javadoc comment as well.

        should we consider removing it entirely?

        I don't think so, it's a pretty fundamental operation. One way of simplifying it might be to make SpanCollector final, and have it collect either everything or nothing, so that creating subcollectors is easier. But that then makes it difficult to move payload collection out of core. Or maybe instead we could make SpanCollector implement Cloneable, and move the responsibility of building subcollectors directly into NearSpansOrdered?

        Adrien Grand added a comment -

        Or maybe instead we could make SpanCollector implement Cloneable, and move the responsibility of building subcollectors directly into NearSpansOrdered?

        This sounds better to me, but I'd like to see if we could avoid having to implement Cloneable too.

        I tried to look more at the way NearSpansOrdered is implemented and it seems to me that the only reason why it needs to consume sub spans eagerly is to be able to make the match as short as possible. Maybe we could drop this requirement and make NearSpansOrdered iterate lazily like NearSpansUnordered? SpanNearQuery does not even mention this feature in its javadocs.

        Alan Woodward added a comment -

        That might work, as long as NSO still reports all the matches, rather than just the first one it encounters. It will change scoring a bit, but that shouldn't be too much of a problem. I'll work on a patch.

        Paul Elschot added a comment -

        It might be possible to add peekNextStartPosition() and peekNextEndPosition() to Spans and then use those in NSO to leave the subspans at their positions, and still detect the shortest match non-lazily. With the subspans at their matching positions the payloads could be collected easily.
        That would require a peekNextPosition() method in PostingsEnum, and that is no problem in Lucene50PostingsReader. However, it would still be quite a bit of work (probably too much work) to get these extra methods implemented in all Spans.

        To make NSO iterate lazily, what else could be done after stretchToOrder() other than shrinkToAfterShortestMatch()?
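[Editor's note: as a rough illustration of what peeking costs, here is a minimal, hypothetical (non-Lucene) peekable wrapper over java.util.Iterator. Every Spans/PostingsEnum implementation would need this kind of one-element look-ahead buffer, which is the extra state the thread is weighing.]

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Wraps an Iterator so callers can inspect the next element without
// consuming it; the price is one buffered element of extra state.
public class PeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> in;
    private T buffered;
    private boolean hasBuffered;

    public PeekingIterator(Iterator<T> in) { this.in = in; }

    // Look at the next element without advancing past it.
    public T peek() {
        if (!hasBuffered) {
            buffered = in.next(); // pull one element ahead of the consumer
            hasBuffered = true;
        }
        return buffered;
    }

    @Override public boolean hasNext() { return hasBuffered || in.hasNext(); }

    @Override public T next() {
        if (hasBuffered) { hasBuffered = false; return buffered; }
        if (!in.hasNext()) throw new NoSuchElementException();
        return in.next();
    }

    public static void main(String[] args) {
        PeekingIterator<Integer> it =
            new PeekingIterator<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(it.peek()); // prints 1, without consuming it
        System.out.println(it.next()); // prints 1 again
        System.out.println(it.next()); // prints 2
    }
}
```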

        Adrien Grand added a comment -

        However, it would still be quite a bit of work

        Right, iterators that can peek are tricky to implement, I'd rather avoid that.

        To make NSO iterate lazily, what else could be done after stretchToOrder() than shrinkToAfterShortestMatch() ?

        I think we should only call stretchToOrder() as there is nothing better we can do without having information about further positions? Advancing any of the sub spans could make the span unordered or larger than the slop.

        Paul Elschot added a comment -

        I think we should only call stretchToOrder() as there is nothing better we can do without having information about further positions? Advancing any of the sub spans could make the span unordered or larger than the slop.

        After stretchToOrder(), there is not necessarily a match.
        So maybe call shrinkToFirstMatch() first, from the first subspans to the last but one (instead of the current shrinkToAfterShortestMatch(), which works the other way around).
        And then collect payloads of matching spans until after the shortest match?

        Adrien Grand added a comment -

        After stretchToOrder(), there is not necessarily a match.

        Indeed: my idea was to have a for loop where we first advance the first spans to the next position, then call stretchToOrder(). Then two cases: either it doesn't match and the for-loop continues, or it is a match and we return. It will return a superset of today's matches of NearSpansOrdered.

        So maybe shrinkToFirstMatch() first from the first subspans to the last but one (instead of now shrinkToAfterShortest the other way around.)

        I don't see how it would help with the eager advancing issue, ie. if you try to advance a sub span to minimize the distance and it goes too far, there is no way to go back?
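[Editor's note: the for-loop idea described above can be sketched over plain sorted position arrays; this is a toy model, not real Spans. It assumes single-position clauses, counts gaps against the slop, and, as noted in the thread, yields a superset of shortest-only matches. It also rescans sub-arrays from the start for simplicity, where a real implementation would iterate lazily.]

```java
// Lazy ordered-near matching over sorted position arrays:
// advance the first clause, stretch the others into order, test the slop.
public class LazyOrderedNear {
    // Smallest position in 'positions' that is >= min, or -1 if none.
    static int advanceTo(int[] positions, int min) {
        for (int p : positions) if (p >= min) return p;
        return -1;
    }

    // Counts ordered matches of the clauses within the given slop.
    static int countMatches(int[][] clauses, int slop) {
        int matches = 0;
        for (int first : clauses[0]) {       // advance the first clause
            int prev = first;
            boolean ordered = true;
            for (int i = 1; i < clauses.length; i++) {
                // "stretchToOrder": each clause strictly after the previous
                int next = advanceTo(clauses[i], prev + 1);
                if (next < 0) { ordered = false; break; }
                prev = next;
            }
            // gaps between clauses must fit within the slop
            if (ordered && prev - first - (clauses.length - 1) <= slop) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[][] clauses = { {0, 5}, {1, 6}, {2, 7} };
        System.out.println(countMatches(clauses, 0)); // prints 2
    }
}
```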

        Alan Woodward added a comment -

        Here's a patch that makes NearSpansOrdered non-lazy in the way Adrien described, and simplifies the SpanCollector accordingly.

        Should I break out the changes to NearSpansOrdered into their own issue? It seems like a big enough change in its own right, really.

        Adrien Grand added a comment -

        Thanks Alan, I think this is much simpler now. NearSpansOrdered still refers to SpanCollector.newSubCollectors(); I think we can remove it now.

        Should I break out the changes to NearSpansOrdered into their own issue? It seems like a big enough change in its own right, really.

        This makes sense to me.

        Alan Woodward added a comment -

        I opened LUCENE-6537

        Alan Woodward added a comment -

        Updated patch, following on from LUCENE-6537. I'd like to commit this, then backport LUCENE-6490 and LUCENE-6537.

        ASF subversion and git services added a comment -

        Commit 1684701 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1684701 ]

        LUCENE-6371: Move CHANGES entry to 5.3

        Alan Woodward added a comment -

        Here's a patch for 5x. What with reversions and overlapping commits, it turns out the easiest thing to do was to merge the patches for LUCENE-6466, LUCENE-6537 and LUCENE-6371 into one.

        All tests are passing, but I want to beast this against Java 7 for a bit to check that LUCENE-6490 is fixed.

        Alan Woodward added a comment -

        Smaller patch, after LUCENE-6466 and LUCENE-6537

        ASF subversion and git services added a comment -

        Commit 1685577 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1685577 ]

        LUCENE-6371: Turn off test verbosity

        ASF subversion and git services added a comment -

        Commit 1685578 from Alan Woodward in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1685578 ]

        LUCENE-6371: Add SpanCollector

        Alan Woodward added a comment -

        Thanks everyone!

        ASF subversion and git services added a comment -

        Commit 1685614 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1685614 ]

        LUCENE-6371: Remove deprecated SpanNearPayloadCheckQuery

        ASF subversion and git services added a comment -

        Commit 1685814 from Alan Woodward in branch 'dev/trunk'
        [ https://svn.apache.org/r1685814 ]

        LUCENE-6371: Remove unused 'collectPayloads' parameter from SpanNearQuery

        ASF subversion and git services added a comment -

        Commit 1685825 from Alan Woodward in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1685825 ]

        LUCENE-6371: Deprecate 'collectPayloads' parameter in SpanNearQuery

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release


          People

            • Assignee: Alan Woodward
            • Reporter: Paul Elschot
            • Votes: 0
            • Watchers: 9