Lucene - Core
LUCENE-6276

Add matchCost() api to TwoPhaseDocIdSetIterator

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      We could add a method like TwoPhaseDISI.matchCost(), defined as something like an estimate in nanoseconds or similar.

      ConjunctionScorer could use this method to sort its 'twoPhaseIterators' array so that cheaper ones are called first. Today it has no idea whether one scorer is a simple phrase scorer on a short field while another might do a geo calculation or other more expensive work.

      PhraseScorers could implement this based on index statistics (e.g. totalTermFreq/maxDoc).
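
      As a rough illustration of that idea (a sketch only, not code from this issue; the class and method names are made up), a phrase scorer could estimate its match cost from per-term statistics along these lines:

        // Hedged sketch: estimate a phrase's per-document match cost from
        // index statistics. Names and the fallback value are assumptions.
        final class MatchCostSketch {
          static float estimatedPhraseMatchCost(long[] totalTermFreqs, int[] docFreqs) {
            float cost = 0f;
            for (int i = 0; i < totalTermFreqs.length; i++) {
              // average positions to read per matching document;
              // assume 1 when totalTermFreq is unavailable (-1)
              float avgPositions = totalTermFreqs[i] <= 0
                  ? 1f
                  : totalTermFreqs[i] / (float) Math.max(1, docFreqs[i]);
              cost += avgPositions;
            }
            return cost;
          }
        }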

      1. LUCENE-6276.patch
        40 kB
        Paul Elschot
      2. LUCENE-6276.patch
        39 kB
        Paul Elschot
      3. LUCENE-6276.patch
        38 kB
        Paul Elschot
      4. LUCENE-6276.patch
        36 kB
        Paul Elschot
      5. LUCENE-6276.patch
        35 kB
        Paul Elschot
      6. LUCENE-6276.patch
        35 kB
        Paul Elschot
      7. LUCENE-6276.patch
        29 kB
        Paul Elschot
      8. LUCENE-6276.patch
        21 kB
        Paul Elschot
      9. LUCENE-6276-ExactPhraseOnly.patch
        10 kB
        Paul Elschot
      10. LUCENE-6276-NoSpans.patch
        19 kB
        Paul Elschot
      11. LUCENE-6276-NoSpans2.patch
        18 kB
        Paul Elschot

        Activity

        Adrien Grand added a comment -

        I like the idea. I'm curious if you already have concrete ideas for the match costs of our existing queries? Maybe it should not only measure the cost of the operation but also how likely it is to match? This would make sloppy phrases more "costly" since they are more lenient about positions and thus more likely to match.

        Robert Muir added a comment -

        I'm curious if you already have concrete ideas for the match costs of our existing queries?

        See above in the description: we know the average number of positions per doc (totalTermFreq/docFreq) and so on. So we can compute the amortized cost of reading one position, and it's easy from there.

        Maybe it should not only measure the cost of the operation but also how likely it is to match?

        I don't agree. You can already get this with Scorer.getApproximation().cost()/Scorer.cost().

        Paul Elschot added a comment -

        From the javadocs of DocIdSetIterator.cost():
        This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.

        Perhaps this cost method could be renamed to, for example, expectedMaxMatchingDocs() with these javadocs:
        This is an expected upper bound of the number of documents this iterator might match.

        Would it make sense to put matchCost() at DocIdSetIterator?

        Adrien Grand added a comment -

        Perhaps this cost method could be renamed to, for example, expectedMaxMatchingDocs() with these javadocs: This is an expected upper bound of the number of documents this iterator might match.

        Personally I don't dislike "cost". Even if it does not carry the meaning of what it measures very well, it does a pretty good job of conveying what to do with the result of this method: if you have several iterators, you want to consume the least costly ones first.

        Would it make sense to put matchCost() at DocIdSetIterator?

        The DocIdSetIterator abstraction does not have the concept of "matching", only TwoPhaseIterator has it, so I think it would be awkward to have it on DocIdSetIterator? TwoPhaseIterator feels like a more appropriate place to have this method.
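
        For orientation, the shape TwoPhaseIterator eventually took is roughly the following (paraphrased, not copied from the committed source; javadoc wording and member details are approximate):

          import java.io.IOException;
          import org.apache.lucene.search.DocIdSetIterator;

          // Rough paraphrase of the API shape discussed here.
          public abstract class TwoPhaseIterator {
            protected final DocIdSetIterator approximation;

            protected TwoPhaseIterator(DocIdSetIterator approximation) {
              this.approximation = approximation;
            }

            /** First, cheap phase: an approximation that may match too many docs. */
            public DocIdSetIterator approximation() {
              return approximation;
            }

            /** Second phase: confirm that the doc the approximation is on really matches. */
            public abstract boolean matches() throws IOException;

            /** Estimate of the cost of calling matches() on the current document. */
            public abstract float matchCost();
          }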

        Paul Elschot added a comment -

        As to TwoPhaseIterator or DocIdSetIterator, I think this boils down to whether the leading iterator in ConjunctionDISI should be chosen using the expected number of matching docs only, or also using the totalTermFreq's somehow. This is for more complex queries, for example a conjunction with at least one phrase or SpanNearQuery.

        But for the more complex queries two phase approximation is already in place, so having matchCost() only in the two phase code could be enough even for these queries.

        Paul Elschot added a comment -

        Patch of 12 Oct 2015, starting matchCost for ExactPhraseScorer only; the rest does not compile because it still needs matchDoc.
        (A little too much whitespace was removed by the editor; please ignore the noise.)

        Is this the direction to take?

        Adrien Grand added a comment -

        I think it would make more sense to sum up totalTermFreq/docFreq for each term instead of totalTermFreq/conjunctionDISI.cost(), so that we get the average number of positions per document? But otherwise I think you got the intention right. Something else to be careful with is that TermStatistics.totalTermFreq() may return -1, so we need a fallback for that case. Maybe we could just assume 1 position per document?

        A related question is what definition we should give to matchCost(). The patch does not have the issue yet since it only deals with phrase queries, but eventually we should be able to compare the cost of eg. a phrase query against a doc values range query even though they perform very different computations. Maybe the javadocs of matchCost could suggest a scale of costs of operations that implementors of matchCost() could use in order to compute the cost of matching the two-phase iterator. It could be something like 1 for nextDoc(), nextPosition(), comparisons and basic arithmetic operations and eg. 10 for advance()?

        Robert Muir added a comment -

        As to TwoPhaseIterator or DocIdSetIterator, I think this boils down to whether the leading iterator in ConjunctionDISI should be chosen using the expected number of matching docs only, or also using the totalTermFreq's somehow. This is for more complex queries, for example a conjunction with at least one phrase or SpanNearQuery.

        But for the more complex queries two phase approximation is already in place, so having matchCost() only in the two phase code could be enough even for these queries.

        Yes, to keep things simple, I imagined this API would just be the cost of calling matches() itself, so I think the two-phase API is the correct place to put it (as in your patch).

        We already have a cost() API for DISI for doing things like conjunctions (yes, it's purely based on density, and maybe that is imperfect), but I think we should try to narrow the scope of this issue to just the cost of the matches() operation, which can vary wildly depending on query type or document size.

        What Adrien says about "likelihood of match" is also interesting, but I think we want to defer that too. To me that is just a matter of having a more accurate cost(), and it may not be easy or feasible to improve...

        Paul Elschot added a comment -

        it would make more sense to sum up totalTermFreq/docFreq for each term

        I'll change that and change the matchCost() method to return a float instead of a long.

        TermStatistics.totalTermFreq() may return -1

        I'll add a check for that.

        what definition we should give to matchCost()

        I'd like to have it reflect an average cost to process a single document, once the two-phase iterator is at the document.
        That would exclude the cost of next() and advance(), which would be better left in the DISI.cost() method for now.

        How much of the cost of matches() should be in there I don't know; we'll see. NearSpans also does work after matches() returns true.

        And the likelihood of match is the probability that matches() returns true...

        Adrien Grand added a comment -

        change the matchCost() method to return a float instead of a long

        I liked having it as a long, like DISI.cost(). Maybe we could just round?

        I'd like to have it reflect an average cost to process a single document, once the two-phase iterator is at the document. That would exclude the cost of next() and advance(), which would be better left in the DISI.cost() method for now.

        Indeed, this is what it should do! Sorry I introduced some confusion; the reason I brought up these methods is ReqExclScorer, whose TwoPhaseIterator calls DocIdSetIterator.advance() on the excluded iterator in order to validate a match. So we need to decide how costly calling advance() is.

        Paul Elschot added a comment -

        Patch of 13 Oct 2015. No spans yet.
        Left matchDoc() returning float because in many cases the average number of positions in a matching document will be close to 1.
        Quite a few nocommits at matchDoc implementations throwing an Error("not yet implemented").

        This includes a first attempt at sorting the DISI's in ConjunctionDISI.

        To my surprise, quite a few tests pass, I have not yet tried all of them.

        Adrien Grand added a comment -

        The change in ConjunctionDISI does not look right to me: we should keep sorting the iterators based on DISI.cost, and only use TwoPhaseIterator.matchCost to sort TwoPhaseConjunctionDISI.twoPhaseIterators.

        I'm also unhappy about adding a method to TermStatistics; this class should remain as simple as possible. Can we make it private to PhraseWeight?
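
        The sorting being asked for amounts to something like the following (an illustration of the idea only, not the TwoPhaseConjunctionDISI code; matchCost() is the method proposed in this issue):

          import java.util.Arrays;
          import org.apache.lucene.search.TwoPhaseIterator;

          // Illustrative sketch: order the two-phase iterators of a conjunction so
          // that the cheapest matches() implementation is consulted first.
          final class SortByMatchCost {
            static void sort(TwoPhaseIterator[] twoPhaseIterators) {
              Arrays.sort(twoPhaseIterators,
                  (a, b) -> Float.compare(a.matchCost(), b.matchCost()));
            }
          }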

        Paul Elschot added a comment -

        ... unhappy about adding a method to TermStatistics, this class should remain as simple as possible. Can we make it private to PhraseWeight?

        Why should TermStatistics remain as simple as possible? Having a method that returns an expected value in a ...Statistics class looks just right to me.

        I initially had the code in PhraseWeight, but many getter methods from TermStatistics were used there, so I moved the method to TermStatistics and used the return value as the match cost for a single term in PhraseWeight.

        I think we will need the same thing for spans (multiplied by a factor of 4 or so), and in that case the method will have to be public, because the spans are in a different package. Could we hold off on moving the method until the spans are done here?

        ... only use TwoPhaseIterator.matchCost to sort TwoPhaseConjunctionDISI.twoPhaseIterators.

        I missed that, I'll introduce it.

        Paul Elschot added a comment -

        Second patch of 13 Oct 2015:
        Use matchCost to sort twoPhaseIterators.
        Add matchCost implementations in test code.
        Rename method expPositionsPerDoc() to expTermFreqInMatchingDoc().

        Paul Elschot added a comment -

        matchCost is still not implemented for Spans (4 nocommits left), and now some test cases using Spans actually fail.

        Paul Elschot added a comment -

        We could take into account the different costs of advance() and nextDoc(), but in another issue.
        With cost() as an estimation of the number of matching documents, as it is now:
        for conjunctions that could become: 2 * (minimum cost()) * (cost of advance),
        and for disjunctions: (total cost()) * (cost of nextDoc).

        ReqExclScorer could use the cost of advance in its matchCost already here, but I have no idea which value to use.
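
        To make the arithmetic of those two formulas explicit, a tiny sketch (the per-operation weights for advance() and nextDoc() are left as parameters, since no concrete values were chosen here):

          // Illustrative only: first-phase iteration cost with separate weights
          // for advance() and nextDoc(), restating the two formulas above.
          final class IterationCostSketch {
            // conjunction: 2 * (minimum cost()) * (cost of advance)
            static double conjunctionCost(long minSubCost, double advanceOpCost) {
              return 2.0 * minSubCost * advanceOpCost;
            }

            // disjunction: (total cost()) * (cost of nextDoc)
            static double disjunctionCost(long totalSubCost, double nextDocOpCost) {
              return totalSubCost * nextDocOpCost;
            }
          }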

        Adrien Grand added a comment -

        TermStatistics is a class that we need to maintain backward compatibility for since it's not experimental/internal. So we shouldn't put more methods in there that we only need for implementation details of PhraseQuery/SpanNearQuery. I would rather duplicate the logic. In addition, the current implementation of this method is trappy as it assumes that the average term freq is 1 when totalTermFreq is not available. While this might be ok for the matchCost computation of phrase queries, it might not be for other use-cases.

        The changes in ConjunctionDISI look good to me now, thanks.

        +            if (w.twoPhaseView != null) {
        +              matchCost += w.twoPhaseView.matchCost();
        +            } else {
        +              assert w.iterator instanceof TermScorer; // zero match cost.
        +            }
        

        w.twoPhaseView can be null on any scorer that does not expose an approximation. So it can be not only a TermScorer, but also a conjunction/disjunction of term scorers or even a custom query.

        Paul Elschot added a comment -

        TermStatistics.java has a @lucene.experimental javadoc in trunk.

        I'll remove the assert w.iterator instanceof TermScorer; I put it there to remind me to check what to do in other cases.

        Paul Elschot added a comment -

        Patch of 14 October 2015.
        No more NOCOMMITS, existing tests pass.
        Still no tests to verify that matchCost() is used correctly.

        Some FIXMEs for the cost values used.
        Improve javadoc for TermStatistics.expTermFreqInMatchingDoc.
        In AssertingTwoPhaseView.matchCost() the cost should be non-negative.
        Small javadoc correction in TwoPhaseIterator.

        Robert Muir added a comment -

        I agree with Adrien here. TermStatistics and CollectionStatistics are what feed the scoring systems (and already hairy enough as is), so we should keep any of this optimization-related stuff out of them. They were added to allow IndexSearcher to support distributed search.

        Paul Elschot added a comment (edited) -

        2nd patch of 15 October 2015.

        This adds Spans.positionsCost() as the basic matchCost to be used for Spans.
        This method has a NOCOMMIT in Spans.java: throw UOE or make it abstract in Spans?
        I'd prefer to throw a UOE, but an abstract method makes it easier to find the places where positionsCost() really needs to be implemented.
        For now I left it abstract and have some implementations throw UOE.

        Other changes to the previous patch:
        Use TermStatistics from trunk.
        Move expTermFreqInMatchingDoc() from TermStatistics into PhraseWeight and a copy into TermSpans.
        Simplified matchCost() implementations.

        Existing tests pass.

        This will need improvements, and I hope it works decently for simple cases like a conjunction over a phrase and a SpanNear.

        Paul Elschot added a comment -

        Some TwoPhaseIterators in Solr will need to have matchCost() added with the latest patch.
        I am not familiar enough with Solr code for that.

        Paul Elschot added a comment -

        Patch of 18 Oct 2015.

        • Calculate the matchCost per LeafReaderContext, because the sorting by matchCost is done at leaf level.
        • In the facet, join and spatial modules, add matchCost implementations returning 0, with a CHECKME comment.
        • Fixed a bug in the earlier added sort comparator in the TwoPhaseConjunctionDISI constructor; it was comparing the same object.
        • Dropped the NOCOMMIT for Spans.positionsCost(); I prefer this to be an abstract method.
        • In AssertingSpans the positionsCost should be positive.
        • Add some toString() implementations for inline subclasses of TwoPhaseIterator to ease debugging.
        Paul Elschot added a comment (edited) -

        2nd patch of 18 Oct 2015. More improvements:

        • Add TwoPhaseIterator.termPositionsCost() for use in matchCost() implementations. This replaces the earlier PhraseQuery.expTermFreqInMatchingDoc().
        • In SpanOrQuery use a better average cost for matchCost().
        • Move positionsCost() throwing UOE from Near/ContainSpans to ConjunctionSpans only.
        • matchCost must be non-negative in AssertingScorer.
        • Use a random matchCost in RandomApproximationQuery.
        • Better javadocs and code comments.
        David Smiley added a comment -

        This is neat. Couple things...

        • RandomAccessWeight's 2-phase should probably call to a protected method on RAW so that a subclass could define the cost based on how expensive using the Bits is. Maybe it should be abstract and not zero?
        • It will be difficult for many of the 2-phase implementations to calculate a matchCost – particularly the ones not based on the number of term positions. What to do? A constant of '0' (which you often labelled FIXME) is way too cheap, I think. Maybe the best we can hope for is simply a stable sort of same-cost 2-phases, and assuming that the order of queries added to a BooleanQuery.Builder remains stable. This at least allows the user to tune performance by changing the clause order for such constant-matchCost queries. But I see that the latest BooleanQuery.Builder is not stable due to use of HashSet / MultiSet versus LinkedHashSet, which would be stable. What do you think Adrien Grand? Alternatively (or in addition) a query wrapper could allow explicitly setting a cost (vaguely similar to Solr's ExtendedQuery.getCost). This could look similar to BoostQuery but for 2-phase matchCost instead of score.
        • Maybe the explain output could display the matchCost? It'd be nice to be able to troubleshoot/inspect it somehow. Not critical, of course.
        Paul Elschot added a comment -

        I left the matchCosts that I could not easily determine at zero and added a CHECKME. This is more an indication that refinement is possible.

        Sorting subscorers/subspans by cost and matchCost is probably better than relying on any given order.
        Anyway, I don't expect the impact of matchCost on performance to be more than 4-8%, except maybe for really complex queries.

        Showing the matchCost in explain will be tricky because it is computed per LeafReaderContext, i.e. per segment.

        The matchCost is not yet used for the second phase in disjunctions. Yet another priority queue might be needed for that, so I'd prefer to delay that to another issue.

        Paul Elschot added a comment (edited) -

        Another thing to be determined is the relative cost of span queries vs phrase queries.
        The code for that is in SpanTermQuery here:

          /** A guess of
           * the relative cost of dealing with the term positions
           * when using a SpanNearQuery instead of a PhraseQuery.
           */
          private final float PHRASE_TO_SPAN_TERM_POSITIONS_COST = 4.0f;
        

        This is a guess, because it is only based on my recollection from a few years ago that the performance of PhraseQuery was about 4 times better than that of an ordered SpanNear.
        In the long term it is probably better to make this a configurable parameter.

        Paul Elschot added a comment -

        Talking about priority queues, there is also this one: LUCENE-6453.

        Adrien Grand added a comment -

        It will be difficult for many of the 2-phase implementations to calculate a matchCost – particularly the ones not based on the number of term positions. What to do?

        Agreed: we need to come up with a very simple definition of matchCost that can be applied regardless of how matches() is implemented. I think we have two options:

        • either an estimate running time of matches() in nanoseconds,
        • or an average number of operations that need to be performed in matches(), so that you would add +1 every time you do a comparison, arithmetic operation, consume a PostingsEnum, etc.

        Runtimes in nanoseconds could easily vary depending on hardware, JVM version, etc. so I think the 2nd option is more practical. For instance:

        • for a phrase query, we would return the sum of the average number of positions per document for each term (which is an estimate of how many times you will call PostingsEnum.nextPosition()). Maybe we could try to fold in the cost of balancing the priority queue too.
        • for a doc values range query on numbers, the match cost would be 3: one dv lookup and 2 comparisons
        • for a geo distance query that uses SloppyMath.haversin to confirm matches, we could easily count how many operations are performed by SloppyMath.haversin

        This is simplistic but I think it would do the job and keep the implementation simple. For instance, a doc values range query would always be confirmed before a geo-distance query.
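
        Put as code, those example costs could be returned by individual matchCost() implementations roughly like this (a sketch restating the numbers above; nothing here is committed code):

          // Illustrative operation-count costs following the scale suggested above.
          final class OperationCountExamples {
            // doc values range query: one dv lookup plus two comparisons
            static float docValuesRangeMatchCost() {
              return 3f;
            }

            // phrase query: roughly one nextPosition() call per expected position,
            // summed over the terms of the phrase
            static float phraseMatchCost(float[] avgPositionsPerDocPerTerm) {
              float cost = 0f;
              for (float avg : avgPositionsPerDocPerTerm) {
                cost += avg;
              }
              return cost;
            }
          }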

        But I see that the latest BooleanQuery.Builder is not stable due to use of HashSet / MultiSet versus LinkedHashSet which would be stable. What do you think Adrien Grand?

        Actually it is: those sets and multisets are only used for equals/hashcode. The creation of scorers is still based on the list of clauses, which maintains the order from the builder.

        Showing the matchCost in explain will be tricky because it is computed by LeafReaderContext, i.e. by segment.

        +1 to not do it

        The matchCost is not yet used for the second phase in disjunctions. Yet another priority queue might be needed for that, so I'd prefer to delay that to another issue.

        Feel free to delay, I plan to explore this in LUCENE-6815.

        David Smiley added a comment -

        RE BooleanQuery stable ordering: thanks for correcting me; I'm very glad it's stable. At least this gives the user some control.

        I think we may need to percolate the matchCost concept further into other APIs – namely ValueSource/FunctionValues. This way the 2-phase iterator can enclose one of them and fetch its matchCost when composing its aggregate matchCost.

        And I question whether a numeric DV lookup or reading a posting is equivalent to one mathematical operation, since those involve code behind them that isn't a cheap one-liner the way a math operation is. Nonetheless, I get the concept you are suggesting.

        I kind of like the time-based approach you suggested better, but what's needed is some automation to help establish a baseline, such that someone on their own machine can get an approximate slower/faster multiplier compared to some baseline server. For example, what if there were an ant target that ran a test based on Wikipedia data and established that the matchCost for some PhraseQuery on the machine it's run on is ____ nanoseconds? The output would also display some hard-coded value that we got when running this on a baseline server. Dividing the two yields the relative difference between the local machine and the baseline server. Then, when working on a custom query, I could locally and temporarily either modify the timing test to test my new query (perhaps plucking the geo data in Wikipedia out into a lat-lon spatial field) or time it in my own way. Then I apply the multiplier to determine which number to hard-code into the query going into Lucene. Make sense?

        There is another aspect of the cost beyond a per-postings iteration cost. The cost of reading the N+1 position is generally going to be much cheaper than reading the very first position, since the first one possibly involves a seek. If only postings-based queries are being compared via matchCost, this is a wash since all of them have this cost, but it'd be different for a doc-values based query. Although perhaps it's a wash there too – assume one disk seek? Worst case, of course.

        Paul Elschot added a comment -

        an average number of operations that need to be performed in matches(), so that you would add +1 every time you do a comparison, arithmetic operation, consume a PostingsEnum, etc.

        That sounds doable. The Lucene50PostingsReader takes about 7 such operations for a nextPosition() call when it does not need to seek and/or refill its buffer. I assume that would be the cost of consuming a PostingsEnum.

        The cost of reading the N+1 position is generally going to be much cheaper than reading the very first position since the first once possibly involves a seek.

        To take that into account, an estimate of the seek/refill cost per term could be added in once per document.

        we may need to percolate the matchCost concept further into other APIs – namely ValueSource/FunctionValues

        Could that be done in another issue?

        David Smiley added a comment -

        To take that into account, an estimate of the seek/refill cost per term could be added in once per document.

        Right; I just wanted to point this out.

        we may need to percolate the matchCost concept further into other APIs – namely ValueSource/FunctionValues

        Could that be done in another issue?

        Yes, of course.

        Paul Elschot added a comment -

        Patch of 27 October 2015. This

        • is against trunk of today,
        • resolves the conflicts from Spans becoming a Scorer,
        • changes the matchCost to be an estimate of the number of operations needed for the positions. This uses a cost of 7 per position, as in the codec code, and a basic cost of 128 (no more than a guess) for the initial seek and refill.
        • adds a few more implementations to make ant compile-test pass at top level.

        Since 7 is rather low and the expected number of positions per document containing the term is just above 1 in many cases, I left matchCost() returning a float.
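
        In other words, the per-term estimate boils down to something like the following (a paraphrase using the 7 and 128 values from this comment; the committed helper may differ in detail):

          // Paraphrase of the per-term positions cost described above; the
          // constants are the guesses from this comment, and the method shape
          // is an approximation, not the committed signature.
          final class TermPositionsCostSketch {
            static final int TERM_POSNS_SEEK_OPS_PER_DOC = 128; // initial seek/refill, a guess
            static final int TERM_OPS_PER_POS = 7;              // ops per nextPosition()

            static float termPositionsCost(long totalTermFreq, int docFreq) {
              // expected occurrences of the term in a document that contains it;
              // fall back to 1 when totalTermFreq is unavailable
              float expOccurrences = totalTermFreq <= 0 ? 1f : totalTermFreq / (float) docFreq;
              return TERM_POSNS_SEEK_OPS_PER_DOC + expOccurrences * TERM_OPS_PER_POS;
            }
          }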

        Adrien Grand added a comment -

        Some suggestions:

        • could the match costs be computed eagerly instead of lazily, like we compute the costs of DocIdSetIterators?
        • can you move the utility methods to compute costs of phrases from TwoPhaseIterator into PhraseWeight/SpanNearQuery? I don't like leaking implementation details of specific TwoPhaseIterators into TwoPhaseIterator.
        • some implementations of matchCost return 0 (eg. RandomAccessWeight); I'm fine with not implementing every match cost for now, but could you leave a TODO and use a higher constant (eg. 100)? I think it's safer to assume that such implementations are costly until they are implemented correctly.
        • re: ReqExclScorer: indeed I think we should try to take into account the cost of advancing the excluded scorer. If this is complicated, I'm fine with tackling this problem later.
        • re: DisjunctionScorer: I think your current definition of the match cost is fine. Maybe we could improve it by weighting the match cost of each TwoPhaseIterator by the cost of their approximation. I think it would make sense since we will call TwoPhaseIterator.matches() more often if their approximation matches more documents.
        • AssertingSpans/AssertingScorer ensure that the match cost is >= 0. Maybe we should also check for NaN and document acceptable return values in the documentation of #matchCost?
        Paul Elschot added a comment -

        I basically agree to all of these.

        ... move the utility methods to compute costs of phrases from TwoPhaseIterator into PhraseWeight/SpanNearQuery. I don't like leaking implementation details of specific TwoPhaseIterators into TwoPhaseIterator.

        and make them (package) private I assume? The only disadvantage of that is that some duplication of these methods is needed in the spans package.

        The easiest way to avoid such duplication would be when Spans moves from o.a.l.search.spans to o.a.l.search.
        IIRC there was some talk of that not so long ago (Alan's plans for spans), so how about waiting for that, possibly in a separate issue?

        It will take a while (at least a week) before I can continue with this. Please feel free to take it on.

        Paul Elschot added a comment -

        Patch of 2 Nov 2015.
        This addresses all the above concerns.
        It also precomputes positionsCost in SpanOrQuery, weighted by the cost() in the same way as matchCost.
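
        The weighting in question can be thought of as a cost()-weighted average over the sub-iterators, along these lines (an illustration of the idea, not the SpanOrQuery or DisjunctionScorer code):

          import org.apache.lucene.search.TwoPhaseIterator;

          // Illustrative sketch: weight each sub-iterator's matchCost() by the
          // cost() of its approximation, so subs whose approximations match more
          // documents dominate the combined estimate.
          final class WeightedMatchCostSketch {
            static float weightedMatchCost(TwoPhaseIterator[] subs) {
              double totalCost = 0, weightedSum = 0;
              for (TwoPhaseIterator tpi : subs) {
                long c = tpi.approximation().cost();
                totalCost += c;
                weightedSum += c * (double) tpi.matchCost();
              }
              return totalCost == 0 ? 0f : (float) (weightedSum / totalCost);
            }
          }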

        Adrien Grand added a comment -
        • can you add to the javadocs of TwoPhaseIterator#matchCost that match costs need to be a positive number?
        • can you add some comments around the cost computation for disjunctions/conjunctions to explain the reasoning?
        • I would prefer termPositionsCost to be duplicated in PhraseWeight and SpanNearQuery than in TwoPhaseIterator, like we do for disjunctions (SpanOrQuery and DisjunctionScorer). I can understand the concerns around duplication but I think it's still cleaner than trying to share the logic by adding utility methods to TwoPhaseIterator.
        • I think SpanTermQuery.PHRASE_TO_SPAN_TERM_POSITIONS_COST should be static?

        Otherwise the change looks good to me. I like the cost definition for conjunctions/disjunctions/phrases, and we can tackle other queries in follow-up issues, but I think this is already a great start and will help execute slow queries more efficiently!

        Paul Elschot added a comment -

        Patch of 7 Nov 2015.
        This addresses all concerns of 3 days ago.
        termPositionsCost moved from TwoPhaseIterator to PhraseQuery; I left a copy in SpanTermQuery because that is where it is used.

        Perhaps the result of ConjunctionSpans.asTwoPhaseIterator() should look more like TwoPhaseConjunctionDISI; at the moment I cannot get my head around this.

        Paul Elschot added a comment -

        I went over the patch and the earlier posts to get an overview of open points, TODOs, etc.
        There are quite a lot of them, so we'll need to prioritize and/or move/defer to other issues.

        lucene core:

        ConjunctionDISI matchCost(): give the lower matchCosts a higher weight

        PhraseQuery:
        TERM_POSNS_SEEK_OPS_PER_DOC = 128, guess
        PHRASE_TO_SPAN_TERM_POSITIONS_COST = 4, guess

        TwoPhaseIterator: Return value of matchCost(): long instead of float?

        RandomAccessWeight matchCost(): 10, use cost of matchingDocs.get()

        ReqExclScorer matchCost(): also use cost of exclApproximation.advance()

        SpanTermQuery: termPositionsCost is copy of PhraseQuery termPositionsCost

        SpanOrQuery: add cost of balancing priority queues for positions?

        facet module (defer to other issue):

        DoubleRange matchCost(): 100, use cost of range.accept()
        LongRange matchCost(): 100, use cost of range.accept()

        join module (defer to other issue ?):

        GlobalOrdinals(WithScore)Query matchCost(): 100, use cost of values.getOrd() and foundOrds.get()
        GlobalOrdinals(WithScore)Query 2nd matchCost(): 100, use cost of values.getOrd() and foundOrds.get()

        queries module (defer to other issue):

        ValueSourceScorer matchCost(): 100, use cost of ValueSourceScorer.this.matches()

        spatial module (defer to other issue):

        CompositeVerifyQuery matchCost(): 100, use cost of predFuncValues.boolVal()
        IntersectsRPTVerifyQuery matchCost(): 100, use cost of exactIterator.advance() and predFuncValues.boolVal()

        test-framework module:

        RandomApproximationQuery randomMatchCost: between 0 and 200: ok?

        solr core:

        Filter matchCost(): 10, use cost of bits.get() ?

        At this issue:

        Performance test based on Wikipedia to estimate guessed values.

        tests for matchCost() ?

        Check result of ConjunctionSpans.asTwoPhaseIterator: more similar to TwoPhaseConjunctionDISI ?

        For other issues:

        At LUCENE-6871 remove copy of SpanTermQuery.termPositionsCost().

        SpanOrQuery is getting too big, split off DisjunctionSpans.

        The cost() implementation of conjunctions and disjunctions could improve: add use of the independence assumption.
        The result of cost() is used here for weighting, so it should be as good as possible.
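
        To make the combination rules referred to in this list concrete, a rough sketch with invented helper names (this is not the patch code): a conjunction assumes the worst case and simply adds up the match costs of its two-phase clauses, while a disjunction weights each clause's matchCost by that clause's cost(), as described for SpanOrQuery above.

           // Rough sketch only; class and method names are invented for illustration.
           final class MatchCostCombinations {

             // Conjunction: in the worst case every two-phase clause has to confirm the match,
             // so the per-clause match costs add up.
             static float conjunctionMatchCost(float[] clauseMatchCosts) {
               float sum = 0f;
               for (float matchCost : clauseMatchCosts) {
                 sum += matchCost;
               }
               return sum;
             }

             // Disjunction: a clause whose approximation matches more documents (higher cost())
             // is more likely to be the one that needs verification, so its matchCost gets more weight.
             static float disjunctionMatchCost(float[] clauseMatchCosts, long[] clauseCosts) {
               double totalCost = 0;
               double weightedMatchCost = 0;
               for (int i = 0; i < clauseMatchCosts.length; i++) {
                 totalCost += clauseCosts[i];
                 weightedMatchCost += clauseCosts[i] * (double) clauseMatchCosts[i];
               }
               return totalCost == 0 ? 0f : (float) (weightedMatchCost / totalCost);
             }
           }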

        Adrien Grand added a comment -

        ConjunctionDISI matchCost(): give the lower matchCosts a higher weight

        We could use the likelihood of a match, which should be given by Scorer.asTwoPhaseIterator().approximation().cost()/Scorer.cost(), even though I suspect that most implementations have no way to figure it out (e.g. numeric doc values ranges). But I think we should defer; it's fine to assume the worst case like the patch does today.

        TwoPhaseIterator: Return value of matchCost(): long instead of float?

        I would be OK with both, but given that matchCost is documented as "an expected cost in number of simple operations", maybe a long makes more sense? It also has the benefit of avoiding issues with ±0, NaNs, infinities, etc.

        Performance test based on Wikipedia to estimate guessed values.

        I think this change is very hard to benchmark... I'm personally fine with moving on here without performance benchmarks.

        For other ones that I did not reply to, I suggest that we defer them: I don't think they should hold this change.

        Paul Elschot added a comment -

        As to long/float: the expected outcome of rolling a die is not a whole number, and as it happens, that has some similarities with the situation here.
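
        To put made-up numbers on that point, using the disjunctionMatchCost sketch from a few comments above: a cheap clause (matchCost 1) whose approximation matches three times as often as an expensive clause (matchCost 100) gives an expected cost of 25.75, which a long return type would have to round.

           // Made-up numbers, only to show that the expected cost need not be a whole number.
           public class ExpectedMatchCostExample {
             public static void main(String[] args) {
               float[] matchCosts = { 1f, 100f }; // cheap clause vs. expensive clause
               long[] costs = { 3L, 1L };         // the cheap clause's approximation matches 3x as often
               // (3*1 + 1*100) / (3+1) = 25.75
               System.out.println(MatchCostCombinations.disjunctionMatchCost(matchCosts, costs));
             }
           }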

        Paul Elschot added a comment -

        I have opened LUCENE-6894 for the independence assumption for DISI.cost().

        Adrien Grand added a comment -

        I'm +1 on the patch. I'll do some more testing locally over the next few days and commit it.

        Adrien Grand added a comment -

        Hmm, I'm getting failures with ComplexPhraseQuery:

           [junit4] Suite: org.apache.lucene.queryparser.complexPhrase.TestComplexPhraseQuery
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestComplexPhraseQuery -Dtests.method=testComplexPhrases -Dtests.seed=E7F242A6F40525AB -Dtests.slow=true -Dtests.locale=in_ID -Dtests.timezone=Africa/Banjul -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] FAILURE 0.18s J2 | TestComplexPhraseQuery.testComplexPhrases <<<
           [junit4]    > Throwable #1: java.lang.AssertionError
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([E7F242A6F40525AB:8259F91767B2FFA3]:0)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanOrQuery$SpanOrWeight$1.positionsCost(SpanOrQuery.java:261)
           [junit4]    > 	at org.apache.lucene.search.spans.ScoringWrapperSpans.positionsCost(ScoringWrapperSpans.java:88)
           [junit4]    > 	at org.apache.lucene.search.spans.FilterSpans$2.matchCost(FilterSpans.java:167)
           [junit4]    > 	at org.apache.lucene.search.ConjunctionDISI$TwoPhaseConjunctionDISI.<init>(ConjunctionDISI.java:186)
           [junit4]    > 	at org.apache.lucene.search.ConjunctionDISI$TwoPhaseConjunctionDISI.<init>(ConjunctionDISI.java:164)
           [junit4]    > 	at org.apache.lucene.search.ConjunctionDISI$TwoPhase.<init>(ConjunctionDISI.java:227)
           [junit4]    > 	at org.apache.lucene.search.ConjunctionDISI$TwoPhase.<init>(ConjunctionDISI.java:221)
           [junit4]    > 	at org.apache.lucene.search.ConjunctionDISI.intersect(ConjunctionDISI.java:50)
           [junit4]    > 	at org.apache.lucene.search.spans.ConjunctionSpans.<init>(ConjunctionSpans.java:43)
           [junit4]    > 	at org.apache.lucene.search.spans.NearSpansOrdered.<init>(NearSpansOrdered.java:56)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight.getSpans(SpanNearQuery.java:223)
           [junit4]    > 	at org.apache.lucene.search.spans.SpanWeight.scorer(SpanWeight.java:134)
           [junit4]    > 	at org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
           [junit4]    > 	at org.apache.lucene.search.AssertingWeight.bulkScorer(AssertingWeight.java:69)
           [junit4]    > 	at org.apache.lucene.search.AssertingWeight.bulkScorer(AssertingWeight.java:69)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:667)
           [junit4]    > 	at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:92)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:474)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:593)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:451)
           [junit4]    > 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:462)
           [junit4]    > 	at org.apache.lucene.queryparser.complexPhrase.TestComplexPhraseQuery.checkMatches(TestComplexPhraseQuery.java:116)
           [junit4]    > 	at org.apache.lucene.queryparser.complexPhrase.TestComplexPhraseQuery.testComplexPhrases(TestComplexPhraseQuery.java:58)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene60): {role=PostingsFormat(name=Memory doPackFST= false), name=PostingsFormat(name=LuceneFixedGap), id=PostingsFormat(name=LuceneFixedGap)}, docValues:{}, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {role=DFR I(n)LZ(0.3), name=DFR I(n)3(800.0), id=DFR I(n)3(800.0)}, locale=in_ID, timezone=Africa/Banjul
           [junit4]   2> NOTE: Linux 3.13.0-68-generic amd64/Oracle Corporation 1.8.0_66-ea (64-bit)/cpus=8,threads=1,free=187449736,total=253231104
           [junit4]   2> NOTE: All tests run in this JVM: [TestNumericRangeQueryBuilder, TestExtendableQueryParser, TestSpanQueryParser, TestExtensions, TestComplexPhraseQuery]
           [junit4] Completed [11/27] on J2 in 0.50s, 5 tests, 1 failure <<< FAILURES!
        

        I haven't looked into it yet.

        Paul Elschot added a comment -

        The test failure reproduces here. I'll take a look, thanks.

        Paul Elschot added a comment -

        This failure disappeared after adding asTwoPhaseIterator() to ScoringWrapperSpans. I'll post a new patch later.

        Paul Elschot added a comment -

        Patch of 12 Nov 2015.
        Adds ScoringWrapperSpans.asTwoPhaseIterator().

        This missing method could be a bug in itself.
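
        For context, a hedged sketch of the kind of pass-through involved: a delegating Spans wrapper that, unlike the pre-patch ScoringWrapperSpans, forwards asTwoPhaseIterator() to the wrapped Spans so that its matchCost() is not hidden. The class and field names are invented here and the method list follows the 5.4 Spans API; the actual change is the one in the attached patch.

           // Illustrative delegating wrapper, not the committed code.
           import java.io.IOException;

           import org.apache.lucene.search.TwoPhaseIterator;
           import org.apache.lucene.search.spans.SpanCollector;
           import org.apache.lucene.search.spans.Spans;

           class DelegatingSpans extends Spans {
             final Spans in; // the wrapped Spans

             DelegatingSpans(Spans in) { this.in = in; }

             // The pass-through that was missing from ScoringWrapperSpans.
             @Override public TwoPhaseIterator asTwoPhaseIterator() { return in.asTwoPhaseIterator(); }

             @Override public float positionsCost() { return in.positionsCost(); }
             @Override public int nextStartPosition() throws IOException { return in.nextStartPosition(); }
             @Override public int startPosition() { return in.startPosition(); }
             @Override public int endPosition() { return in.endPosition(); }
             @Override public int width() { return in.width(); }
             @Override public void collect(SpanCollector collector) throws IOException { in.collect(collector); }
             @Override public int docID() { return in.docID(); }
             @Override public int nextDoc() throws IOException { return in.nextDoc(); }
             @Override public int advance(int target) throws IOException { return in.advance(target); }
             @Override public long cost() { return in.cost(); }
           }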

        ASF subversion and git services added a comment -

        Commit 1714261 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1714261 ]

        LUCENE-6276: Added TwoPhaseIterator.matchCost().

        ASF subversion and git services added a comment -

        Commit 1714266 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1714266 ]

        LUCENE-6276: Added TwoPhaseIterator.matchCost().

        Adrien Grand added a comment -

        I just committed the changes. Thanks Paul!


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            5
