Lucene - Core
  LUCENE-1252

Avoid using positions when not all required terms are present

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:

      Description

      In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present.
      This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like "all required terms are present, but no position info was checked yet", and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans.
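As a toy illustration of that relaxed postcondition (plain Java over sorted int arrays — not Lucene code, and all names here are hypothetical): phase 1 only intersects doc-ID lists, so all required terms are known to co-occur, and phase 2 consults positions for the surviving docs only.

```java
import java.util.Arrays;

/**
 * Toy sketch of two-phase matching: docsWithAllTerms() is the relaxed
 * "all required terms present" step, positionsMatch() is the deferred
 * position check (here: an exact two-term phrase, i.e. adjacent positions).
 */
class TwoPhaseSketch {
    /** Phase 1: intersect two sorted doc-ID lists; no positions touched. */
    static int[] docsWithAllTerms(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    /** Phase 2: for one doc, does a position of term A directly precede one of term B? */
    static boolean positionsMatch(int[] posA, int[] posB) {
        for (int p : posA) {
            if (Arrays.binarySearch(posB, p + 1) >= 0) return true;
        }
        return false;
    }
}
```

Only docs returned by phase 1 ever pay the cost of phase 2, which is exactly the saving this issue asks for.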

        Activity

        Paul Elschot added a comment -

        LUCENE-6198 and followups do this.

        Paul Elschot added a comment -

        Thinking out loud:

        How about adding a method toDocWithAllRequiredTerms(), and a method toPositionMatch() that may move to a later doc?

        These two together should do what nextDoc() does now.

        Robert Muir added a comment -

        Both of those are covered. Yes, it's proposed as a first-class citizen (on DISI), more first-class than your "second-class" rewrite() solution.

        It might be good to already look at what BQ and co do with query execution today. The fact is, rewrite() is inappropriate.

        The main reason rewrite() is second-class here is that scoring is per-segment in Lucene. In trunk today this means:

        • Filters might get executed with a completely different "plan" for different segments (conjunction versus Bits->acceptDocs).
        • Scorers might get returned as null for some segments (e.g. the term doesn't exist). BQ doesn't have special null handling in every subscorer it uses; that would be horrible. Instead it eliminates such scorers immediately in BooleanWeight's constructor.
        • This, like the above, can impact the structure of what BQ does (e.g. for A OR B, if B is null, it just returns A instead of a DisjunctionScorer).
        • Scorers have a cost() API, which tells us additional critical information for execution. I've proposed expanding this much further (additional flags/methods) so that BQ can be a quarterback and not the entire football team.

        Another reason rewrite() is wrong here is that such two-phase execution is only useful for conjunctions (or conjunction-like queries such as MinShouldMatch). Otherwise, it's useless. Rewriting to two queries when the query is not a conjunction will not help performance.

        At the low level, I can't see a rewrite() solution being efficient. Today ExactPhraseScorer is already coded as:

        while (iterate approximation...) {
          confirm(); // phraseFreq > 0
        }
        

        So in this case, it's already ready to be exposed for two-phase execution.

        On the other hand, with rewrite(), it would either be two separate D&PEnums, or a mess?

        Finally, by making this generic on DISI, this "approximation" becomes much more flexible. For example, a postings format could implement it for its doc enums if it is able to implement a cheap approximation... perhaps via skip-list-like data, or perhaps with an explicit "approximate" cache mechanism specifically for this purpose.

        Another example: a filter like a geographic distance filter could return a bounding box here automatically, rather than forcing the user to deal with this themselves with complex query/filter hacks. At the same time, such a filter could signal that it has a linear-time nextDoc(), so BooleanQuery can really do the best execution. A rewrite() solution makes this hard, because the two things would then be completely separate queries. But if you look at all the cases of executing this logic, it's very useful for BQ to know that A is a superset of B.
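A minimal sketch of that bounding-box approximation (hypothetical class, flat-plane geometry for brevity — real geo filters would use proper great-circle math):

```java
/**
 * Hypothetical two-phase distance filter: a cheap bounding-box superset
 * check, plus an exact (here: Euclidean, flat-plane) confirmation.
 * Illustrative only — not a Lucene API.
 */
class DistanceFilterSketch {
    final double lat, lon, radius;

    DistanceFilterSketch(double lat, double lon, double radius) {
        this.lat = lat; this.lon = lon; this.radius = radius;
    }

    /** Cheap phase: axis-aligned bounding box; may return false positives. */
    boolean approximateMatch(double docLat, double docLon) {
        return Math.abs(docLat - lat) <= radius && Math.abs(docLon - lon) <= radius;
    }

    /** Costly phase: exact distance check, run only for box survivors. */
    boolean confirmMatch(double docLat, double docLon) {
        double dy = docLat - lat, dx = docLon - lon;
        return dy * dy + dx * dx <= radius * radius;
    }
}
```

The box is a strict superset of the circle, which is what lets BQ run the cheap check first and the exact check last.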

        Michael McCandless added a comment -

        > I was fairly specific about how to tackle the problem... I don't mean to be rude, but did you actually read my proposal?

        Yes, you were specific, and yes, I did in fact read the proposal,
        several times, and as I said, I like it, but it's a complex problem
        that I've also spent time thinking about and so I gave some simple
        feedback about an alternative to one part of the proposal:

        It just seems strange to me, when Lucene already has this notion of
        "boiling down complex queries into their low-level concrete equivalent
        queries" (i.e. rewrite) to then add a separate API ("two-phase
        execution") for doing what looks to me to be essentially the same
        thing. Two-phase execution is just rewriting to a two-clause
        conjunction where the costs of the clauses are wildly different,
        opening up the freedom for BQ to execute more efficiently. And,
        yes, like rewrite, I think it should "recurse".

        I also think BQ should handle the "costly nextDoc" filters (i.e. do
        them last) as first class citizens, and it seemed like the proposal
        suggested otherwise (maybe I misunderstood this part of the proposal).

        Robert Muir added a comment -

        I was fairly specific about how to tackle the problem... I don't mean to be rude, but did you actually read my proposal?

        Michael McCandless added a comment -

        +1 to this plan.

        Except if possible, I think "high cost filters" really should be first class in Lucene? I think the "two-phase execution" is in fact the same problem?

        I.e., a PhraseQuery should just rewrite itself into a conjunction of two clauses: the "pure conjunction" of the terms, which is fast to execute, AND'd with a high-cost filter (that loads positions and checks that the terms are actually a phrase). If BQ knew how to juggle such costly clauses then this issue (AND'ing a PhraseQuery with other simple clauses) and other example like geo filtering (which could rewrite into a cheap bbox AND'd w/ slow true distance check) should work well.

        But if somehow making high cost filters first class is too much, then I agree: let's punt, here. It just seems to me that it's the same issue as two-phase execution...

        Robert Muir added a comment -

        Reopening for discussion.

        Robert Muir added a comment -

        I think we should keep the issue open; I know I've been thinking about this one a lot lately.

        The thing I see is, for it to work really nicely, BooleanQuery really needs to own execution of both queries and filters.

        Some kind of blurry proposal/plan like this:

        Execute filters by BooleanQuery instead of its mini-me (FilteredQuery), e.g. as an additional type of BooleanClause.

        Merge Filter and Weight, in some way that makes sense, e.g. maybe just make Weight.scorer(LeafReaderContext context, Bits acceptDocs) a covariant-return override of Filter.getDocIdSet(LeafReaderContext context, Bits acceptDocs). Make sure any "wrappers" like ConstantScore delegate any new APIs correctly.

        Add bulk methods like and/or/not to Filter such that optimized impls like FixedBitSet.and() can be used. Since Java 7u40 these get autovectorized by HotSpot and are a valid strategy. I think some of these could be optimized by sparse bitset impls as well.

        Create an enhanced cost metric/execution API for filters. BooleanQuery needs this additional context to give the most efficient execution. At the least, it should have the information to know to do the bulk optimizations above, and even apply deletes this way if it's appropriate (in Lucene 5 deleted docs are a FixedBitSet). I would also want a way to indicate that a Filter has a linear-time nextDoc(). These cases (e.g. filtering by exact geographic distance) are horrible to support, but handling them correctly (e.g. in a final phase) is a lesser evil than having the API be crazy so that systems like Solr/ES can do them with hacks.

        Remove stuff like FilteredQuery, BooleanFilter, etc.

        Fix LUCENE-3331 (or implement it in some other way), such that "scores are not needed" is passed down the query execution stack. The tricky part is that BQ's "execution plan" currently lives in two places, rewrite() and Weight.scorer(). And I really think it needs the freedom to completely restructure queries for performance (across nested BQs as well). Another option is to set up internal infrastructure so BooleanWeight.scorer() can do this, as it has cost() knowledge too, but it feels so wrong.

        Finally, we should add some support for "two-phase execution" via DISI.getSuperSet() or some other approximation. ConjunctionScorer could both use this method (when at least one sub supports it) and implement it (when e.g. coord scoring prevents optimal restructuring and it's nested) for faster AND/filtering of phrase/sloppy/spans/whatever, or for any other custom query/filter that supports a fast approximation.
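The bulk and/or/not step in the plan above can be sketched as a word-at-a-time loop over the backing long[] words — the simple loop shape HotSpot can autovectorize. A simplified illustration, not Lucene's FixedBitSet:

```java
/** Simplified word-at-a-time AND of two bit sets stored as long[] words. */
class BitSetAndSketch {
    /** dest &= other, in place; dest words beyond other's length are cleared. */
    static void and(long[] dest, long[] other) {
        int n = Math.min(dest.length, other.length);
        for (int i = 0; i < n; i++) {
            dest[i] &= other[i]; // tight, branch-free loop: vectorization-friendly
        }
        java.util.Arrays.fill(dest, n, dest.length, 0L);
    }
}
```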

        Paul Elschot added a comment -

        Not enough interest.

        Paul Elschot added a comment -

        LUCENE-2410 has solved this partially for PhraseQuery/PhraseScorer by computing only the first matching phrase to determine a possible match, and by delaying the computation of the remaining matches until score() is called.

        Michael McCandless added a comment -

        I agree, we would want to do a ConjunctionScorer first, w/ all 4 terms, and then 2nd a PhraseScorer for each of the two phrases, but somehow they should be bound together such that a single TermPositions enumerator is shared in the two places for each term.

        I think doing the additional filtering in collect is a little late – there could be a number of such "more expensive constraints" to apply, depending on the query.

        But, e.g., since we're talking about how to fix the up-front sort logic in ConjunctionScorer... you could imagine asking the PhraseQuery for its scorer, and getting back 2 AND'd scorers (cheap & expensive) that under the hood share a single TermPositions enum; ConjunctionScorer would then order all such scorers it got so that all cheap ones are checked first, and only once they agree on a doc are the expensive scorers checked.
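The ordering described above — cheap scorers vetoing a doc before any expensive scorer runs — can be sketched like this (hypothetical names; the long cost stands in for what Scorer.cost() would report):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Checks a doc against per-clause predicates, cheapest cost first. */
class CostOrderedConjunction {
    static class Clause {
        final long cost;          // stand-in for Scorer.cost()
        final IntPredicate check; // does this clause match the doc?
        Clause(long cost, IntPredicate check) { this.cost = cost; this.check = check; }
    }

    /** True only if every clause matches; expensive clauses run last (or never). */
    static boolean matches(int doc, List<Clause> clauses) {
        List<Clause> ordered = new ArrayList<>(clauses);
        ordered.sort(Comparator.comparingLong(c -> c.cost));
        for (Clause c : ordered) {
            if (!c.check.test(doc)) return false; // a cheap veto skips costly checks
        }
        return true;
    }
}
```

When a cheap clause rejects the doc, the expensive (position-consulting) clause is never evaluated at all.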

        Shai Erera added a comment -

        I must admit that I read the issue briefly, and so my proposal below may be irrelevant.

        But it sounds to me like for the above example we'd want to use ConjunctionScorer to first iterate on both Scorers (PhraseScorer?) until they agree, and only then call Collector.collect(), which will call Scorer.score() in return (if a scoring collector is used).

        Alternatively, we can implement a FilterCollector (if there isn't one already) which implements its collect(int doc) by first determining whether a document is a match, and then calls a given Collector, which will proceed with collecting and scoring.

        Michael McCandless added a comment -

        Here's a simple example that might drive this issue forward:

        +"h1n1 flu" +"united states"

        Ideally, to score this query, you'd want to first AND all 4 terms
        together, and only for docs matching that, consult the positions of
        each pair of terms.

        But we fail to do this today.

        It's like somehow "Weight.scorer()" needs to be able to return a
        "cheap" and an "expensive" scorer (which must be AND'd). I think
        PhraseQuery would somehow return cheap/expensive scorers that under
        the hood share the same SegmentTermDocs/Positions iterators, such that
        after cheap.next() has run, the expensive scorer only needs to "check the
        current doc". So in fact maybe the expensive scorer should not be a
        Scorer but some other simple "passes or doesn't" API.

        Or maybe it returns, say, a TwoStageScorer, which adds a
        "reallyPasses()" (needs a better name) method to the otherwise
        normal Scorer (DISI) API.

        Or something else....
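That TwoStageScorer shape might look roughly like this (hypothetical names; NO_MORE_DOCS mirrors the DISI sentinel):

```java
import java.io.IOException;

/**
 * Hypothetical sketch of the "TwoStageScorer" idea: a cheap nextDoc() whose
 * candidates still need per-doc confirmation via reallyPasses().
 * Illustrative only — not a Lucene API.
 */
abstract class TwoStageScorerSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cheap phase: advance to the next candidate doc (positions unchecked). */
    abstract int nextDoc() throws IOException;

    /** Costly phase: confirm the current candidate, e.g. by checking positions. */
    abstract boolean reallyPasses() throws IOException;

    /** Driving both phases together is equivalent to a strict nextDoc(). */
    int nextMatchingDoc() throws IOException {
        for (int doc = nextDoc(); doc != NO_MORE_DOCS; doc = nextDoc()) {
            if (reallyPasses()) return doc;
        }
        return NO_MORE_DOCS;
    }
}
```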

        Michael McCandless added a comment -

        Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score.

        I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks "AND"-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted.

        Paul Elschot added a comment -

        There is no patch for now.

        HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions.

        Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap.

        Jason Rutherglen added a comment -

        When flexible indexing goes in, users will be able to put data
        into the index that allows scorers to calculate a "cheap" score,
        collect, then go through and calculate a presumably more
        expensive score.

        Would it be good to implement this patch with this sort of more
        general framework in mind?

        It seems like this could affect the HitCollector API as we'd
        want a more generic way of representing scores than the
        primitive "float" we assume now. Aren't we rewriting the
        HitCollector APIs right now? Can we implement this change now?


  People

  • Assignee: Unassigned
  • Reporter: Paul Elschot
  • Votes: 0
  • Watchers: 5
