Lucene - Core
  1. Lucene - Core
  2. LUCENE-1252

Avoid using positions when not all required terms are present

    Details

    • Type: Wish Wish
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
    • Lucene Fields:
      New

      Description

      In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present.
      This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like "all required terms are present, but no position info was checked yet", and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans.

        Activity

        Hide
        Paul Elschot added a comment -

        LUCENE-2410 has solved this partially for PhraseQuery/PhraseScorer by computing only the first matching phrase to determine a possible match, and by delaying the computation of the remaining matches until score() is called.

        Show
        Paul Elschot added a comment - LUCENE-2410 has solved this partially for PhraseQuery/PhraseScorer by computing only the first matching phrase to determine a possible match, and by delaying the computation of the remaining matches until score() is called.
        Hide
        Michael McCandless added a comment -

        I agree, we would want to do a ConjunctionScorer first, w/ all 4 terms, and then 2nd a PhraseScorer for each of the two phrases, but somehow they should be bound together such that a single TermPositions enumerator is shared in the two places for each term.

        I think doing the additional filtering in collect is a little late – there could be a number of such "more expensive constraints" to apply, depending on the query.

        But, eg since we're talking about how to fix the up-front sort logic in ConjunctionScorer... you could imagine asking the PhraseQuery for its scorer, and getting back 2 AND'd scorers (cheap & expensive) that under-the-hood are sharing a single TermPositions enum, and then ConjunctionScorer would order all such scorers it got so that all cheap ones are checked first and only once they agree on a doc are the expensive scorers check.

        Show
        Michael McCandless added a comment - I agree, we would want to do a ConjunctionScorer first, w/ all 4 terms, and then 2nd a PhraseScorer for each of the two phrases, but somehow they should be bound together such that a single TermPositions enumerator is shared in the two places for each term. I think doing the additional filtering in collect is a little late – there could be a number of such "more expensive constraints" to apply, depending on the query. But, eg since we're talking about how to fix the up-front sort logic in ConjunctionScorer... you could imagine asking the PhraseQuery for its scorer, and getting back 2 AND'd scorers (cheap & expensive) that under-the-hood are sharing a single TermPositions enum, and then ConjunctionScorer would order all such scorers it got so that all cheap ones are checked first and only once they agree on a doc are the expensive scorers check.
        Hide
        Shai Erera added a comment -

        I must admit that I read the issue briefly, and so my proposal below may be irrelevant.

        But it sounds to me like for the above example we'd want to use ConjunctionScorer to first iterate on both Scorers (PhraseScorer?) until they agree, and only then call Collector.collect(), which will call Scorer.score() in return (if a scoring collector is used).

        Alternatively, we can implement a FilterCollector (if there isn't one already) which impls its collect(int doc) by first determining if a document is a match, and then call a given Collector, which will proceed with collecting and scoring.

        Show
        Shai Erera added a comment - I must admit that I read the issue briefly, and so my proposal below may be irrelevant. But it sounds to me like for the above example we'd want to use ConjunctionScorer to first iterate on both Scorers (PhraseScorer?) until they agree, and only then call Collector.collect(), which will call Scorer.score() in return (if a scoring collector is used). Alternatively, we can implement a FilterCollector (if there isn't one already) which impls its collect(int doc) by first determining if a document is a match, and then call a given Collector, which will proceed with collecting and scoring.
        Hide
        Michael McCandless added a comment -

        Here's a simple example that might drive this issue forward:

        +"h1n1 flu" +"united states"

        Ideally, to score this query, you'd want to first AND all 4 terms
        together, and only for docs matching that, consult the positions of
        each pair of terms.

        But we fail to do this today.

        It's like somehow "Weight.scorer()" needs to be able to return a
        "cheap" and an "expensive" scorer (which must be AND'd). I think
        PhraseQuery would somehow return cheap/expensive scorers that under
        the hood share the same SegmentTermDocs/Positions iterators, such that
        after cheap.next() has run, cheap.expensive only needs to "check the
        current doc". So in fact maybe the expensive scorer should not be a
        Scorer but some other simple "passes or doesn't" API.

        Or maybe it returns say a TwoStageScorer, which adds a
        "reallyPasses()" (needs better name) method to otherwise normal the
        Scorer (DISI) API.

        Or something else....

        Show
        Michael McCandless added a comment - Here's a simple example that might drive this issue forward: +"h1n1 flu" +"united states" Ideally, to score this query, you'd want to first AND all 4 terms together, and only for docs matching that, consult the positions of each pair of terms. But we fail to do this today. It's like somehow "Weight.scorer()" needs to be able to return a "cheap" and an "expensive" scorer (which must be AND'd). I think PhraseQuery would somehow return cheap/expensive scorers that under the hood share the same SegmentTermDocs/Positions iterators, such that after cheap.next() has run, cheap.expensive only needs to "check the current doc". So in fact maybe the expensive scorer should not be a Scorer but some other simple "passes or doesn't" API. Or maybe it returns say a TwoStageScorer, which adds a "reallyPasses()" (needs better name) method to otherwise normal the Scorer (DISI) API. Or something else....
        Hide
        Michael McCandless added a comment -

        Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score.

        I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks "AND"-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted.

        Show
        Michael McCandless added a comment - Right, I think this is more about determining whether a doc is a hit or not, than about how to compute its score. I think somehow the scorer needs to return 2 scorers that share the underlying iterators. The first scorer simply checks "AND"-ness with all other required terms, and only if the doc passes those are the positional/payloads/anything-else-expensive consulted.
        Hide
        Paul Elschot added a comment -

        There is no patch for now.

        HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions.

        Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap.

        Show
        Paul Elschot added a comment - There is no patch for now. HitCollectors should not be affected by this, as they would only be involved when a real match is found, and that, when position info is needed, necessarily involves the positions. Extending this with a cheap score brings another issue: should a cheap score be given for a document that might match, but in the end does not really match when positions are used? At the moment, I don't think so: score values are normally cheap to compute, but accessing positions is not cheap.
        Hide
        Jason Rutherglen added a comment -

        When flexible indexing goes in, users will be able to put data
        into the index that allow scorers to calculate a "cheap" score,
        collect, then go through and calculate a presumably more
        expensive score.

        Would it be good to implement this patch with this sort of more
        general framework in mind?

        It seems like this could affect the HitCollector API as we'd
        want a more generic way of representing scores than the
        primitive "float" we assume now. Aren't we rewriting the
        HitCollector APIs right now? Can we implement this change now?

        Show
        Jason Rutherglen added a comment - When flexible indexing goes in, users will be able to put data into the index that allow scorers to calculate a "cheap" score, collect, then go through and calculate a presumably more expensive score. Would it be good to implement this patch with this sort of more general framework in mind? It seems like this could affect the HitCollector API as we'd want a more generic way of representing scores than the primitive "float" we assume now. Aren't we rewriting the HitCollector APIs right now? Can we implement this change now?

          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development