Lucene - Core
  1. Lucene - Core
  2. LUCENE-2686

DisjunctionSumScorer should not call .score on sub scorers until consumer calls .score

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, 5.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from java-user thread "question about Scorer.freq()" from Koji...

      BooleanScorer2 uses DisjunctionSumScorer to score only-SHOULD-clause boolean queries.

      But, this scorer does too much work for collectors that never call .score, because it scores while it's matching. It should only call .score on the subs when the caller calls its .score.

      This also has the side effect of messing up advanced collectors that gather the freq() of the subs (using LUCENE-2590).

      1. LUCENE-2686.patch
        29 kB
        Michael McCandless
      2. LUCENE-2686.patch
        26 kB
        Michael McCandless
      3. Test2LUCENE2590.java
        6 kB
        Koji Sekiguchi

        Issue Links

          Activity

          Hide
          Paul Elschot added a comment -

          For the record, the reason for calling score() in the current way is to avoid the housekeeping of recording which scorer(s) actually matched.

          Instead one could perhaps walk the binary tree in the scorer queue heap and only add to the sum score and recurse into the tree when the current doc of the scorer equals the current doc. Maybe this can even done without recursive calls.

          Show
          Paul Elschot added a comment - For the record, the reason for calling score() in the current way is to avoid the housekeeping of recording which scorer(s) actually matched. Instead one could perhaps walk the binary tree in the scorer queue heap and only add to the sum score and recurse into the tree when the current doc of the scorer equals the current doc. Maybe this can even done without recursive calls.
          Hide
          Koji Sekiguchi added a comment -

          Thanks Mike for opening this!

          The attached program is what I want to do - I'd like to know which field a match occurred. TestSubScorerFreqs of LUCENE-2590 calls BooleanScorer2.freq() and it returns expected freq count. In my program, I get TermScorer from BooleanScorer2 via ScorerVisitor and try to call TermScorer.freq() in collect() method:

          public void collect(int doc) throws IOException {
            int freq = 0;
            for( TermQueryScorer tqs : tqsSet ){
              Scorer scorer = tqs.scorer;
              int matchId = scorer.docID();    // matchId isn't expected
              if( matchId == doc ){
                freq += scorer.freq();              // this line is never executed
              }
            }
            docCounts.put(doc + docBase, freq);
            collector.collect(doc);
          }
          

          but TermScorer.docID() returns unexpected id and TermScorer.freq() isn't executed (even if I remove "matchId == doc" condition, TermScorer.freq() returns unexpected number anyway).

          Show
          Koji Sekiguchi added a comment - Thanks Mike for opening this! The attached program is what I want to do - I'd like to know which field a match occurred. TestSubScorerFreqs of LUCENE-2590 calls BooleanScorer2.freq() and it returns expected freq count. In my program, I get TermScorer from BooleanScorer2 via ScorerVisitor and try to call TermScorer.freq() in collect() method: public void collect( int doc) throws IOException { int freq = 0; for ( TermQueryScorer tqs : tqsSet ){ Scorer scorer = tqs.scorer; int matchId = scorer.docID(); // matchId isn't expected if ( matchId == doc ){ freq += scorer.freq(); // this line is never executed } } docCounts.put(doc + docBase, freq); collector.collect(doc); } but TermScorer.docID() returns unexpected id and TermScorer.freq() isn't executed (even if I remove "matchId == doc" condition, TermScorer.freq() returns unexpected number anyway).
          Hide
          Michael McCandless added a comment -

          So... the good news is I made a new scorer (basically copied DisjunctionMaxScorer and then tweaked from there) that scores the OR-only case. All tests pass w/ this new scorer.

          And more good news is that if you don't score (I sort by doctitle to do that), you get a speedup – 7.7% in my simplistic test (prefix query unit*, expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force it to use BS2 not BS – the nocommits in the patch).

          But the bad news is with scoring on it's 22.7% slower!

          And, the weird news is, I discovered accidentally that BS2 is much (> 2X) faster for this one query. I think we need to modify the criteria that decides whether to use BS or BS2... maybe when there are lots of lowish-docFreq terms, BS2 is better?

          Show
          Michael McCandless added a comment - So... the good news is I made a new scorer (basically copied DisjunctionMaxScorer and then tweaked from there) that scores the OR-only case. All tests pass w/ this new scorer. And more good news is that if you don't score (I sort by doctitle to do that), you get a speedup – 7.7% in my simplistic test (prefix query unit*, expands to 988 terms, but I force it to do a scoring BQ rewrite, plus force it to use BS2 not BS – the nocommits in the patch). But the bad news is with scoring on it's 22.7% slower! And, the weird news is, I discovered accidentally that BS2 is much (> 2X) faster for this one query. I think we need to modify the criteria that decides whether to use BS or BS2... maybe when there are lots of lowish-docFreq terms, BS2 is better?
          Hide
          Michael McCandless added a comment -

          New patch attached.

          I added Koji's test as a unit test, that fails on trunk but passes now
          with the patch.

          The new scorer is definitely slower if you do want scoring, however,
          it's actually uncommon for this scorer to be used... because BQ will
          use BS when the query is all SHOULD clauses (plus up to 32 NOT).
          Only if there is also 1 or more MUST clauses will this scorer be used,
          or, if the collector does not support out-of-order scoring. I had to
          hack BQ to temporarily turn off BS to test this.

          Insanely, the constant score BQ rewrite does use this scorer, because
          when ConstantScorer invokes the sub-scorer it requires in-order
          scoring. So ConstantScoreQuery and constant score BQ rewrite for the
          MTQs will always see the speedup from this patch.

          So given that it's rare to actually use the scorer, I think ~8% gain
          as seen by "default" usage makes it worthwhile net/net.

          Longer term... we should probably change the Weight.scorer API so that
          you notify it up-front if you need to do scoring. This way
          we can specialize (manually or automatically) the best class, instead
          of waiting to see whether the .score() method is invoked per hit.

          Show
          Michael McCandless added a comment - New patch attached. I added Koji's test as a unit test, that fails on trunk but passes now with the patch. The new scorer is definitely slower if you do want scoring, however, it's actually uncommon for this scorer to be used... because BQ will use BS when the query is all SHOULD clauses (plus up to 32 NOT). Only if there is also 1 or more MUST clauses will this scorer be used, or, if the collector does not support out-of-order scoring. I had to hack BQ to temporarily turn off BS to test this. Insanely, the constant score BQ rewrite does use this scorer, because when ConstantScorer invokes the sub-scorer it requires in-order scoring. So ConstantScoreQuery and constant score BQ rewrite for the MTQs will always see the speedup from this patch. So given that it's rare to actually use the scorer, I think ~8% gain as seen by "default" usage makes it worthwhile net/net. Longer term... we should probably change the Weight.scorer API so that you notify it up-front if you need to do scoring. This way we can specialize (manually or automatically) the best class, instead of waiting to see whether the .score() method is invoked per hit.
          Hide
          Michael McCandless added a comment -

          I haven't yet committed this... I'm not happy w/ the performance hit to BQ queries w/ one or more MUST clauses...

          Show
          Michael McCandless added a comment - I haven't yet committed this... I'm not happy w/ the performance hit to BQ queries w/ one or more MUST clauses...
          Hide
          Michael McCandless added a comment -

          Pushing to 3.2... I'm not happy w/ the perf hit this patch causes.

          Show
          Michael McCandless added a comment - Pushing to 3.2... I'm not happy w/ the perf hit this patch causes.
          Hide
          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Show
          Robert Muir added a comment - bulk move 3.2 -> 3.3
          Hide
          Robert Muir added a comment -

          We could add a 'needsNavigation' or similar boolean to scorer, and always return your "BS3" in that case?

          if you want to navigate the scorer tree then you declare this boolean up front (somehow)

          Show
          Robert Muir added a comment - We could add a 'needsNavigation' or similar boolean to scorer, and always return your "BS3" in that case? if you want to navigate the scorer tree then you declare this boolean up front (somehow)
          Hide
          Michael McCandless added a comment -

          +1 that's a good idea!

          Then you are free to choose.

          Show
          Michael McCandless added a comment - +1 that's a good idea! Then you are free to choose.
          Hide
          Simon Willnauer added a comment -

          mike what is the status of this. are you going to work on this any time soon?

          Show
          Simon Willnauer added a comment - mike what is the status of this. are you going to work on this any time soon?
          Hide
          Michael McCandless added a comment -

          Alas I don't think so.... this need not block 3.5.0.

          Show
          Michael McCandless added a comment - Alas I don't think so.... this need not block 3.5.0.
          Hide
          Mikhail Khludnev added a comment -

          Hello,

          I used this patch not by applying directly, but introducing ShouldQuery in my codebase which extends BQ and provides "steady"child scorers for disjunction. It works great, but one tests are spinning in infinite loop. My amendment breaks possible infinite loop in constant query scorer:

          
          ConstantScoreQuery.ConstantScorer.score()
          
          @Override
           	protected boolean score(Collector collector, int max, int firstDocID) throws IOException {
           	if (docIdSetIterator instanceof Scorer) {
           	final boolean score = ((Scorer) docIdSetIterator).score(wrapCollector(collector), max, firstDocID);
           	
                   // let's break the loop
                   final boolean result = score && ( docIdSetIterator.docID() != NO_MORE_DOCS);
           	return result;
           	} else {
           	return super.score(collector, max, firstDocID);
           	}
           	}
          
          Show
          Mikhail Khludnev added a comment - Hello, I used this patch not by applying directly, but introducing ShouldQuery in my codebase which extends BQ and provides "steady"child scorers for disjunction. It works great, but one tests are spinning in infinite loop. My amendment breaks possible infinite loop in constant query scorer: ConstantScoreQuery.ConstantScorer.score() @Override protected boolean score(Collector collector, int max, int firstDocID) throws IOException { if (docIdSetIterator instanceof Scorer) { final boolean score = ((Scorer) docIdSetIterator).score(wrapCollector(collector), max, firstDocID); // let's break the loop final boolean result = score && ( docIdSetIterator.docID() != NO_MORE_DOCS); return result; } else { return super .score(collector, max, firstDocID); } }
          Hide
          LiLi added a comment -

          I test the speed of this one with old one.
          I used randomly generated term scorers and just call nextDoc() until NO_MORE_DOCS
          the new one is much slower than old one(I don't score the doc)

          Show
          LiLi added a comment - I test the speed of this one with old one. I used randomly generated term scorers and just call nextDoc() until NO_MORE_DOCS the new one is much slower than old one(I don't score the doc)
          Hide
          Michael McCandless added a comment -

          LiLi can you post the test case you created?

          Mikhail, can you post a patch with your ShouldQuery? I don't understand why you needed to modify CSQ.score to break the infinite loop...

          Show
          Michael McCandless added a comment - LiLi can you post the test case you created? Mikhail, can you post a patch with your ShouldQuery? I don't understand why you needed to modify CSQ.score to break the infinite loop...
          Hide
          Uwe Schindler added a comment -

          LUCENE-3505 has the same problem and fixes part of it.

          Show
          Uwe Schindler added a comment - LUCENE-3505 has the same problem and fixes part of it.
          Hide
          Uwe Schindler added a comment -

          By the way, DisjunctionSumScorer used in combination with custom collectors using 3.x-ScorerVisitors or trunk-getChildren also messes up score(), if the collector wants to get the scores of sub-scorers. So both freq() and score() is affected. The main problem is that DisjunctionSumScorer does not leave the subscorers on the actual doc id but forward them too early.

          We could add a 'needsNavigation' or similar boolean to scorer, and always return your "BS3" in that case?

          Wewould need that to make access to subscorers working for should clauses.

          Show
          Uwe Schindler added a comment - By the way, DisjunctionSumScorer used in combination with custom collectors using 3.x-ScorerVisitors or trunk-getChildren also messes up score(), if the collector wants to get the scores of sub-scorers. So both freq() and score() is affected. The main problem is that DisjunctionSumScorer does not leave the subscorers on the actual doc id but forward them too early. We could add a 'needsNavigation' or similar boolean to scorer, and always return your "BS3" in that case? Wewould need that to make access to subscorers working for should clauses.
          Hide
          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Show
          Hoss Man added a comment - bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment
          Hide
          Robert Muir added a comment -

          Fixed in LUCENE-3505.

          Show
          Robert Muir added a comment - Fixed in LUCENE-3505 .
          Hide
          Michael McCandless added a comment -

          Thanks Robert!

          Show
          Michael McCandless added a comment - Thanks Robert!

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              2 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development