Lucene - Core
  1. Lucene - Core
  2. LUCENE-6919

Change the Scorer API to expose an iterator instead of extending DocIdSetIterator

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was working on trying to address the performance regression on LUCENE-6815 but this is hard to do without introducing specialization of DisjunctionScorer which I'd like to avoid at all costs.

      I think the performance regression would be easy to address without specialization if Scorers were changed to return an iterator instead of extending DocIdSetIterator. So conceptually the API would move from

      class Scorer extends DocIdSetIterator {
      }
      

      to

      class Scorer {
        DocIdSetIterator iterator();
      }
      

      This would help me because then if none of the sub clauses support two-phase iteration, DisjunctionScorer could directly return the approximation as an iterator instead of having to check if twoPhase == null at every iteration.

      Such an approach could also help remove some method calls. For instance TermScorer.nextDoc calls PostingsEnum.nextDoc but with this change TermScorer.iterator() could return the PostingsEnum and TermScorer would not even appear in stack traces when scoring. I hacked a patch to see how much that would help and luceneutil seems to like the change:

                          TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                        Fuzzy1       88.54     (15.7%)       86.73     (16.6%)   -2.0% ( -29% -   35%)
                    AndHighLow      698.98      (4.1%)      691.11      (5.1%)   -1.1% (  -9% -    8%)
                        Fuzzy2       26.47     (11.2%)       26.28     (10.3%)   -0.7% ( -19% -   23%)
                   MedSpanNear      141.03      (3.3%)      140.51      (3.2%)   -0.4% (  -6% -    6%)
                    HighPhrase       60.66      (2.6%)       60.48      (3.3%)   -0.3% (  -5% -    5%)
                   LowSpanNear       29.25      (2.4%)       29.21      (2.1%)   -0.1% (  -4% -    4%)
                     MedPhrase       28.32      (1.9%)       28.28      (2.0%)   -0.1% (  -3% -    3%)
                     LowPhrase       17.31      (2.1%)       17.29      (2.6%)   -0.1% (  -4% -    4%)
              HighSloppyPhrase       10.93      (6.0%)       10.92      (6.0%)   -0.1% ( -11% -   12%)
               MedSloppyPhrase       72.21      (2.2%)       72.27      (1.8%)    0.1% (  -3% -    4%)
                       Respell       57.35      (3.2%)       57.41      (3.4%)    0.1% (  -6% -    6%)
                  HighSpanNear       26.71      (3.0%)       26.75      (2.5%)    0.1% (  -5% -    5%)
                  OrNotHighLow      803.46      (3.4%)      807.03      (4.2%)    0.4% (  -6% -    8%)
               LowSloppyPhrase       88.02      (3.4%)       88.77      (2.5%)    0.8% (  -4% -    7%)
                  OrNotHighMed      200.45      (2.7%)      203.83      (2.5%)    1.7% (  -3% -    7%)
                    OrHighHigh       38.98      (7.9%)       40.30      (6.6%)    3.4% ( -10% -   19%)
                      HighTerm       92.53      (5.3%)       95.94      (5.8%)    3.7% (  -7% -   15%)
                     OrHighMed       53.80      (7.7%)       55.79      (6.6%)    3.7% (  -9% -   19%)
                    AndHighMed      266.69      (1.7%)      277.15      (2.5%)    3.9% (   0% -    8%)
                       Prefix3       44.68      (5.4%)       46.60      (7.0%)    4.3% (  -7% -   17%)
                       MedTerm      261.52      (4.9%)      273.52      (5.4%)    4.6% (  -5% -   15%)
                      Wildcard       42.39      (6.1%)       44.35      (7.8%)    4.6% (  -8% -   19%)
                        IntNRQ       10.46      (7.0%)       10.99      (9.5%)    5.0% ( -10% -   23%)
                 OrNotHighHigh       67.15      (4.6%)       70.65      (4.5%)    5.2% (  -3% -   15%)
                 OrHighNotHigh       43.07      (5.1%)       45.36      (5.4%)    5.3% (  -4% -   16%)
                     OrHighLow       64.19      (6.4%)       67.72      (5.5%)    5.5% (  -6% -   18%)
                   AndHighHigh       64.17      (2.3%)       67.87      (2.1%)    5.8% (   1% -   10%)
                       LowTerm      642.94     (10.9%)      681.48      (8.5%)    6.0% ( -12% -   28%)
                  OrHighNotMed       12.68      (6.9%)       13.51      (6.6%)    6.5% (  -6% -   21%)
                  OrHighNotLow       54.69      (6.8%)       58.25      (7.0%)    6.5% (  -6% -   21%)
      
      1. LUCENE-6919.patch
        243 kB
        Adrien Grand
      2. LUCENE-6919.patch
        240 kB
        Adrien Grand
      3. LUCENE-6919.patch
        6 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is the (hacky) patch that I used for the benchmark.

        This would be a fairly large change, so I'd like to get feedback before trying to actually do it. If you don't like this new API, please let me know.

        Show
        Adrien Grand added a comment - Here is the (hacky) patch that I used for the benchmark. This would be a fairly large change, so I'd like to get feedback before trying to actually do it. If you don't like this new API, please let me know.
        Hide
        Ryan Ernst added a comment -

        +1 to the idea

        Show
        Ryan Ernst added a comment - +1 to the idea
        Hide
        David Smiley added a comment -

        +1 because it seems to perform better. And I could see it simplifying some scorer implementations that no longer need to delegate various methods to a DISI, like ValueSourceScorer.

        I think it'll be interesting to re-assess after you post the real patch – to see wether it made any other code more painful. I suppose both the proposed Scorer.iterator() and also TwoPhaseIterator.approximation() both return a live stateful reference; i.e. it's positioned and not from the beginning. The docs for both methods should state that to make it clear. And I suspect it may be useful to define a convenience method of Scorer.docID() as iterator().docID() since I think it's called a lot; but I may be wrong on that. Your call.

        Show
        David Smiley added a comment - +1 because it seems to perform better. And I could see it simplifying some scorer implementations that no longer need to delegate various methods to a DISI, like ValueSourceScorer. I think it'll be interesting to re-assess after you post the real patch – to see wether it made any other code more painful. I suppose both the proposed Scorer.iterator() and also TwoPhaseIterator.approximation() both return a live stateful reference; i.e. it's positioned and not from the beginning. The docs for both methods should state that to make it clear. And I suspect it may be useful to define a convenience method of Scorer.docID() as iterator().docID() since I think it's called a lot; but I may be wrong on that. Your call.
        Hide
        Adrien Grand added a comment -

        Good point about Scorer.docId(). I think it's also better to not require Collectors to go though the iterator since they are not supposed to move the iterator. Once this change is in, maybe in the future we can expose a smaller interface in Collector.setScorer (that only exposes a doc ID, a freq and a score, as suggested in LUCENE-6228).

        Here is a patch with the proposed changes. It makes some things slightly less complicated due to the additional level of indirection between a Weight and a DocIdSetIterator (eg. for delete-by-query) but it also makes some Scorer implementations simpler since they can now just return the iterator instead of reimplementing the whole DISI API and forwarding it to an existing DocIdSetIterator (for instance this is what conjunctions did).

        For consistency, Scorer.asTwoPhaseIterator has been renamed to Scorer.twoPhaseIterator. So now Scorer has Scorer.iterator() which is a required method and returns a DocIdSetIterator and Scorer.twoPhaseIterator which is an optional method and returns a TwoPhaseIterator.

        The luceneutil report is consistent with what I got with the hacky patch:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                    OrNotHighLow     1119.21      (3.9%)     1078.59      (5.3%)   -3.6% ( -12% -    5%)
                      AndHighLow      796.76      (4.9%)      774.43      (5.6%)   -2.8% ( -12% -    8%)
                 MedSloppyPhrase       58.28      (4.0%)       57.67      (4.9%)   -1.0% (  -9% -    8%)
                     LowSpanNear      122.71      (2.0%)      121.65      (2.8%)   -0.9% (  -5% -    4%)
                    HighSpanNear        3.25      (2.3%)        3.22      (2.0%)   -0.8% (  -4% -    3%)
                       LowPhrase      113.56      (1.5%)      112.73      (2.1%)   -0.7% (  -4% -    2%)
                     MedSpanNear      109.63      (2.4%)      109.15      (1.7%)   -0.4% (  -4% -    3%)
                HighSloppyPhrase        6.23      (5.0%)        6.22      (5.4%)   -0.2% ( -10% -   10%)
                    OrHighNotLow      102.42      (3.8%)      102.32      (3.5%)   -0.1% (  -7% -    7%)
                 LowSloppyPhrase       22.01      (2.3%)       22.01      (2.8%)    0.0% (  -4% -    5%)
                       MedPhrase       15.92      (1.6%)       15.94      (1.8%)    0.1% (  -3% -    3%)
                      HighPhrase       34.63      (3.1%)       34.75      (3.2%)    0.3% (  -5% -    6%)
                    OrNotHighMed      141.69      (3.3%)      142.75      (3.3%)    0.7% (  -5% -    7%)
                   OrNotHighHigh       50.74      (2.1%)       51.15      (2.7%)    0.8% (  -3% -    5%)
                         Respell       63.24      (3.2%)       64.05      (3.1%)    1.3% (  -4% -    7%)
                   OrHighNotHigh       42.37      (2.8%)       42.92      (3.0%)    1.3% (  -4% -    7%)
                    OrHighNotMed       80.74      (2.9%)       82.18      (2.9%)    1.8% (  -3% -    7%)
                         Prefix3      151.13      (4.7%)      155.37      (6.7%)    2.8% (  -8% -   14%)
                     AndHighHigh       36.96      (2.3%)       38.37      (2.3%)    3.8% (   0% -    8%)
                          Fuzzy1       25.95      (5.9%)       27.00      (5.7%)    4.0% (  -7% -   16%)
                       OrHighMed       50.05      (5.0%)       52.10      (5.7%)    4.1% (  -6% -   15%)
                      OrHighHigh       33.64      (5.2%)       35.16      (4.7%)    4.5% (  -5% -   15%)
                          IntNRQ       10.93      (6.9%)       11.46      (6.2%)    4.8% (  -7% -   19%)
                         MedTerm      179.51      (3.8%)      188.22      (3.9%)    4.9% (  -2% -   13%)
                       OrHighLow       79.55      (2.9%)       83.56      (2.8%)    5.0% (   0% -   11%)
                         LowTerm      682.13      (8.0%)      716.84      (6.4%)    5.1% (  -8% -   21%)
                      AndHighMed      114.21      (2.4%)      120.06      (2.4%)    5.1% (   0% -   10%)
                        Wildcard       29.31      (6.4%)       31.07      (5.8%)    6.0% (  -5% -   19%)
                        HighTerm      118.05      (3.5%)      125.83      (4.4%)    6.6% (  -1% -   14%)
                          Fuzzy2       61.23     (20.9%)       67.19     (21.3%)    9.7% ( -26% -   65%)
        
        Show
        Adrien Grand added a comment - Good point about Scorer.docId(). I think it's also better to not require Collectors to go though the iterator since they are not supposed to move the iterator. Once this change is in, maybe in the future we can expose a smaller interface in Collector.setScorer (that only exposes a doc ID, a freq and a score, as suggested in LUCENE-6228 ). Here is a patch with the proposed changes. It makes some things slightly less complicated due to the additional level of indirection between a Weight and a DocIdSetIterator (eg. for delete-by-query) but it also makes some Scorer implementations simpler since they can now just return the iterator instead of reimplementing the whole DISI API and forwarding it to an existing DocIdSetIterator (for instance this is what conjunctions did). For consistency, Scorer.asTwoPhaseIterator has been renamed to Scorer.twoPhaseIterator. So now Scorer has Scorer.iterator() which is a required method and returns a DocIdSetIterator and Scorer.twoPhaseIterator which is an optional method and returns a TwoPhaseIterator. The luceneutil report is consistent with what I got with the hacky patch: TaskQPS baseline StdDev QPS patch StdDev Pct diff OrNotHighLow 1119.21 (3.9%) 1078.59 (5.3%) -3.6% ( -12% - 5%) AndHighLow 796.76 (4.9%) 774.43 (5.6%) -2.8% ( -12% - 8%) MedSloppyPhrase 58.28 (4.0%) 57.67 (4.9%) -1.0% ( -9% - 8%) LowSpanNear 122.71 (2.0%) 121.65 (2.8%) -0.9% ( -5% - 4%) HighSpanNear 3.25 (2.3%) 3.22 (2.0%) -0.8% ( -4% - 3%) LowPhrase 113.56 (1.5%) 112.73 (2.1%) -0.7% ( -4% - 2%) MedSpanNear 109.63 (2.4%) 109.15 (1.7%) -0.4% ( -4% - 3%) HighSloppyPhrase 6.23 (5.0%) 6.22 (5.4%) -0.2% ( -10% - 10%) OrHighNotLow 102.42 (3.8%) 102.32 (3.5%) -0.1% ( -7% - 7%) LowSloppyPhrase 22.01 (2.3%) 22.01 (2.8%) 0.0% ( -4% - 5%) MedPhrase 15.92 (1.6%) 15.94 (1.8%) 0.1% ( -3% - 3%) HighPhrase 34.63 (3.1%) 34.75 (3.2%) 0.3% ( -5% - 6%) OrNotHighMed 141.69 (3.3%) 142.75 (3.3%) 0.7% ( -5% - 7%) OrNotHighHigh 50.74 (2.1%) 51.15 (2.7%) 0.8% ( -3% - 5%) Respell 63.24 (3.2%) 64.05 (3.1%) 1.3% ( -4% - 7%) OrHighNotHigh 42.37 (2.8%) 42.92 (3.0%) 1.3% ( -4% - 7%) OrHighNotMed 80.74 (2.9%) 82.18 (2.9%) 1.8% ( -3% - 7%) Prefix3 151.13 (4.7%) 155.37 (6.7%) 2.8% ( -8% - 14%) AndHighHigh 36.96 (2.3%) 38.37 (2.3%) 3.8% ( 0% - 8%) Fuzzy1 25.95 (5.9%) 27.00 (5.7%) 4.0% ( -7% - 16%) OrHighMed 50.05 (5.0%) 52.10 (5.7%) 4.1% ( -6% - 15%) OrHighHigh 33.64 (5.2%) 35.16 (4.7%) 4.5% ( -5% - 15%) IntNRQ 10.93 (6.9%) 11.46 (6.2%) 4.8% ( -7% - 19%) MedTerm 179.51 (3.8%) 188.22 (3.9%) 4.9% ( -2% - 13%) OrHighLow 79.55 (2.9%) 83.56 (2.8%) 5.0% ( 0% - 11%) LowTerm 682.13 (8.0%) 716.84 (6.4%) 5.1% ( -8% - 21%) AndHighMed 114.21 (2.4%) 120.06 (2.4%) 5.1% ( 0% - 10%) Wildcard 29.31 (6.4%) 31.07 (5.8%) 6.0% ( -5% - 19%) HighTerm 118.05 (3.5%) 125.83 (4.4%) 6.6% ( -1% - 14%) Fuzzy2 61.23 (20.9%) 67.19 (21.3%) 9.7% ( -26% - 65%)
        Hide
        Paul Elschot added a comment -

        Better performance, and about 70 lines less code in spite of splitting off SpanScorer from Spans.

        LGTM, except for one minor point: the javadocs of Scorer.docID() say that there is a default implementation, but the method is abstract there.

        Show
        Paul Elschot added a comment - Better performance, and about 70 lines less code in spite of splitting off SpanScorer from Spans. LGTM, except for one minor point: the javadocs of Scorer.docID() say that there is a default implementation, but the method is abstract there.
        Hide
        Alan Woodward added a comment -

        I think now that scoring has moved back out of Spans into a special SpanScorer, we don't need the docScorer in there anymore?

        Show
        Alan Woodward added a comment - I think now that scoring has moved back out of Spans into a special SpanScorer, we don't need the docScorer in there anymore?
        Hide
        Adrien Grand added a comment -

        Good catch Paul. Early iterations of this patch had a default impl that delegated to iterator().docId() but I figured out this could be a performance trap since some Scorer impls create an object in iterator() so I made it abstract instead. I fixed the docs.

        Good catch Alan as well, I could easily remove docScorer from Spans.

        Show
        Adrien Grand added a comment - Good catch Paul. Early iterations of this patch had a default impl that delegated to iterator().docId() but I figured out this could be a performance trap since some Scorer impls create an object in iterator() so I made it abstract instead. I fixed the docs. Good catch Alan as well, I could easily remove docScorer from Spans.
        Hide
        Paul Elschot added a comment -

        I think you reposted the previous patch...

        Show
        Paul Elschot added a comment - I think you reposted the previous patch...
        Hide
        Adrien Grand added a comment -

        Woops, here is the actual patch I wanted to upload.

        Show
        Adrien Grand added a comment - Woops, here is the actual patch I wanted to upload.
        Hide
        Paul Elschot added a comment -

        LGTM, nice simplification from the removal of docScorer from Spans.

        Show
        Paul Elschot added a comment - LGTM, nice simplification from the removal of docScorer from Spans.
        Hide
        ASF subversion and git services added a comment -

        Commit 1719081 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1719081 ]

        LUCENE-6919: Make Scorer expose an iterator instead of extending DocIdSetIterator.

        Show
        ASF subversion and git services added a comment - Commit 1719081 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1719081 ] LUCENE-6919 : Make Scorer expose an iterator instead of extending DocIdSetIterator.
        Hide
        ASF subversion and git services added a comment -

        Commit 1719124 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1719124 ]

        LUCENE-6919: Make Scorer expose an iterator instead of extending DocIdSetIterator.

        Show
        ASF subversion and git services added a comment - Commit 1719124 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1719124 ] LUCENE-6919 : Make Scorer expose an iterator instead of extending DocIdSetIterator.
        Hide
        ASF subversion and git services added a comment -

        Commit 1721433 from Christine Poerschke in branch 'dev/trunk'
        [ https://svn.apache.org/r1721433 ]

        LUCENE-6919: In TestConjunctionDISI.java removed now redundant cast to Scorer.

        Show
        ASF subversion and git services added a comment - Commit 1721433 from Christine Poerschke in branch 'dev/trunk' [ https://svn.apache.org/r1721433 ] LUCENE-6919 : In TestConjunctionDISI.java removed now redundant cast to Scorer.
        Hide
        Adrien Grand added a comment -

        Thanks Christine.

        Show
        Adrien Grand added a comment - Thanks Christine.
        Hide
        Christine Poerschke added a comment -

        No problem. I just randomly noticed the warning whilst compiling something unrelated.

        Show
        Christine Poerschke added a comment - No problem. I just randomly noticed the warning whilst compiling something unrelated.
        Hide
        ASF subversion and git services added a comment -

        Commit 1721436 from Christine Poerschke in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1721436 ]

        LUCENE-6919: In TestConjunctionDISI.java removed now redundant cast to Scorer. (merge in revision 1721433 from trunk)

        Show
        ASF subversion and git services added a comment - Commit 1721436 from Christine Poerschke in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1721436 ] LUCENE-6919 : In TestConjunctionDISI.java removed now redundant cast to Scorer. (merge in revision 1721433 from trunk)

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development