Lucene - Core
  1. Lucene - Core
  2. LUCENE-6184

BooleanScorer should better deal with sparse clauses

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The way that BooleanScorer works looks like this:

      for each (window of 2048 docs) {
        for each (optional scorer) {
          scorer.score(window)
        }
      }
      

      This is not efficient for very sparse clauses (doc freq much lower than maxDoc/2048) since we keep on scoring windows of documents that do not match anything. BooleanScorer2 currently performs better in those cases.

      1. LUCENE-6184.patch
        28 kB
        Adrien Grand
      2. LUCENE-6184.patch
        16 kB
        Adrien Grand
      3. LUCENE-6184.patch
        16 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        Here is a patch:

        • BulkScorer now returns a hint on the next matching doc after max
        • BooleanScorer uses this information in order to only score windows of documents where at least one clause matches (by putting the bulk scorers into a priority queue)

        This helps boolean queries with dense clauses since this helped remove the hasMatches optimization which helps not iterate over the bit set if there are no matches but had the drawback of making OrCollector.collect
        heavier.

        And this helps boolean queries with very sparse clauses since they now only collect windows where they have matches.

        Here is the result of the luceneutil benchmark on the 10M wikipedia corpus. I added some tasks to test sparse clauses: VeryLow is for term queries that have a doc freq between 400 and 500, and "VeryLowVeryLow" is a disjunction of 2 such terms:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                HighSloppyPhrase       32.70      (4.3%)       32.39      (4.0%)   -1.0% (  -8% -    7%)
                         Prefix3      162.73      (5.8%)      161.32      (6.6%)   -0.9% ( -12% -   12%)
                         LowTerm      803.22      (6.2%)      797.47      (6.2%)   -0.7% ( -12% -   12%)
                          IntNRQ       13.84      (6.9%)       13.75      (7.3%)   -0.7% ( -13% -   14%)
                    OrHighNotLow       60.36      (2.7%)       59.96      (3.9%)   -0.7% (  -7% -    6%)
                 LowSloppyPhrase       17.94      (3.0%)       17.82      (2.8%)   -0.7% (  -6% -    5%)
                         VeryLow     6095.14      (5.8%)     6057.73      (5.0%)   -0.6% ( -10% -   10%)
                       LowPhrase      276.59      (2.2%)      274.97      (1.6%)   -0.6% (  -4% -    3%)
                    OrHighNotMed       43.56      (2.6%)       43.32      (3.3%)   -0.6% (  -6% -    5%)
                    OrNotHighLow      924.37      (2.5%)      919.21      (2.4%)   -0.6% (  -5% -    4%)
                      AndHighLow      703.38      (2.9%)      699.62      (3.6%)   -0.5% (  -6% -    6%)
                        Wildcard       93.74      (3.1%)       93.29      (3.0%)   -0.5% (  -6% -    5%)
                 MedSloppyPhrase       79.24      (2.8%)       78.91      (2.3%)   -0.4% (  -5% -    4%)
                    OrNotHighMed      207.14      (2.0%)      206.31      (2.2%)   -0.4% (  -4% -    3%)
                    HighSpanNear       12.56      (0.9%)       12.53      (1.1%)   -0.2% (  -2% -    1%)
                      HighPhrase       13.58      (2.3%)       13.55      (2.1%)   -0.2% (  -4% -    4%)
                   OrHighNotHigh       33.29      (1.6%)       33.24      (2.0%)   -0.2% (  -3% -    3%)
                   OrNotHighHigh       56.10      (1.6%)       56.00      (1.8%)   -0.2% (  -3% -    3%)
                        HighTerm       91.52      (2.6%)       91.37      (2.7%)   -0.2% (  -5% -    5%)
                         Respell       71.63      (5.5%)       71.52      (5.3%)   -0.1% ( -10% -   11%)
                     LowSpanNear       18.17      (1.0%)       18.16      (0.8%)   -0.1% (  -1% -    1%)
                         MedTerm      146.69      (2.5%)      146.56      (3.0%)   -0.1% (  -5% -    5%)
                      AndHighMed      274.22      (2.6%)      274.00      (2.3%)   -0.1% (  -4% -    4%)
                     MedSpanNear       31.01      (0.9%)       31.00      (1.1%)   -0.0% (  -1% -    1%)
                     AndHighHigh       77.34      (1.8%)       77.32      (1.7%)   -0.0% (  -3% -    3%)
                       MedPhrase       19.10      (6.2%)       19.10      (6.2%)    0.0% ( -11% -   13%)
                          Fuzzy2       26.84      (6.8%)       26.88      (7.6%)    0.1% ( -13% -   15%)
                        PKLookup      272.91      (3.1%)      274.16      (2.7%)    0.5% (  -5% -    6%)
                       OrHighMed       59.25     (11.8%)       62.90      (6.5%)    6.2% ( -10% -   27%)
                       OrHighLow       64.54     (11.9%)       68.73      (6.5%)    6.5% ( -10% -   28%)
                      OrHighHigh       42.89     (12.2%)       45.77      (6.9%)    6.7% ( -11% -   29%)
                          Fuzzy1       95.20      (4.2%)      101.65      (5.9%)    6.8% (  -3% -   17%)
                  VeryLowVeryLow     1936.31      (3.2%)     2263.44      (3.3%)   16.9% (  10% -   24%)
        
        Show
        Adrien Grand added a comment - Here is a patch: BulkScorer now returns a hint on the next matching doc after max BooleanScorer uses this information in order to only score windows of documents where at least one clause matches (by putting the bulk scorers into a priority queue) This helps boolean queries with dense clauses since this helped remove the hasMatches optimization which helps not iterate over the bit set if there are no matches but had the drawback of making OrCollector.collect heavier. And this helps boolean queries with very sparse clauses since they now only collect windows where they have matches. Here is the result of the luceneutil benchmark on the 10M wikipedia corpus. I added some tasks to test sparse clauses: VeryLow is for term queries that have a doc freq between 400 and 500, and "VeryLowVeryLow" is a disjunction of 2 such terms: TaskQPS baseline StdDev QPS patch StdDev Pct diff HighSloppyPhrase 32.70 (4.3%) 32.39 (4.0%) -1.0% ( -8% - 7%) Prefix3 162.73 (5.8%) 161.32 (6.6%) -0.9% ( -12% - 12%) LowTerm 803.22 (6.2%) 797.47 (6.2%) -0.7% ( -12% - 12%) IntNRQ 13.84 (6.9%) 13.75 (7.3%) -0.7% ( -13% - 14%) OrHighNotLow 60.36 (2.7%) 59.96 (3.9%) -0.7% ( -7% - 6%) LowSloppyPhrase 17.94 (3.0%) 17.82 (2.8%) -0.7% ( -6% - 5%) VeryLow 6095.14 (5.8%) 6057.73 (5.0%) -0.6% ( -10% - 10%) LowPhrase 276.59 (2.2%) 274.97 (1.6%) -0.6% ( -4% - 3%) OrHighNotMed 43.56 (2.6%) 43.32 (3.3%) -0.6% ( -6% - 5%) OrNotHighLow 924.37 (2.5%) 919.21 (2.4%) -0.6% ( -5% - 4%) AndHighLow 703.38 (2.9%) 699.62 (3.6%) -0.5% ( -6% - 6%) Wildcard 93.74 (3.1%) 93.29 (3.0%) -0.5% ( -6% - 5%) MedSloppyPhrase 79.24 (2.8%) 78.91 (2.3%) -0.4% ( -5% - 4%) OrNotHighMed 207.14 (2.0%) 206.31 (2.2%) -0.4% ( -4% - 3%) HighSpanNear 12.56 (0.9%) 12.53 (1.1%) -0.2% ( -2% - 1%) HighPhrase 13.58 (2.3%) 13.55 (2.1%) -0.2% ( -4% - 4%) OrHighNotHigh 33.29 (1.6%) 33.24 (2.0%) -0.2% ( -3% - 3%) OrNotHighHigh 56.10 (1.6%) 56.00 (1.8%) -0.2% ( -3% - 3%) HighTerm 91.52 (2.6%) 91.37 (2.7%) -0.2% ( -5% - 5%) Respell 71.63 (5.5%) 71.52 (5.3%) -0.1% ( -10% - 11%) LowSpanNear 18.17 (1.0%) 18.16 (0.8%) -0.1% ( -1% - 1%) MedTerm 146.69 (2.5%) 146.56 (3.0%) -0.1% ( -5% - 5%) AndHighMed 274.22 (2.6%) 274.00 (2.3%) -0.1% ( -4% - 4%) MedSpanNear 31.01 (0.9%) 31.00 (1.1%) -0.0% ( -1% - 1%) AndHighHigh 77.34 (1.8%) 77.32 (1.7%) -0.0% ( -3% - 3%) MedPhrase 19.10 (6.2%) 19.10 (6.2%) 0.0% ( -11% - 13%) Fuzzy2 26.84 (6.8%) 26.88 (7.6%) 0.1% ( -13% - 15%) PKLookup 272.91 (3.1%) 274.16 (2.7%) 0.5% ( -5% - 6%) OrHighMed 59.25 (11.8%) 62.90 (6.5%) 6.2% ( -10% - 27%) OrHighLow 64.54 (11.9%) 68.73 (6.5%) 6.5% ( -10% - 28%) OrHighHigh 42.89 (12.2%) 45.77 (6.9%) 6.7% ( -11% - 29%) Fuzzy1 95.20 (4.2%) 101.65 (5.9%) 6.8% ( -3% - 17%) VeryLowVeryLow 1936.31 (3.2%) 2263.44 (3.3%) 16.9% ( 10% - 24%)
        Hide
        Adrien Grand added a comment -

        The reason why Fuzzy1 and Fuzzy2 are faster too is that they rewrite to boolean queries by default, so this optimization helps them too.

        Show
        Adrien Grand added a comment - The reason why Fuzzy1 and Fuzzy2 are faster too is that they rewrite to boolean queries by default, so this optimization helps them too.
        Hide
        Robert Muir added a comment -

        The trickiest part here are the new semantics of the return value. But FilteredQuery has a code comment that maybe should be moved to BulkScorer's docs to help elaborate (there is a small typo too in the comment).

        Can this lead to a better minshouldmatch impl for booleanscorer? I'm just as happy with it removed too, but i know a while ago we benchmarked that BS1 can still be faster for that query, so its just a possibility. Maybe it should just stay as a pure disjunction scorer.

        Show
        Robert Muir added a comment - The trickiest part here are the new semantics of the return value. But FilteredQuery has a code comment that maybe should be moved to BulkScorer's docs to help elaborate (there is a small typo too in the comment). Can this lead to a better minshouldmatch impl for booleanscorer? I'm just as happy with it removed too, but i know a while ago we benchmarked that BS1 can still be faster for that query, so its just a possibility. Maybe it should just stay as a pure disjunction scorer.
        Hide
        Adrien Grand added a comment -

        Updated patch with better documentation of the semantics of BulkScorer.score.

        > Can this lead to a better minshouldmatch impl for booleanscorer?

        I don't think it would work in the general case yet. This change is useful to skip over large numbers of non-matching documents, but it still calls nextDoc() all the time, not advance() so I think BS2 is still a better option for now?

        Show
        Adrien Grand added a comment - Updated patch with better documentation of the semantics of BulkScorer.score. > Can this lead to a better minshouldmatch impl for booleanscorer? I don't think it would work in the general case yet. This change is useful to skip over large numbers of non-matching documents, but it still calls nextDoc() all the time, not advance() so I think BS2 is still a better option for now?
        Hide
        Robert Muir added a comment -

        Right but with the PQ in place, we could consider adding support, maybe by adding 'min' to score() to specify the start of the range. maybe its also a way to remove this crazy logic the default impl:

                int doc = scorer.docID();
                if (doc < 0) {
                  doc = scorer.nextDoc();
                }
        
        Show
        Robert Muir added a comment - Right but with the PQ in place, we could consider adding support, maybe by adding 'min' to score() to specify the start of the range. maybe its also a way to remove this crazy logic the default impl: int doc = scorer.docID(); if (doc < 0) { doc = scorer.nextDoc(); }
        Hide
        Adrien Grand added a comment -

        Yes, I think this way we could handle disjunctions in BooleanScorer (including minShouldMatch)! But this should be a separate issue?

        Show
        Adrien Grand added a comment - Yes, I think this way we could handle disjunctions in BooleanScorer (including minShouldMatch)! But this should be a separate issue?
        Hide
        Robert Muir added a comment -

        yeah, its just a future idea, maybe relevant to the API change of returning int instead of boolean.

        Show
        Robert Muir added a comment - yeah, its just a future idea, maybe relevant to the API change of returning int instead of boolean.
        Hide
        Adrien Grand added a comment -

        Same patch, just adding the suggested API in order to make BulkScorer able to skip. Results of the luceneutil benchmark still look similar:

                      AndHighLow      883.42      (3.5%)      872.51      (3.3%)   -1.2% (  -7% -    5%)
                    OrNotHighLow     1052.93      (4.4%)     1048.44      (4.5%)   -0.4% (  -8% -    8%)
                        PKLookup      277.07      (2.0%)      276.65      (2.1%)   -0.2% (  -4% -    4%)
                      AndHighMed      137.40      (1.9%)      137.30      (2.4%)   -0.1% (  -4% -    4%)
                    HighSpanNear       34.67      (3.1%)       34.65      (3.0%)   -0.0% (  -5% -    6%)
                 LowSloppyPhrase      215.69      (2.5%)      215.61      (2.5%)   -0.0% (  -4% -    5%)
                 MedSloppyPhrase      183.08      (2.5%)      183.11      (2.0%)    0.0% (  -4% -    4%)
                      HighPhrase       26.33      (6.8%)       26.34      (6.8%)    0.0% ( -12% -   14%)
                     AndHighHigh       51.61      (1.8%)       51.64      (2.0%)    0.0% (  -3% -    3%)
                       LowPhrase       74.61      (1.3%)       74.68      (1.4%)    0.1% (  -2% -    2%)
                HighSloppyPhrase       14.94      (5.7%)       14.97      (5.0%)    0.2% (  -9% -   11%)
                       MedPhrase       31.42      (1.1%)       31.47      (1.1%)    0.2% (  -1% -    2%)
                     LowSpanNear       55.89      (2.5%)       56.00      (2.5%)    0.2% (  -4% -    5%)
                         Respell       73.38      (2.4%)       73.54      (2.2%)    0.2% (  -4% -    4%)
                    OrNotHighMed      118.20      (1.6%)      118.66      (1.7%)    0.4% (  -2% -    3%)
                     MedSpanNear       78.17      (3.2%)       78.62      (3.5%)    0.6% (  -5% -    7%)
                   OrHighNotHigh       31.47      (1.8%)       31.66      (1.9%)    0.6% (  -2% -    4%)
                   OrNotHighHigh       50.29      (1.6%)       50.63      (2.0%)    0.7% (  -2% -    4%)
                    OrHighNotMed       82.27      (2.3%)       83.17      (2.3%)    1.1% (  -3% -    5%)
                         VeryLow     6149.21      (4.7%)     6223.22      (5.4%)    1.2% (  -8% -   11%)
                    OrHighNotLow       55.30      (3.2%)       56.25      (2.5%)    1.7% (  -3% -    7%)
                         LowTerm      808.21      (7.3%)      824.32      (4.5%)    2.0% (  -9% -   14%)
                        HighTerm      106.18      (4.3%)      108.63      (3.0%)    2.3% (  -4% -   10%)
                         MedTerm      296.65      (4.2%)      304.42      (2.7%)    2.6% (  -4% -   10%)
                        Wildcard       20.85      (7.5%)       21.50      (5.3%)    3.1% (  -8% -   17%)
                         Prefix3       95.63      (6.2%)       98.81      (5.3%)    3.3% (  -7% -   15%)
                          Fuzzy2       62.12      (9.0%)       64.44     (10.2%)    3.7% ( -14% -   25%)
                          IntNRQ        8.85      (8.9%)        9.21      (6.7%)    4.1% ( -10% -   21%)
                          Fuzzy1      105.42     (11.2%)      116.28      (4.8%)   10.3% (  -5% -   29%)
                       OrHighLow       51.75      (8.2%)       59.92      (8.2%)   15.8% (   0% -   35%)
                      OrHighHigh       32.34      (8.5%)       37.53      (8.5%)   16.0% (   0% -   36%)
                       OrHighMed       16.79      (8.7%)       19.62      (8.8%)   16.8% (   0% -   37%)
                  VeryLowVeryLow     2053.12      (2.3%)     2399.38      (3.2%)   16.9% (  11% -   22%)
        
        Show
        Adrien Grand added a comment - Same patch, just adding the suggested API in order to make BulkScorer able to skip. Results of the luceneutil benchmark still look similar: AndHighLow 883.42 (3.5%) 872.51 (3.3%) -1.2% ( -7% - 5%) OrNotHighLow 1052.93 (4.4%) 1048.44 (4.5%) -0.4% ( -8% - 8%) PKLookup 277.07 (2.0%) 276.65 (2.1%) -0.2% ( -4% - 4%) AndHighMed 137.40 (1.9%) 137.30 (2.4%) -0.1% ( -4% - 4%) HighSpanNear 34.67 (3.1%) 34.65 (3.0%) -0.0% ( -5% - 6%) LowSloppyPhrase 215.69 (2.5%) 215.61 (2.5%) -0.0% ( -4% - 5%) MedSloppyPhrase 183.08 (2.5%) 183.11 (2.0%) 0.0% ( -4% - 4%) HighPhrase 26.33 (6.8%) 26.34 (6.8%) 0.0% ( -12% - 14%) AndHighHigh 51.61 (1.8%) 51.64 (2.0%) 0.0% ( -3% - 3%) LowPhrase 74.61 (1.3%) 74.68 (1.4%) 0.1% ( -2% - 2%) HighSloppyPhrase 14.94 (5.7%) 14.97 (5.0%) 0.2% ( -9% - 11%) MedPhrase 31.42 (1.1%) 31.47 (1.1%) 0.2% ( -1% - 2%) LowSpanNear 55.89 (2.5%) 56.00 (2.5%) 0.2% ( -4% - 5%) Respell 73.38 (2.4%) 73.54 (2.2%) 0.2% ( -4% - 4%) OrNotHighMed 118.20 (1.6%) 118.66 (1.7%) 0.4% ( -2% - 3%) MedSpanNear 78.17 (3.2%) 78.62 (3.5%) 0.6% ( -5% - 7%) OrHighNotHigh 31.47 (1.8%) 31.66 (1.9%) 0.6% ( -2% - 4%) OrNotHighHigh 50.29 (1.6%) 50.63 (2.0%) 0.7% ( -2% - 4%) OrHighNotMed 82.27 (2.3%) 83.17 (2.3%) 1.1% ( -3% - 5%) VeryLow 6149.21 (4.7%) 6223.22 (5.4%) 1.2% ( -8% - 11%) OrHighNotLow 55.30 (3.2%) 56.25 (2.5%) 1.7% ( -3% - 7%) LowTerm 808.21 (7.3%) 824.32 (4.5%) 2.0% ( -9% - 14%) HighTerm 106.18 (4.3%) 108.63 (3.0%) 2.3% ( -4% - 10%) MedTerm 296.65 (4.2%) 304.42 (2.7%) 2.6% ( -4% - 10%) Wildcard 20.85 (7.5%) 21.50 (5.3%) 3.1% ( -8% - 17%) Prefix3 95.63 (6.2%) 98.81 (5.3%) 3.3% ( -7% - 15%) Fuzzy2 62.12 (9.0%) 64.44 (10.2%) 3.7% ( -14% - 25%) IntNRQ 8.85 (8.9%) 9.21 (6.7%) 4.1% ( -10% - 21%) Fuzzy1 105.42 (11.2%) 116.28 (4.8%) 10.3% ( -5% - 29%) OrHighLow 51.75 (8.2%) 59.92 (8.2%) 15.8% ( 0% - 35%) OrHighHigh 32.34 (8.5%) 37.53 (8.5%) 16.0% ( 0% - 36%) OrHighMed 16.79 (8.7%) 19.62 (8.8%) 16.8% ( 0% - 37%) VeryLowVeryLow 2053.12 (2.3%) 2399.38 (3.2%) 16.9% ( 11% - 22%)
        Hide
        ASF subversion and git services added a comment -

        Commit 1653020 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1653020 ]

        LUCENE-6184: Make BooleanScorer only score windows that contain matches.

        Show
        ASF subversion and git services added a comment - Commit 1653020 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1653020 ] LUCENE-6184 : Make BooleanScorer only score windows that contain matches.
        Hide
        ASF subversion and git services added a comment -

        Commit 1653023 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1653023 ]

        LUCENE-6184: Make BooleanScorer only score windows that contain matches.

        Show
        ASF subversion and git services added a comment - Commit 1653023 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1653023 ] LUCENE-6184 : Make BooleanScorer only score windows that contain matches.
        Hide
        Timothy Potter added a comment -

        Bulk close after 5.1 release

        Show
        Timothy Potter added a comment - Bulk close after 5.1 release

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development