Lucene - Core / LUCENE-6940

Bulk scoring could speed up MUST_NOT clauses

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today when you have MUST_NOT clauses, the ReqExclScorer is used and needs to check the excluded clauses on every iteration. I suspect we could speed things up by having a BulkScorer that would advance the excluded clause first and then tell the required clause to bulk score up to the next excluded document.
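The idea can be illustrated with a minimal sketch. It is not the code from the attached patches: plain sorted int arrays stand in for Lucene's doc ID iterators, and the class and method names (`ReqExclBulkSketch`, `collect`) are hypothetical. The excluded clause is advanced first, then the required clause is consumed in one uninterrupted run up to the next excluded document.

```java
import java.util.ArrayList;
import java.util.List;

public class ReqExclBulkSketch {
    // Sentinel meaning "no further excluded document" (assumption: real doc IDs are smaller).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /**
     * Collects every doc ID in the sorted array 'required' that does not
     * appear in the sorted array 'excluded'.
     */
    static List<Integer> collect(int[] required, int[] excluded) {
        List<Integer> hits = new ArrayList<>();
        int r = 0; // position in the required clause
        int e = 0; // position in the excluded clause
        while (r < required.length) {
            // Advance the excluded clause to the first excluded doc >= current required doc.
            while (e < excluded.length && excluded[e] < required[r]) {
                e++;
            }
            int upTo = e < excluded.length ? excluded[e] : NO_MORE_DOCS;
            // "Bulk score" the required clause over [required[r], upTo):
            // the exclusion is not consulted inside this inner loop.
            while (r < required.length && required[r] < upTo) {
                hits.add(required[r]);
                r++;
            }
            // Skip the required doc that coincides with the excluded doc, if any.
            if (r < required.length && required[r] == upTo) {
                r++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] required = {1, 3, 4, 7, 9, 12};
        int[] excluded = {3, 8, 9};
        System.out.println(collect(required, excluded));
    }
}
```

In this sketch the exclusion is checked once per excluded document rather than once per required document, so the fewer documents the excluded clause matches, the longer the uninterrupted runs.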

      1. LUCENE-6940.patch
        19 kB
        Adrien Grand
      2. LUCENE-6940.patch
        10 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a quick patch (disclaimer: not commented and not tested) to demonstrate the idea. It uses the new bulk scorer in either of these cases:

        • when there is a single FILTER/MUST clause, no SHOULD clauses, and some MUST_NOT clauses
        • or when there are some SHOULD clauses, no FILTER/MUST clauses and some MUST_NOT clauses
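The two cases above can be written as a predicate over clause counts. This is only a sketch of the rule as stated in this comment; the class name, method name and parameters are hypothetical, and the actual patch may check further conditions.

```java
public class ExclusionBulkScorerEligibility {
    /**
     * Returns true when the clause counts match one of the two cases listed
     * in the comment. All names here are illustrative, not Lucene API.
     */
    static boolean usesExclusionBulkScorer(int must, int filter, int should, int mustNot) {
        if (mustNot == 0) {
            return false; // nothing to exclude
        }
        int required = must + filter;
        // Case 1: a single FILTER/MUST clause and no SHOULD clauses.
        boolean singleRequired = required == 1 && should == 0;
        // Case 2: some SHOULD clauses and no FILTER/MUST clauses.
        boolean pureDisjunction = required == 0 && should > 0;
        return singleRequired || pureDisjunction;
    }

    public static void main(String[] args) {
        System.out.println(usesExclusionBulkScorer(1, 0, 0, 1)); // single MUST plus MUST_NOT
        System.out.println(usesExclusionBulkScorer(0, 0, 2, 1)); // two SHOULD plus MUST_NOT
        System.out.println(usesExclusionBulkScorer(1, 0, 2, 1)); // MUST mixed with SHOULD
    }
}
```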

        I added some tasks to wikimedium.10M.nostopwords.tasks and ran it through luceneutil. As expected, the speedup seems largest when the negative clauses match far fewer documents than the positive clauses.

        diff --git a/tasks/wikimedium.10M.nostopwords.tasks b/tasks/wikimedium.10M.nostopwords.tasks
        index 342070c..8991121 100644
        --- a/tasks/wikimedium.10M.nostopwords.tasks
        +++ b/tasks/wikimedium.10M.nostopwords.tasks
        @@ -13361,3 +13361,19 @@ OrNotHighLow: -do necessities # freq=511178 freq=1195
         OrHighNotLow: do -necessities # freq=511178 freq=1195
         OrNotHighLow: -had halfback # freq=1246743 freq=1205
         OrHighNotLow: had -halfback # freq=1246743 freq=1205
        +AllNotHigh: *:* -been # freq=1041183
        +AllNotHigh: *:* -states # freq=1034872
        +AllNotHigh: *:* -time # freq=1032071
        +AllNotHigh: *:* -when # freq=1027487
        +AllNotLow: *:* -factor # freq=37866
        +AllNotLow: *:* -migration # freq=37862
        +AllNotLow: *:* -maintained # freq=37840
        +AllNotLow: *:* -norwegian # freq=37836
        +OrHighHighNotLow: several following -factor # freq=436129 freq=416515 freq=37866
        +OrHighHighNotLow: publisher end -migration # freq=1289029 freq=526636 freq=37862
        +OrHighHighNotLow: 2009 film -maintaine # freq=887702 freq=432758 freq=37840
        +OrHighHighNotLow: http known -norwegian # freq=3493581 freq=607158 freq=37836
        +OrHighLowNotHigh: 2005 jorgensen -been # freq=835460 freq=837 freq=1041183
        +OrHighLowNotHigh: like undivided -states # freq=479390 freq=1512 freq=1034872
        +OrHighLowNotHigh: use coy -time # freq=597053 freq=1198 freq=1032071
        +OrHighLowNotHigh: been highperformanceengines -when # freq=1041183 freq=1155 freq=1027487
        
                          Task  QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                OrHighLowNotHigh       19.54      (2.6%)       18.59      (4.2%)   -4.9% ( -11% -    1%)
                       OrHighMed       34.32      (3.5%)       33.03      (4.8%)   -3.7% ( -11% -    4%)
                      OrHighHigh       26.95      (3.7%)       25.97      (4.9%)   -3.6% ( -11% -    5%)
                          Fuzzy2       82.74     (16.0%)       80.29     (16.4%)   -3.0% ( -30% -   35%)
                      AndHighLow      502.91      (5.7%)      496.63      (3.0%)   -1.2% (  -9% -    7%)
                      AndHighMed      236.44      (2.9%)      234.34      (2.6%)   -0.9% (  -6% -    4%)
                    OrNotHighMed      222.75      (2.9%)      220.87      (2.4%)   -0.8% (  -5% -    4%)
                         Respell       60.25      (3.0%)       60.38      (2.7%)    0.2% (  -5% -    6%)
                 MedSloppyPhrase       21.73      (2.3%)       21.92      (2.5%)    0.8% (  -3% -    5%)
                          Fuzzy1       57.18      (8.0%)       57.78      (5.8%)    1.1% ( -11% -   16%)
                 LowSloppyPhrase       25.96      (1.9%)       26.24      (2.1%)    1.1% (  -2% -    5%)
                HighSloppyPhrase       29.99      (2.5%)       30.37      (2.7%)    1.3% (  -3% -    6%)
                       MedPhrase       60.11      (2.8%)       61.15      (3.1%)    1.7% (  -4% -    7%)
                     AndHighHigh       32.86      (3.0%)       33.56      (3.0%)    2.1% (  -3% -    8%)
                       LowPhrase       59.36      (2.7%)       60.69      (3.2%)    2.2% (  -3% -    8%)
                       OrHighLow       78.50      (3.6%)       80.33      (4.3%)    2.3% (  -5% -   10%)
                      HighPhrase       17.32      (2.1%)       17.73      (1.9%)    2.4% (  -1% -    6%)
                     LowSpanNear       34.90      (2.8%)       35.75      (2.4%)    2.4% (  -2% -    7%)
                     MedSpanNear       30.83      (2.9%)       31.59      (2.0%)    2.4% (  -2% -    7%)
                    OrNotHighLow      982.57      (4.2%)     1009.18      (2.8%)    2.7% (  -4% -   10%)
                    HighSpanNear       10.39      (3.8%)       10.76      (3.7%)    3.5% (  -3% -   11%)
                        Wildcard       64.30      (4.2%)       67.27      (5.2%)    4.6% (  -4% -   14%)
                        HighTerm      110.90      (5.2%)      117.51      (6.7%)    6.0% (  -5% -   18%)
                         MedTerm      155.42      (5.3%)      165.05      (6.9%)    6.2% (  -5% -   19%)
                   OrNotHighHigh       40.19      (1.9%)       42.69      (3.2%)    6.2% (   1% -   11%)
                         Prefix3       87.35      (6.2%)       93.98      (6.9%)    7.6% (  -5% -   22%)
                         LowTerm      574.81      (9.0%)      625.04      (9.6%)    8.7% (  -9% -   30%)
                          IntNRQ       11.95      (9.1%)       13.31     (11.8%)   11.4% (  -8% -   35%)
                   OrHighNotHigh       50.66      (2.0%)       56.55      (4.3%)   11.6% (   5% -   18%)
                OrHighHighNotLow       27.15      (3.3%)       33.91      (4.9%)   24.9% (  16% -   34%)
                    OrHighNotMed       96.64      (2.7%)      130.20      (8.0%)   34.7% (  23% -   46%)
                    OrHighNotLow       42.44      (4.0%)       62.60     (10.7%)   47.5% (  31% -   64%)
                      AllNotHigh        6.51      (2.9%)       16.76     (26.8%)  157.4% ( 124% -  192%)
                       AllNotLow        7.18      (3.0%)       21.93     (49.2%)  205.3% ( 148% -  265%)
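For reading the table, the Pct diff column appears consistent with the relative change in mean QPS between baseline and patch. The sketch below recomputes it from the rounded values printed above; the class and method names are hypothetical, and small last-digit differences from the table are possible because the printed QPS values are themselves rounded.

```java
import java.util.Locale;

public class PctDiffSketch {
    /** Relative change of patch QPS over baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double patchQps) {
        return 100.0 * (patchQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // Values copied from the OrHighNotLow and OrHighHighNotLow rows.
        System.out.println(String.format(Locale.ROOT, "OrHighNotLow: %.1f%%", pctDiff(42.44, 62.60)));
        System.out.println(String.format(Locale.ROOT, "OrHighHighNotLow: %.1f%%", pctDiff(27.15, 33.91)));
    }
}
```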
        
        Adrien Grand added a comment -

        Here is a new patch. This time it has tests and tries to organize the code a bit better. Tests pass and luceneutil still reports similar times (this time I only ran the default tasks for wikimedium10m):

                          Task  QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                   OrNotHighHigh       35.34      (2.2%)       32.99      (3.3%)   -6.7% ( -11% -   -1%)
                    OrNotHighMed      174.57      (2.3%)      170.17      (1.8%)   -2.5% (  -6% -    1%)
                       OrHighMed       73.34      (4.6%)       72.13      (5.0%)   -1.7% ( -10% -    8%)
                      OrHighHigh        9.17      (5.7%)        9.03      (5.2%)   -1.5% ( -11% -    9%)
                   OrHighNotHigh       78.91      (2.8%)       78.16      (3.6%)   -0.9% (  -7% -    5%)
                       OrHighLow       19.79      (2.7%)       19.73      (3.6%)   -0.3% (  -6% -    6%)
                 LowSloppyPhrase       70.26      (2.1%)       70.14      (2.3%)   -0.2% (  -4% -    4%)
                      AndHighMed      191.70      (1.7%)      191.59      (1.7%)   -0.1% (  -3% -    3%)
                     AndHighHigh       79.74      (1.0%)       79.79      (0.9%)    0.1% (  -1% -    1%)
                 MedSloppyPhrase       73.28      (2.5%)       73.37      (2.7%)    0.1% (  -5% -    5%)
                         Respell       84.33      (2.4%)       84.60      (2.8%)    0.3% (  -4% -    5%)
                     LowSpanNear       12.79      (3.9%)       12.83      (3.0%)    0.4% (  -6% -    7%)
                       LowPhrase       48.59      (1.2%)       48.79      (1.2%)    0.4% (  -2% -    2%)
                       MedPhrase       33.55      (1.4%)       33.71      (1.3%)    0.5% (  -2% -    3%)
                    HighSpanNear       14.50      (3.2%)       14.60      (2.6%)    0.7% (  -4% -    6%)
                     MedSpanNear      151.02      (3.3%)      152.17      (1.7%)    0.8% (  -4% -    5%)
                HighSloppyPhrase       15.76      (5.3%)       15.90      (5.2%)    0.9% (  -9% -   12%)
                      HighPhrase       32.51      (2.3%)       33.09      (1.4%)    1.8% (  -1% -    5%)
                         Prefix3       90.59      (8.9%)       92.74      (7.5%)    2.4% ( -12% -   20%)
                        Wildcard      125.13      (8.2%)      128.21      (7.8%)    2.5% ( -12% -   20%)
                         MedTerm      291.05      (6.8%)      300.34      (6.5%)    3.2% (  -9% -   17%)
                          Fuzzy1       61.93      (8.8%)       64.08      (9.6%)    3.5% ( -13% -   23%)
                        HighTerm       79.63      (7.3%)       83.28      (6.9%)    4.6% (  -8% -   20%)
                          IntNRQ       10.39     (13.8%)       10.94     (11.5%)    5.3% ( -17% -   35%)
                         LowTerm      575.82     (12.7%)      607.32     (10.6%)    5.5% ( -15% -   33%)
                    OrNotHighLow      985.95      (4.4%)     1054.73      (3.0%)    7.0% (   0% -   15%)
                      AndHighLow      688.12      (8.2%)      736.65      (4.5%)    7.1% (  -5% -   21%)
                          Fuzzy2       58.94     (14.4%)       63.15      (8.9%)    7.2% ( -14% -   35%)
                    OrHighNotMed       84.50      (3.4%)       95.46      (3.8%)   13.0% (   5% -   20%)
                    OrHighNotLow       64.23      (3.3%)       76.36      (4.7%)   18.9% (  10% -   27%)
        
        ASF subversion and git services added a comment -

        Commit 1722443 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1722443 ]

        LUCENE-6940: Speed up MUST_NOT clauses.

        ASF subversion and git services added a comment -

        Commit 1722445 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1722445 ]

        LUCENE-6940: Speed up MUST_NOT clauses.


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 0
          • Watchers: 2
