Lucene - Core
LUCENE-6815

Should DisjunctionScorer advance more lazily?

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today if you call DisjunctionScorer.advance(X), it will try to advance all sub scorers to X. However, if DisjunctionScorer is being intersected with another scorer (which is almost always the case as we use BooleanScorer for top-level disjunctions), we could stop as soon as we find one matching sub scorer, and only advance the remaining sub scorers when freq() or score() is called.
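
The idea can be sketched on a toy model. The class below is not Lucene's DisjunctionScorer: it assumes each sub scorer is just a sorted array of doc IDs, and the names `LazyDisjunctionSketch`, `advanceSub` and `subAdvances` are hypothetical. `advance` returns as soon as one sub scorer lands exactly on the target, and `freq` catches up the sub scorers that were left behind.

```java
/** Toy disjunction over sorted doc-ID arrays; an illustration only, not Lucene code. */
class LazyDisjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[][] postings; // one sorted doc-ID list per sub scorer
    private final int[] pos;        // current index into each list
    private int doc = -1;           // current doc of the disjunction
    int subAdvances = 0;            // how many times a sub scorer was moved

    LazyDisjunctionSketch(int[]... postings) {
        this.postings = postings;
        this.pos = new int[postings.length];
    }

    private int docOf(int i) {
        return pos[i] < postings[i].length ? postings[i][pos[i]] : NO_MORE_DOCS;
    }

    private void advanceSub(int i, int target) {
        subAdvances++;
        while (docOf(i) < target) {
            pos[i]++;
        }
    }

    /** Moves to the first doc >= target; stops early if a sub scorer lands exactly on target. */
    int advance(int target) {
        for (int i = 0; i < postings.length; i++) {
            if (docOf(i) < target) {
                advanceSub(i, target);
            }
            if (docOf(i) == target) {
                return doc = target; // the remaining sub scorers stay behind for now
            }
        }
        // No sub scorer is on target, but all of them are now >= target: take the minimum.
        int min = NO_MORE_DOCS;
        for (int i = 0; i < postings.length; i++) {
            min = Math.min(min, docOf(i));
        }
        return doc = min;
    }

    /** Number of sub scorers on the current doc; lagging sub scorers are advanced first. */
    int freq() {
        int freq = 0;
        for (int i = 0; i < postings.length; i++) {
            if (docOf(i) < doc) {
                advanceSub(i, doc);
            }
            if (docOf(i) == doc) {
                freq++;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        LazyDisjunctionSketch d = new LazyDisjunctionSketch(
            new int[] {1, 5, 9}, new int[] {2, 5, 7}, new int[] {3, 5, 8});
        System.out.println("advance(5) -> " + d.advance(5) + ", sub advances: " + d.subAdvances);
        System.out.println("freq() -> " + d.freq() + ", sub advances: " + d.subAdvances);
    }
}
```

In this model a caller that only intersects the disjunction with another iterator never pays for the lagging sub scorers; the work is deferred until `freq()` (or, by analogy, `score()`) is actually requested.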

      1. LUCENE-6815.patch
        18 kB
        Adrien Grand
      2. LUCENE-6815.patch
        17 kB
        Adrien Grand
      3. LUCENE-6815.patch
        10 kB
        Adrien Grand

        Activity

        Terry Smith added a comment -

        Additionally, I don't see DisiPriorityQueue taking the cost of each scorer into account. I'd imagine that the scorer with highest cost is more likely to be a hit which would make this kind of lazy advancing even better.

        Adrien Grand added a comment -

        Indeed. Another cost that would be interesting to take into account is the cost of matching a Scorer (LUCENE-6276) so that we try to match the cheapest scorers first.

        David Smiley added a comment -

        Cool idea!

        Adrien Grand added a comment -

        Here is a patch I've been hacking on. Nothing changes in the case that no sub scorer supports two-phase iteration. However, if one or more sub scorers support approximations, matches() will try to match sub scorers in order of cost (currently assuming that the cost is 0 if the scorer does not support two-phase iteration and 1 otherwise) and return as soon as one matches. DisjunctionScorer will only try to match other sub scorers when score() or freq() is called.
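
A minimal sketch of the behaviour described in this comment, not the patch itself. It assumes a hypothetical `Sub` type whose `confirm` predicate stands in for the expensive second phase of two-phase iteration, and a float `matchCost` used only for ordering. `matches` verifies sub scorers from cheapest to most expensive and returns on the first hit; `score` verifies whatever is still pending before summing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Toy model of cost-ordered, lazy verification in a disjunction; not Lucene code. */
class CostOrderedMatchSketch {
    /** Hypothetical stand-in for a sub scorer whose approximation is on the candidate doc. */
    static final class Sub {
        final String name;
        final float matchCost;      // assumed: 0 means no second phase is needed
        final IntPredicate confirm; // the expensive per-document verification
        final float score;
        int checks = 0;             // how often confirm was evaluated

        Sub(String name, float matchCost, IntPredicate confirm, float score) {
            this.name = name;
            this.matchCost = matchCost;
            this.confirm = confirm;
            this.score = score;
        }

        boolean matches(int doc) {
            checks++;
            return confirm.test(doc);
        }
    }

    private final List<Sub> byCost;                               // cheapest first
    private final ArrayDeque<Sub> unverified = new ArrayDeque<>();
    private final List<Sub> matching = new ArrayList<>();
    private int doc = -1;

    CostOrderedMatchSketch(List<Sub> subs) {
        this.byCost = new ArrayList<>(subs);
        this.byCost.sort((a, b) -> Float.compare(a.matchCost, b.matchCost));
    }

    /** Returns true as soon as one sub scorer confirms the doc; the rest stay unverified. */
    boolean matches(int doc) {
        this.doc = doc;
        matching.clear();
        unverified.clear();
        unverified.addAll(byCost);
        while (!unverified.isEmpty()) {
            Sub s = unverified.poll();
            if (s.matches(doc)) {
                matching.add(s);
                return true;
            }
        }
        return false;
    }

    /** Verifies the sub scorers that matches() skipped, then sums the matching scores. */
    float score() {
        while (!unverified.isEmpty()) {
            Sub s = unverified.poll();
            if (s.matches(doc)) {
                matching.add(s);
            }
        }
        float sum = 0f;
        for (Sub s : matching) {
            sum += s.score;
        }
        return sum;
    }
}
```

If the caller never asks for `score()`, for example when the disjunction is only used as a filter, the more expensive sub scorers are never verified for documents that a cheaper one has already confirmed.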

        Adrien Grand added a comment -

        Updated patch. This now leverages the new matchCost() API in order to match sub scorers by order of cost. Unfortunately regular disjunctions get a bit slower with this patch:

        (bulk scoring has been disabled for this run)

                        Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                       OrHighLow       60.64      (4.8%)       57.41      (4.0%)   -5.3% ( -13% -    3%)
                      OrHighHigh       17.07      (4.2%)       16.27      (3.5%)   -4.7% ( -11% -    3%)
                       OrHighMed       66.59      (5.1%)       63.57      (3.6%)   -4.5% ( -12% -    4%)
                          Fuzzy1       52.48      (6.8%)       51.35      (8.0%)   -2.2% ( -15% -   13%)
                          Fuzzy2       61.04     (19.3%)       59.90     (18.5%)   -1.9% ( -33% -   44%)
                   OrHighNotHigh       65.22      (2.7%)       64.84      (3.4%)   -0.6% (  -6% -    5%)
                    OrHighNotMed       50.97      (3.8%)       50.71      (4.1%)   -0.5% (  -8% -    7%)
                        HighTerm       68.99      (3.7%)       68.64      (4.0%)   -0.5% (  -7% -    7%)
                    OrNotHighMed      238.28      (2.3%)      237.43      (2.7%)   -0.4% (  -5% -    4%)
                         LowTerm      667.00      (5.0%)      664.74      (5.5%)   -0.3% ( -10% -   10%)
                     AndHighHigh       33.10      (1.4%)       33.01      (1.7%)   -0.3% (  -3% -    2%)
                         MedTerm      185.93      (3.5%)      185.42      (3.8%)   -0.3% (  -7% -    7%)
                        Wildcard       40.77      (3.8%)       40.67      (4.6%)   -0.2% (  -8% -    8%)
                      HighPhrase       38.55      (2.7%)       38.48      (2.6%)   -0.2% (  -5% -    5%)
                    OrHighNotLow       37.43      (3.7%)       37.38      (4.4%)   -0.1% (  -7% -    8%)
                HighSloppyPhrase       17.66      (2.6%)       17.64      (2.6%)   -0.1% (  -5% -    5%)
                       LowPhrase       48.82      (1.1%)       48.81      (1.8%)   -0.0% (  -2% -    2%)
                    HighSpanNear       13.84      (2.8%)       13.83      (2.7%)   -0.0% (  -5% -    5%)
                   OrNotHighHigh       55.47      (2.5%)       55.53      (2.3%)    0.1% (  -4% -    4%)
                       MedPhrase      167.70      (3.0%)      167.86      (3.5%)    0.1% (  -6% -    6%)
                 MedSloppyPhrase       30.44      (1.7%)       30.49      (1.7%)    0.2% (  -3% -    3%)
                 LowSloppyPhrase       72.06      (1.5%)       72.23      (1.2%)    0.2% (  -2% -    3%)
                     LowSpanNear       23.84      (2.2%)       23.96      (2.5%)    0.5% (  -4% -    5%)
                          IntNRQ        8.14      (5.3%)        8.19      (5.5%)    0.5% (  -9% -   12%)
                      AndHighMed      216.29      (1.8%)      217.53      (1.8%)    0.6% (  -2% -    4%)
                      AndHighLow      882.75      (7.1%)      887.86      (9.3%)    0.6% ( -14% -   18%)
                     MedSpanNear       47.85      (2.9%)       48.14      (3.0%)    0.6% (  -5% -    6%)
                         Prefix3       33.98      (3.9%)       34.22      (4.0%)    0.7% (  -6% -    8%)
                    OrNotHighLow      757.50      (5.0%)      763.60      (5.5%)    0.8% (  -9% -   11%)
                         Respell       61.43      (3.3%)       61.94      (2.7%)    0.8% (  -5% -    7%)
        

        I'll look into how I can fix it.

        Adrien Grand added a comment -

        Updated patch to work with the new Scorer API.

        luceneutil likes it better as well (bulk scoring was disabled on boolean queries for this run):

                        Task    QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                    OrNotHighLow      836.69      (4.3%)      829.89      (4.2%)   -0.8% (  -8% -    8%)
                         Prefix3       36.35      (5.0%)       36.27      (4.7%)   -0.2% (  -9% -    9%)
                      HighPhrase       51.03      (3.5%)       50.99      (3.9%)   -0.1% (  -7% -    7%)
                     MedSpanNear       14.14      (2.1%)       14.15      (2.4%)    0.1% (  -4% -    4%)
                 LowSloppyPhrase       29.65      (2.5%)       29.68      (2.5%)    0.1% (  -4% -    5%)
                       LowPhrase       18.98      (2.3%)       19.02      (3.2%)    0.2% (  -5% -    5%)
                          IntNRQ        8.07      (6.5%)        8.09      (6.4%)    0.2% ( -11% -   14%)
                       MedPhrase       37.46      (2.1%)       37.54      (2.4%)    0.2% (  -4% -    4%)
                         LowTerm      657.15      (8.1%)      658.77      (8.6%)    0.2% ( -15% -   18%)
                    HighSpanNear       10.35      (3.4%)       10.39      (3.7%)    0.4% (  -6% -    7%)
                    OrHighNotMed       57.75      (6.1%)       57.97      (6.4%)    0.4% ( -11% -   13%)
                HighSloppyPhrase        3.09      (6.4%)        3.10      (6.3%)    0.4% ( -11% -   14%)
                     AndHighHigh      104.21      (2.2%)      104.63      (1.9%)    0.4% (  -3% -    4%)
                        Wildcard      139.23      (4.0%)      139.82      (4.8%)    0.4% (  -7% -    9%)
                    OrNotHighMed      160.13      (3.7%)      161.04      (3.2%)    0.6% (  -6% -    7%)
                         MedTerm      196.83      (5.1%)      198.03      (4.9%)    0.6% (  -8% -   11%)
                 MedSloppyPhrase       28.51      (3.0%)       28.69      (2.6%)    0.6% (  -4% -    6%)
                      AndHighMed      199.74      (2.9%)      201.03      (2.1%)    0.6% (  -4% -    5%)
                   OrNotHighHigh       52.47      (4.0%)       52.81      (3.7%)    0.6% (  -6% -    8%)
                      AndHighLow      831.21      (5.2%)      836.72      (4.4%)    0.7% (  -8% -   10%)
                        HighTerm      106.80      (4.9%)      107.52      (4.5%)    0.7% (  -8% -   10%)
                   OrHighNotHigh       49.22      (4.2%)       49.58      (4.9%)    0.7% (  -7% -   10%)
                    OrHighNotLow       72.74      (6.9%)       73.36      (7.2%)    0.9% ( -12% -   16%)
                     LowSpanNear       40.43      (1.7%)       40.86      (2.0%)    1.1% (  -2% -    4%)
                         Respell       85.26      (4.0%)       86.34      (2.6%)    1.3% (  -5% -    8%)
                          Fuzzy1       76.05     (12.4%)       77.99     (14.1%)    2.6% ( -21% -   33%)
                      OrHighHigh       21.18      (5.5%)       22.00      (5.6%)    3.9% (  -6% -   15%)
                       OrHighMed       44.83      (5.4%)       46.73      (6.2%)    4.2% (  -6% -   16%)
                       OrHighLow       53.74      (5.3%)       56.55      (5.7%)    5.2% (  -5% -   17%)
                          Fuzzy2       49.00     (14.4%)       51.79      (9.5%)    5.7% ( -15% -   34%)
        
        Paul Elschot added a comment -

        LGTM, core tests pass here.

        ASF subversion and git services added a comment -

        Commit 1719470 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1719470 ]

        LUCENE-6815: Make DisjunctionScorer advance lazily.

        ASF subversion and git services added a comment -

        Commit 1719490 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1719490 ]

        LUCENE-6815: Make DisjunctionScorer advance lazily.


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 1
          • Watchers: 6
