Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7641

Speed up point ranges that match most documents

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If a point range matches most documents and every document has exactly one value, then we could make things faster by computing the set of documents that do NOT match the range instead.

      It was not possible until recently since figuring out whether a range query matches most documents was not possible, but we can now use the new PointValues.estimatePointcount API to do that: we could just check whether the cost of the inverse visitor is lower than the cost of the regular range visitor.

      1. LUCENE-7461.patch
        7 kB
        Adrien Grand
      2. LUCENE-7641.patch
        11 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch that implements this idea.

        Show
        jpountz Adrien Grand added a comment - Here is a patch that implements this idea.
        Hide
        mikemccand Michael McCandless added a comment -

        This is a great idea!

        Why do you have to compute and check the inverseCost if you already know the cost() is > maxDoc()/2? Shouldn't the inverseCost always be around maxDoc()-cost()?

        Show
        mikemccand Michael McCandless added a comment - This is a great idea! Why do you have to compute and check the inverseCost if you already know the cost() is > maxDoc()/2 ? Shouldn't the inverseCost always be around maxDoc()-cost() ?
        Hide
        jpountz Adrien Grand added a comment -

        I guess I wanted to stay on the safe side since point count estimation tends to be overestimated. Maybe we should improve the formula to be more accurate instead of checking the inverse cost. For instance, maybe we should count maxPointsInLeafNode/2 when the relation is CELL_CROSSES_QUERY on a leaf cell as well as make BKDReader record the maximum number of points that have been put in a leaf in practice rather than the configuration parameter that was passed to BKDWriter, since the latter can be up to 2x the actual number of points in leaf nodes in the N-dims case?

        Show
        jpountz Adrien Grand added a comment - I guess I wanted to stay on the safe side since point count estimation tends to be overestimated. Maybe we should improve the formula to be more accurate instead of checking the inverse cost. For instance, maybe we should count maxPointsInLeafNode/2 when the relation is CELL_CROSSES_QUERY on a leaf cell as well as make BKDReader record the maximum number of points that have been put in a leaf in practice rather than the configuration parameter that was passed to BKDWriter , since the latter can be up to 2x the actual number of points in leaf nodes in the N-dims case?
        Hide
        jpountz Adrien Grand added a comment -

        Here is a patch that does not compute the inverse cost and folds in the two ideas that I mentioned above to make point count estimation more accurate.

        I also ran a benchmark with luceneutil10m:

                            TaskQPS baseline      StdDev   QPS patch      StdDev                Pct diff
                          Fuzzy1      190.46      (7.7%)      176.23     (12.6%)   -7.5% ( -25% -   13%)
                         Respell      277.90      (3.6%)      275.11      (3.0%)   -1.0% (  -7% -    5%)
                     MedSpanNear      150.45      (2.5%)      149.27      (3.1%)   -0.8% (  -6% -    4%)
                        Wildcard      112.61      (6.4%)      111.76      (5.9%)   -0.8% ( -12% -   12%)
                   OrNotHighHigh       60.90      (3.6%)       60.51      (4.3%)   -0.6% (  -8% -    7%)
                      AndHighLow     1168.30      (3.9%)     1160.93      (3.8%)   -0.6% (  -8% -    7%)
                       OrHighMed       41.35      (5.1%)       41.16      (5.2%)   -0.4% ( -10% -   10%)
                    OrHighNotMed       69.18      (3.9%)       68.94      (4.1%)   -0.3% (  -8% -    8%)
                   OrHighNotHigh       47.52      (3.8%)       47.39      (4.1%)   -0.3% (  -7% -    7%)
                         LowTerm      691.05      (4.1%)      689.34      (4.5%)   -0.2% (  -8% -    8%)
                HighSloppyPhrase       20.35      (3.1%)       20.31      (3.1%)   -0.2% (  -6% -    6%)
                    HighSpanNear       27.30      (2.0%)       27.25      (1.8%)   -0.2% (  -3% -    3%)
                       LowPhrase      101.02      (2.2%)      100.92      (2.8%)   -0.1% (  -4% -    4%)
                          Fuzzy2       68.31      (5.8%)       68.25      (7.1%)   -0.1% ( -12% -   13%)
                     LowSpanNear      170.99      (2.7%)      170.87      (3.4%)   -0.1% (  -6% -    6%)
                 LowSloppyPhrase       55.34      (1.9%)       55.30      (2.1%)   -0.1% (  -4% -    4%)
                       OrHighLow       84.81      (3.8%)       84.81      (4.1%)    0.0% (  -7% -    8%)
                      AndHighMed      294.21      (2.3%)      294.28      (2.5%)    0.0% (  -4% -    4%)
                         Prefix3       32.27      (8.8%)       32.28      (8.0%)    0.0% ( -15% -   18%)
           HighTermDayOfYearSort       65.88      (2.4%)       65.92      (2.8%)    0.1% (  -5% -    5%)
                      HighPhrase       16.67      (2.1%)       16.70      (1.8%)    0.1% (  -3% -    4%)
                       MedPhrase       82.73      (3.2%)       82.86      (3.0%)    0.2% (  -5% -    6%)
                 MedSloppyPhrase       42.22      (2.1%)       42.29      (2.3%)    0.2% (  -4% -    4%)
                    OrNotHighLow      889.62      (3.3%)      891.74      (4.9%)    0.2% (  -7% -    8%)
                      OrHighHigh       32.35      (5.2%)       32.43      (5.2%)    0.3% (  -9% -   11%)
                    OrNotHighMed      158.41      (3.3%)      158.83      (3.6%)    0.3% (  -6% -    7%)
                     AndHighHigh       47.84      (1.4%)       47.98      (1.8%)    0.3% (  -2% -    3%)
                         MedTerm      240.59      (4.1%)      241.51      (3.6%)    0.4% (  -7% -    8%)
                    OrHighNotLow      111.19      (4.7%)      111.65      (5.3%)    0.4% (  -9% -   10%)
                        HighTerm      100.65      (4.3%)      101.63      (4.6%)    1.0% (  -7% -   10%)
               HighTermMonthSort      140.07     (11.1%)      141.91     (10.5%)    1.3% ( -18% -   25%)
                        IntNRQ10       76.64      (9.9%)       80.71      (2.8%)    5.3% (  -6% -   19%)
                        IntNRQ50       17.16     (11.4%)       18.11      (2.7%)    5.6% (  -7% -   22%)
                        IntNRQ25       33.52     (11.0%)       35.61      (3.0%)    6.2% (  -6% -   22%)
                          IntNRQ       12.62     (11.8%)       15.84      (4.4%)   25.6% (   8% -   47%)
                        IntNRQ75       11.57     (11.7%)       15.18      (4.7%)   31.2% (  13% -   54%)
                        IntNRQ90        9.71     (11.6%)       13.84      (5.6%)   42.6% (  22% -   67%)
        

        This looks like a good speedup to me.

        The ranges that are followed by a number are some additional ranges that I added in order to see the impact based on the percentage of matched documents. For instance IntNRQ75 has ranges that match 75% of documents. Here are the tasks that I appended to wikimedium.10M.nostopwords.tasks if you would like to reproduce:

        IntNRQ10: nrq//timesecnum 6207 15563
        IntNRQ10: nrq//timesecnum 53 8742
        IntNRQ10: nrq//timesecnum 77160 84811
        IntNRQ25: nrq//timesecnum 8899 36335
        IntNRQ25: nrq//timesecnum 68109 86101
        IntNRQ25: nrq//timesecnum 44771 64123
        IntNRQ50: nrq//timesecnum 101 51223
        IntNRQ50: nrq//timesecnum 25811 69953
        IntNRQ50: nrq//timesecnum 46298 82710
        IntNRQ75: nrq//timesecnum 13867 79886
        IntNRQ75: nrq//timesecnum 8830 75922
        IntNRQ75: nrq//timesecnum 17003 82111
        IntNRQ90: nrq//timesecnum 1810 80159
        IntNRQ90: nrq//timesecnum 6160 84811
        IntNRQ90: nrq//timesecnum 8210 86330
        

        It is interesting that small ranges seem to also benefit from a small speedup (I reran the benchmark to confirm and it looks reproducible). I am unsure why, but suspect that the fact that we use a different code path for large ranges makes things easier for the JVM to optimize on small ranges .

        Show
        jpountz Adrien Grand added a comment - Here is a patch that does not compute the inverse cost and folds in the two ideas that I mentioned above to make point count estimation more accurate. I also ran a benchmark with luceneutil10m: TaskQPS baseline StdDev QPS patch StdDev Pct diff Fuzzy1 190.46 (7.7%) 176.23 (12.6%) -7.5% ( -25% - 13%) Respell 277.90 (3.6%) 275.11 (3.0%) -1.0% ( -7% - 5%) MedSpanNear 150.45 (2.5%) 149.27 (3.1%) -0.8% ( -6% - 4%) Wildcard 112.61 (6.4%) 111.76 (5.9%) -0.8% ( -12% - 12%) OrNotHighHigh 60.90 (3.6%) 60.51 (4.3%) -0.6% ( -8% - 7%) AndHighLow 1168.30 (3.9%) 1160.93 (3.8%) -0.6% ( -8% - 7%) OrHighMed 41.35 (5.1%) 41.16 (5.2%) -0.4% ( -10% - 10%) OrHighNotMed 69.18 (3.9%) 68.94 (4.1%) -0.3% ( -8% - 8%) OrHighNotHigh 47.52 (3.8%) 47.39 (4.1%) -0.3% ( -7% - 7%) LowTerm 691.05 (4.1%) 689.34 (4.5%) -0.2% ( -8% - 8%) HighSloppyPhrase 20.35 (3.1%) 20.31 (3.1%) -0.2% ( -6% - 6%) HighSpanNear 27.30 (2.0%) 27.25 (1.8%) -0.2% ( -3% - 3%) LowPhrase 101.02 (2.2%) 100.92 (2.8%) -0.1% ( -4% - 4%) Fuzzy2 68.31 (5.8%) 68.25 (7.1%) -0.1% ( -12% - 13%) LowSpanNear 170.99 (2.7%) 170.87 (3.4%) -0.1% ( -6% - 6%) LowSloppyPhrase 55.34 (1.9%) 55.30 (2.1%) -0.1% ( -4% - 4%) OrHighLow 84.81 (3.8%) 84.81 (4.1%) 0.0% ( -7% - 8%) AndHighMed 294.21 (2.3%) 294.28 (2.5%) 0.0% ( -4% - 4%) Prefix3 32.27 (8.8%) 32.28 (8.0%) 0.0% ( -15% - 18%) HighTermDayOfYearSort 65.88 (2.4%) 65.92 (2.8%) 0.1% ( -5% - 5%) HighPhrase 16.67 (2.1%) 16.70 (1.8%) 0.1% ( -3% - 4%) MedPhrase 82.73 (3.2%) 82.86 (3.0%) 0.2% ( -5% - 6%) MedSloppyPhrase 42.22 (2.1%) 42.29 (2.3%) 0.2% ( -4% - 4%) OrNotHighLow 889.62 (3.3%) 891.74 (4.9%) 0.2% ( -7% - 8%) OrHighHigh 32.35 (5.2%) 32.43 (5.2%) 0.3% ( -9% - 11%) OrNotHighMed 158.41 (3.3%) 158.83 (3.6%) 0.3% ( -6% - 7%) AndHighHigh 47.84 (1.4%) 47.98 (1.8%) 0.3% ( -2% - 3%) MedTerm 240.59 (4.1%) 241.51 (3.6%) 0.4% ( -7% - 8%) OrHighNotLow 111.19 (4.7%) 111.65 (5.3%) 0.4% ( -9% - 10%) HighTerm 100.65 (4.3%) 101.63 (4.6%) 1.0% ( -7% - 10%) HighTermMonthSort 140.07 (11.1%) 141.91 (10.5%) 1.3% ( -18% - 25%) IntNRQ10 76.64 (9.9%) 80.71 (2.8%) 5.3% ( -6% - 19%) IntNRQ50 17.16 (11.4%) 18.11 (2.7%) 5.6% ( -7% - 22%) IntNRQ25 33.52 (11.0%) 35.61 (3.0%) 6.2% ( -6% - 22%) IntNRQ 12.62 (11.8%) 15.84 (4.4%) 25.6% ( 8% - 47%) IntNRQ75 11.57 (11.7%) 15.18 (4.7%) 31.2% ( 13% - 54%) IntNRQ90 9.71 (11.6%) 13.84 (5.6%) 42.6% ( 22% - 67%) This looks like a good speedup to me. The ranges that are followed by a number are some additional ranges that I added in order to see the impact based on the percentage of matched documents. For instance IntNRQ75 has ranges that match 75% of documents. Here are the tasks that I appended to wikimedium.10M.nostopwords.tasks if you would like to reproduce: IntNRQ10: nrq//timesecnum 6207 15563 IntNRQ10: nrq//timesecnum 53 8742 IntNRQ10: nrq//timesecnum 77160 84811 IntNRQ25: nrq//timesecnum 8899 36335 IntNRQ25: nrq//timesecnum 68109 86101 IntNRQ25: nrq//timesecnum 44771 64123 IntNRQ50: nrq//timesecnum 101 51223 IntNRQ50: nrq//timesecnum 25811 69953 IntNRQ50: nrq//timesecnum 46298 82710 IntNRQ75: nrq//timesecnum 13867 79886 IntNRQ75: nrq//timesecnum 8830 75922 IntNRQ75: nrq//timesecnum 17003 82111 IntNRQ90: nrq//timesecnum 1810 80159 IntNRQ90: nrq//timesecnum 6160 84811 IntNRQ90: nrq//timesecnum 8210 86330 It is interesting that small ranges seem to also benefit from a small speedup (I reran the benchmark to confirm and it looks reproducible). I am unsure why, but suspect that the fact that we use a different code path for large ranges makes things easier for the JVM to optimize on small ranges .
        Hide
        mikemccand Michael McCandless added a comment -

        +1, looks great!

        Show
        mikemccand Michael McCandless added a comment - +1, looks great!
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 3404677e57fcf7901813f7d7ccfc3e57db011993 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3404677 ]

        LUCENE-7641: Speed up range queries that match most documents.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 3404677e57fcf7901813f7d7ccfc3e57db011993 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3404677 ] LUCENE-7641 : Speed up range queries that match most documents.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit be39d1fd0971c211a05ee8ba6e7215e0c8cf5190 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=be39d1f ]

        LUCENE-7641: Speed up range queries that match most documents.

        Show
        jira-bot ASF subversion and git services added a comment - Commit be39d1fd0971c211a05ee8ba6e7215e0c8cf5190 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=be39d1f ] LUCENE-7641 : Speed up range queries that match most documents.
        Hide
        jpountz Adrien Grand added a comment -

        The nightly benchmarks confirm the good speedup. http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html

        Show
        jpountz Adrien Grand added a comment - The nightly benchmarks confirm the good speedup. http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html
        Hide
        mikemccand Michael McCandless added a comment -

        Nice!

        Show
        mikemccand Michael McCandless added a comment - Nice!

          People

          • Assignee:
            jpountz Adrien Grand
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development