Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7262

Add back the "estimate match count" optimization

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1, master (7.0)
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Follow-up to my last message on LUCENE-7051: I removed this optimization a while ago because it made things a bit more complicated but did not seem to help with point queries. However the reason why it did not seem to help was that the benchmark only runs queries that match 25% of the dataset. This makes the run time completely dominated by calls to FixedBitSet.set so the call to FixedBitSet.cardinality() looks free. However with slightly sparser queries like the geo benchmark generates (dense enough to trigger the creation of a FixedBitSet but sparse enough so that FixedBitSet.set does not dominate the run time), one can notice speed-ups when this call is skipped.

      1. LUCENE-7262.patch
        23 kB
        Adrien Grand
      2. LUCENE-7262.patch
        19 kB
        Adrien Grand
      3. LUCENE-7262.patch
        11 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          jpountz Adrien Grand added a comment -

          Here is a patch. It adds back the optimization in a similar way ta what MatchingPoints does, and adds som assertions to AssertingPointsFormat to make sure that there are no more calls to add() than what grow() reserved.

          Show
          jpountz Adrien Grand added a comment - Here is a patch. It adds back the optimization in a similar way ta what MatchingPoints does, and adds som assertions to AssertingPointsFormat to make sure that there are no more calls to add() than what grow() reserved.
          Hide
          rcmuir Robert Muir added a comment -

          So this means postings still calls cardinality()? Why wouldn't it do the same? I'm a bit concerned with each query tracking its own estimate (and having the formula/stats pulling etc duplicated everywhere).

          This is why when looking at MatchingPoints, it pulls the stats it needs. but alternatively DocIDSetBuilder could take parameters of sumDocFreq, maxDoc, docCount and do this itself. Points would pass size() for sumDocFreq, its the equivalent there.

          In other words, i see providing a good cost() as the responsibility of DocIDSetBuilder. The only thing impl-specific is how to get sumDocFreq and docCount (e.g. Terms.sumDocFreq/docCount vs PointValues.size/docCount).

          Show
          rcmuir Robert Muir added a comment - So this means postings still calls cardinality()? Why wouldn't it do the same? I'm a bit concerned with each query tracking its own estimate (and having the formula/stats pulling etc duplicated everywhere). This is why when looking at MatchingPoints, it pulls the stats it needs. but alternatively DocIDSetBuilder could take parameters of sumDocFreq, maxDoc, docCount and do this itself. Points would pass size() for sumDocFreq, its the equivalent there. In other words, i see providing a good cost() as the responsibility of DocIDSetBuilder. The only thing impl-specific is how to get sumDocFreq and docCount (e.g. Terms.sumDocFreq/docCount vs PointValues.size/docCount).
          Hide
          jpountz Adrien Grand added a comment -

          Good ideas. I added it as close as it was before LUCENE-7051 but I will give these ideas a try.

          Show
          jpountz Adrien Grand added a comment - Good ideas. I added it as close as it was before LUCENE-7051 but I will give these ideas a try.
          Hide
          rcmuir Robert Muir added a comment -

          I think the problem at LUCENE-7051 time was, that points didnt have any statistics. Now they do, so i think our job is easier.

          Show
          rcmuir Robert Muir added a comment - I think the problem at LUCENE-7051 time was, that points didnt have any statistics. Now they do, so i think our job is easier.
          Hide
          jpountz Adrien Grand added a comment -

          Here is a new patch. It also uses statistics to figure out whether the int[] needs to remove duplicates in the sparse case. When I patch the LatLon queries to use DocIdSetBuilder instead of MatchingPoints and run the geo benchmark, I get the following QPS:

          benchmark DocIdSetBuilder DocIdSetBuilder + patch
          distance 47.3 48.7
          poly 5 35.2 35.6

          So there is a noticeable speedup.

          Show
          jpountz Adrien Grand added a comment - Here is a new patch. It also uses statistics to figure out whether the int[] needs to remove duplicates in the sparse case. When I patch the LatLon queries to use DocIdSetBuilder instead of MatchingPoints and run the geo benchmark, I get the following QPS: benchmark DocIdSetBuilder DocIdSetBuilder + patch distance 47.3 48.7 poly 5 35.2 35.6 So there is a noticeable speedup.
          Hide
          rcmuir Robert Muir added a comment -

          nice to remove that dedup a lot of the time too.

          Show
          rcmuir Robert Muir added a comment - nice to remove that dedup a lot of the time too.
          Hide
          dsmiley David Smiley added a comment -

          +1 and nice testing. I think you can use the new constructor accepting Terms for stats in more places (judging from a find-usages on DocIdSetBuilder).

          Show
          dsmiley David Smiley added a comment - +1 and nice testing. I think you can use the new constructor accepting Terms for stats in more places (judging from a find-usages on DocIdSetBuilder).
          Hide
          jpountz Adrien Grand added a comment -

          I updated the patch to cutover a bit more classes to the constructor that takes a Terms instance. Now all users of DocIdSetBuilder use the constructor that takes a Terms/PointValues instance except one case: TermsQuery in the case of multiple fields. I'll commit soon.

          Show
          jpountz Adrien Grand added a comment - I updated the patch to cutover a bit more classes to the constructor that takes a Terms instance. Now all users of DocIdSetBuilder use the constructor that takes a Terms/PointValues instance except one case: TermsQuery in the case of multiple fields. I'll commit soon.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4fa2b29b200b2a92157396af3f485d38a4954e7a in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4fa2b29 ]

          LUCENE-7262: Leverage index statistics to make DocIdSetBuilder more efficient.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4fa2b29b200b2a92157396af3f485d38a4954e7a in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4fa2b29 ] LUCENE-7262 : Leverage index statistics to make DocIdSetBuilder more efficient.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit e9f2ac0021e004593599706f4e2db1bd1f724248 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e9f2ac0 ]

          LUCENE-7262: Leverage index statistics to make DocIdSetBuilder more efficient.

          Show
          jira-bot ASF subversion and git services added a comment - Commit e9f2ac0021e004593599706f4e2db1bd1f724248 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e9f2ac0 ] LUCENE-7262 : Leverage index statistics to make DocIdSetBuilder more efficient.
          Hide
          steve_rowe Steve Rowe added a comment -

          TestPointQueries failures reported on LUCENE-7269 appear to be related to this issue.

          Show
          steve_rowe Steve Rowe added a comment - TestPointQueries failures reported on LUCENE-7269 appear to be related to this issue.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 91153b9627d7bd9e17dcb4762ebbaf26bc3410f4 in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=91153b9 ]

          LUCENE-7262: Fix NPE, this should lazy-init in start()

          Show
          jira-bot ASF subversion and git services added a comment - Commit 91153b9627d7bd9e17dcb4762ebbaf26bc3410f4 in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=91153b9 ] LUCENE-7262 : Fix NPE, this should lazy-init in start()
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5b51479b69ec3c52e42c9b95418ee285080311f7 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5b51479 ]

          LUCENE-7262: Fix NPE, this should lazy-init in start()
          (cherry picked from commit 91153b9)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5b51479b69ec3c52e42c9b95418ee285080311f7 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5b51479 ] LUCENE-7262 : Fix NPE, this should lazy-init in start() (cherry picked from commit 91153b9)
          Hide
          hossman Hoss Man added a comment -

          Manually correcting fixVersion per Step #S5 of LUCENE-7271

          Show
          hossman Hoss Man added a comment - Manually correcting fixVersion per Step #S5 of LUCENE-7271

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development