Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-5979

Use the cost API instead of a heuristic on the first document in FilteredQuery to decide on whether to use random access

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that some major filters such as TermsFilter and MultiTermQueryWrapperFilter return DocIdSets that have a better cost, we should switch to the cost API.

        Issue Links

          Activity

          Hide
          jpountz Adrien Grand added a comment -

          Here is a simple patch. I imagine the idea of firstFilterDoc < 100 was to use random-access when the filter matches more than 1% of documents, so I translated it to filterCost * 100 > maxDoc.

          Show
          jpountz Adrien Grand added a comment - Here is a simple patch. I imagine the idea of firstFilterDoc < 100 was to use random-access when the filter matches more than 1% of documents, so I translated it to filterCost * 100 > maxDoc .
          Hide
          rcmuir Robert Muir added a comment -

          I like the idea of using a better heuristic here, I guess my question is if interpreting the cost method as an absolute value (versus just using it as a relative one) is safe?

          Show
          rcmuir Robert Muir added a comment - I like the idea of using a better heuristic here, I guess my question is if interpreting the cost method as an absolute value (versus just using it as a relative one) is safe?
          Hide
          jpountz Adrien Grand added a comment -

          I don't think the documentation of the cost API prevents from interpreting the cost as an absolute value? Even if it is not something that we do today, I think all our current DocIdSetIterator (but FixedBitSet) return cost values that are ok to intepret as an absolute value? Even if this estimation is very vague it would be much better than what we have today to decide on the filter strategy to use. Moreover I think this could also prove valuable in CachingWrapperFilter to decide on the cacheable impl to use: some impls would likely better at high-cardinality and vice-versa?

          Show
          jpountz Adrien Grand added a comment - I don't think the documentation of the cost API prevents from interpreting the cost as an absolute value? Even if it is not something that we do today, I think all our current DocIdSetIterator (but FixedBitSet) return cost values that are ok to intepret as an absolute value? Even if this estimation is very vague it would be much better than what we have today to decide on the filter strategy to use. Moreover I think this could also prove valuable in CachingWrapperFilter to decide on the cacheable impl to use: some impls would likely better at high-cardinality and vice-versa?
          Hide
          rcmuir Robert Muir added a comment -

          I don't think the documentation prevents it, its just this would be the first use of it.

          so I am +1, I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          Show
          rcmuir Robert Muir added a comment - I don't think the documentation prevents it, its just this would be the first use of it. so I am +1, I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?
          Hide
          thetaphi Uwe Schindler added a comment -

          Thanks for removing the crappy extra leapfrog iterator that was just there to deliver the firstFilterDoc (because it was consumed before). I am unfortunately not completely familar with the absolute or non-absolute cost values to make a good guess if cost * 100 > maxDoc is fine...

          Show
          thetaphi Uwe Schindler added a comment - Thanks for removing the crappy extra leapfrog iterator that was just there to deliver the firstFilterDoc (because it was consumed before). I am unfortunately not completely familar with the absolute or non-absolute cost values to make a good guess if cost * 100 > maxDoc is fine...
          Hide
          thetaphi Uwe Schindler added a comment -

          I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          LUCENE-5441 was opened to solve this. I just did not finish it, but the patch should still work as base for extending.

          Show
          thetaphi Uwe Schindler added a comment - I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet? LUCENE-5441 was opened to solve this. I just did not finish it, but the patch should still work as base for extending.
          Hide
          jpountz Adrien Grand added a comment -

          I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          LUCENE-5938 makes sure to not use FixedBitSet when the doc ids to store are sparse. I think the only last bad guys are ChainedFilter (which I proposed to remove) and BooleanFilter (that could be fixed). So I don't think this should be a blocker for this issue?

          Show
          jpountz Adrien Grand added a comment - I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet? LUCENE-5938 makes sure to not use FixedBitSet when the doc ids to store are sparse. I think the only last bad guys are ChainedFilter (which I proposed to remove) and BooleanFilter (that could be fixed). So I don't think this should be a blocker for this issue?
          Hide
          rcmuir Robert Muir added a comment -

          ok.

          Show
          rcmuir Robert Muir added a comment - ok.
          Hide
          jpountz Adrien Grand added a comment -

          If there are no concerns anymore, I will commit this patch soon.

          Show
          jpountz Adrien Grand added a comment - If there are no concerns anymore, I will commit this patch soon.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1629598 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1629598 ]

          LUCENE-5979: Use the cost API to decide on whether to use random-access to intersect queries and filters.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1629598 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1629598 ] LUCENE-5979 : Use the cost API to decide on whether to use random-access to intersect queries and filters.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1629599 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1629599 ]

          LUCENE-5979: Use the cost API to decide on whether to use random-access to intersect queries and filters.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1629599 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1629599 ] LUCENE-5979 : Use the cost API to decide on whether to use random-access to intersect queries and filters.
          Hide
          anshumg Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          anshumg Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              jpountz Adrien Grand
            • Votes:
              2 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development