Lucene - Core
  1. Lucene - Core
  2. LUCENE-5979

Use the cost API instead of a heuristic on the first document in FilteredQuery to decide on whether to use random access

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that some major filters such as TermsFilter and MultiTermQueryWrapperFilter return DocIdSets that have a better cost, we should switch to the cost API.

        Issue Links

          Activity

          Hide
          Adrien Grand added a comment -

          Here is a simple patch. I imagine the idea of firstFilterDoc < 100 was to use random-access when the filter matches more than 1% of documents, so I translated it to filterCost * 100 > maxDoc.

          Show
          Adrien Grand added a comment - Here is a simple patch. I imagine the idea of firstFilterDoc < 100 was to use random-access when the filter matches more than 1% of documents, so I translated it to filterCost * 100 > maxDoc .
          Hide
          Robert Muir added a comment -

          I like the idea of using a better heuristic here, I guess my question is if interpreting the cost method as an absolute value (versus just using it as a relative one) is safe?

          Show
          Robert Muir added a comment - I like the idea of using a better heuristic here, I guess my question is if interpreting the cost method as an absolute value (versus just using it as a relative one) is safe?
          Hide
          Adrien Grand added a comment -

          I don't think the documentation of the cost API prevents from interpreting the cost as an absolute value? Even if it is not something that we do today, I think all our current DocIdSetIterator (but FixedBitSet) return cost values that are ok to intepret as an absolute value? Even if this estimation is very vague it would be much better than what we have today to decide on the filter strategy to use. Moreover I think this could also prove valuable in CachingWrapperFilter to decide on the cacheable impl to use: some impls would likely better at high-cardinality and vice-versa?

          Show
          Adrien Grand added a comment - I don't think the documentation of the cost API prevents from interpreting the cost as an absolute value? Even if it is not something that we do today, I think all our current DocIdSetIterator (but FixedBitSet) return cost values that are ok to intepret as an absolute value? Even if this estimation is very vague it would be much better than what we have today to decide on the filter strategy to use. Moreover I think this could also prove valuable in CachingWrapperFilter to decide on the cacheable impl to use: some impls would likely better at high-cardinality and vice-versa?
          Hide
          Robert Muir added a comment -

          I don't think the documentation prevents it, its just this would be the first use of it.

          so I am +1, I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          Show
          Robert Muir added a comment - I don't think the documentation prevents it, its just this would be the first use of it. so I am +1, I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?
          Hide
          Uwe Schindler added a comment -

          Thanks for removing the crappy extra leapfrog iterator that was just there to deliver the firstFilterDoc (because it was consumed before). I am unfortunately not completely familar with the absolute or non-absolute cost values to make a good guess if cost * 100 > maxDoc is fine...

          Show
          Uwe Schindler added a comment - Thanks for removing the crappy extra leapfrog iterator that was just there to deliver the firstFilterDoc (because it was consumed before). I am unfortunately not completely familar with the absolute or non-absolute cost values to make a good guess if cost * 100 > maxDoc is fine...
          Hide
          Uwe Schindler added a comment -

          I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          LUCENE-5441 was opened to solve this. I just did not finish it, but the patch should still work as base for extending.

          Show
          Uwe Schindler added a comment - I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet? LUCENE-5441 was opened to solve this. I just did not finish it, but the patch should still work as base for extending.
          Hide
          Adrien Grand added a comment -

          I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet?

          LUCENE-5938 makes sure to not use FixedBitSet when the doc ids to store are sparse. I think the only last bad guys are ChainedFilter (which I proposed to remove) and BooleanFilter (that could be fixed). So I don't think this should be a blocker for this issue?

          Show
          Adrien Grand added a comment - I just am worried about 'crappy' implementations though, like FixedBitSet. Must that one really implement DocIDSet? LUCENE-5938 makes sure to not use FixedBitSet when the doc ids to store are sparse. I think the only last bad guys are ChainedFilter (which I proposed to remove) and BooleanFilter (that could be fixed). So I don't think this should be a blocker for this issue?
          Hide
          Robert Muir added a comment -

          ok.

          Show
          Robert Muir added a comment - ok.
          Hide
          Adrien Grand added a comment -

          If there are no concerns anymore, I will commit this patch soon.

          Show
          Adrien Grand added a comment - If there are no concerns anymore, I will commit this patch soon.
          Hide
          ASF subversion and git services added a comment -

          Commit 1629598 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1629598 ]

          LUCENE-5979: Use the cost API to decide on whether to use random-access to intersect queries and filters.

          Show
          ASF subversion and git services added a comment - Commit 1629598 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1629598 ] LUCENE-5979 : Use the cost API to decide on whether to use random-access to intersect queries and filters.
          Hide
          ASF subversion and git services added a comment -

          Commit 1629599 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1629599 ]

          LUCENE-5979: Use the cost API to decide on whether to use random-access to intersect queries and filters.

          Show
          ASF subversion and git services added a comment - Commit 1629599 from Adrien Grand in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1629599 ] LUCENE-5979 : Use the cost API to decide on whether to use random-access to intersect queries and filters.
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Adrien Grand
            • Votes:
              2 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development