Lucene - Core
  1. Lucene - Core
  2. LUCENE-5101

make it easier to plugin different bitset implementations to CachingWrapperFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently this is possible, but its not so friendly:

        protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
          if (docIdSet == null) {
            // this is better than returning null, as the nonnull result can be cached
            return EMPTY_DOCIDSET;
          } else if (docIdSet.isCacheable()) {
            return docIdSet;
          } else {
            final DocIdSetIterator it = docIdSet.iterator();
            // null is allowed to be returned by iterator(),
            // in this case we wrap with the sentinel set,
            // which is cacheable.
            if (it == null) {
              return EMPTY_DOCIDSET;
            } else {
      /* INTERESTING PART */
              final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
              bits.or(it);
              return bits;
      /* END INTERESTING PART */
            }
          }
        }
      

      Is there any value to having all this other logic in the protected API? It seems like something thats not useful for a subclass... Maybe this stuff can become final, and "INTERESTING PART" calls a simpler method, something like:

      protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) {
        final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        bits.or(iterator);
        return bits;
      }
      
      1. LUCENE-5101.patch
        3 kB
        Robert Muir
      2. DocIdSetBenchmark.java
        8 kB
        Adrien Grand
      3. LUCENE-5101.patch
        7 kB
        Robert Muir
      4. LUCENE-5101.patch
        24 kB
        Adrien Grand
      5. LUCENE-5101.patch
        24 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close
        Hide
        Adrien Grand added a comment -

        Committed, thanks Robert!

        Show
        Adrien Grand added a comment - Committed, thanks Robert!
        Hide
        ASF subversion and git services added a comment -

        Commit 1520527 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1520527 ]

        LUCENE-5101: Make it easier to plugin different bitset implementations to CachingWrapperFilter.

        Show
        ASF subversion and git services added a comment - Commit 1520527 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1520527 ] LUCENE-5101 : Make it easier to plugin different bitset implementations to CachingWrapperFilter.
        Hide
        ASF subversion and git services added a comment -

        Commit 1520525 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1520525 ]

        LUCENE-5101: Make it easier to plugin different bitset implementations to CachingWrapperFilter.

        Show
        ASF subversion and git services added a comment - Commit 1520525 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1520525 ] LUCENE-5101 : Make it easier to plugin different bitset implementations to CachingWrapperFilter.
        Hide
        Adrien Grand added a comment -

        Good point, here is an updated patch with an additional comment explaining the difference with CWF.

        Show
        Adrien Grand added a comment - Good point, here is an updated patch with an additional comment explaining the difference with CWF.
        Hide
        Robert Muir added a comment -

        ok, makes sense: thanks. maybe we could add this to the docs or in a code comment?

        patch looks good to me.

        Show
        Robert Muir added a comment - ok, makes sense: thanks. maybe we could add this to the docs or in a code comment? patch looks good to me.
        Hide
        Adrien Grand added a comment -

        This CWF is a bit different in that if it gets a cacheable filter which is not a FixedBitSet, it will convert it to a FixedBitSet instead of keeping it as-is. I think this is required for instance if users have custom filter implementations that return cacheable filters which are not a FixedBitSet (some filters return cacheable doc id sets, for instance TermsFilter could be backed by something else than a FixedBitSet in the future).

        Show
        Adrien Grand added a comment - This CWF is a bit different in that if it gets a cacheable filter which is not a FixedBitSet, it will convert it to a FixedBitSet instead of keeping it as-is. I think this is required for instance if users have custom filter implementations that return cacheable filters which are not a FixedBitSet (some filters return cacheable doc id sets, for instance TermsFilter could be backed by something else than a FixedBitSet in the future).
        Hide
        Robert Muir added a comment -

        Can the FixedBitSetCachingWrapperFilter.java just override cacheImpl instead of all the boilerplate?

        Show
        Robert Muir added a comment - Can the FixedBitSetCachingWrapperFilter.java just override cacheImpl instead of all the boilerplate?
        Hide
        Adrien Grand added a comment -

        Indeed! Here is a complete patch.

        Show
        Adrien Grand added a comment - Indeed! Here is a complete patch.
        Hide
        Robert Muir added a comment -

        did you forget to 'svn add'?

        Show
        Robert Muir added a comment - did you forget to 'svn add'?
        Hide
        Adrien Grand added a comment -

        I iterated on Robert's patch to fix failures in the join module which were due to the fact that CWF doesn't always produce FixedBitSets anymore.

        I added a new FixedBitSetCachingWrapperFilter in oal.search.join which extends CWF to always produce FixedBitSets, as required by To(Parent|Child)BlockJoinQuery.

        All tests passed.

        Show
        Adrien Grand added a comment - I iterated on Robert's patch to fix failures in the join module which were due to the fact that CWF doesn't always produce FixedBitSets anymore. I added a new FixedBitSetCachingWrapperFilter in oal.search.join which extends CWF to always produce FixedBitSets, as required by To(Parent|Child)BlockJoinQuery. All tests passed.
        Hide
        Paul Elschot added a comment -

        Patch looks good to me, too.

        Hopefully we'll get some early feedback about performance.

        Show
        Paul Elschot added a comment - Patch looks good to me, too. Hopefully we'll get some early feedback about performance.
        Hide
        Adrien Grand added a comment -

        Patch looks good to me. Since join users are going to need fixed bit sets, maybe we should have a FixedBitSetCachingWrapperFilter so that they don't need to instantiate an anynous class to override cacheImpl every time?

        Show
        Adrien Grand added a comment - Patch looks good to me. Since join users are going to need fixed bit sets, maybe we should have a FixedBitSetCachingWrapperFilter so that they don't need to instantiate an anynous class to override cacheImpl every time?
        Hide
        Robert Muir added a comment -

        Updated patch that just defaults to compressing but with the same protected method. I also added some tests to TestCWF (for the corner cases of the boilerplate stuff).

        Show
        Robert Muir added a comment - Updated patch that just defaults to compressing but with the same protected method. I also added some tests to TestCWF (for the corner cases of the boilerplate stuff).
        Hide
        Robert Muir added a comment -

        I think that would be a reasonable default.

        I still think it would be good to improve the API in a way that makes it easier for someone to add their own heuristics/logic/subclasses and use different implementations. Its just unfortunate there is some boilerplate and cornercases they have to worry about (null DocIDSet, null Iterator, etc etc).

        The way the patch does it is just one possibility (we could also just move forward with a better default and not change the API, too).

        Show
        Robert Muir added a comment - I think that would be a reasonable default. I still think it would be good to improve the API in a way that makes it easier for someone to add their own heuristics/logic/subclasses and use different implementations. Its just unfortunate there is some boilerplate and cornercases they have to worry about (null DocIDSet, null Iterator, etc etc). The way the patch does it is just one possibility (we could also just move forward with a better default and not change the API, too).
        Hide
        Adrien Grand added a comment -

        I just updated http://people.apache.org/~jpountz/doc_id_sets.html based on the latest updated on WAH8DocIdSet and PFORDeltaDocIdSet and added the building time to the charts. PFORDeltaDocIdSet now better compresses doc id sets of medium load factors (~ 0.7) by switching to unary coding when it saves space compared to PFor. This way, it never grows much larger than a FixedBitSet. Similarly, WAH8DocIdSet now better compresses dense sets (LUCENE-5150).

        I wish I had more time to work on the EliasFanoDocIdSet to add an index to it and see how it behaves. It has interesting characteristics!

        Regarding this issue, I think we could update the patch to always use WAH8DocIdSet? I think the most interesting benchmark is advance(31) since it maps to a Scorer that performs leap frog between a query and a filter that both contain lots of documents (hence the query execution is slow) and WAH8DocIdSet is the one which has the best worst case here?

        Show
        Adrien Grand added a comment - I just updated http://people.apache.org/~jpountz/doc_id_sets.html based on the latest updated on WAH8DocIdSet and PFORDeltaDocIdSet and added the building time to the charts. PFORDeltaDocIdSet now better compresses doc id sets of medium load factors (~ 0.7) by switching to unary coding when it saves space compared to PFor. This way, it never grows much larger than a FixedBitSet. Similarly, WAH8DocIdSet now better compresses dense sets ( LUCENE-5150 ). I wish I had more time to work on the EliasFanoDocIdSet to add an index to it and see how it behaves. It has interesting characteristics! Regarding this issue, I think we could update the patch to always use WAH8DocIdSet? I think the most interesting benchmark is advance(31) since it maps to a Scorer that performs leap frog between a query and a filter that both contain lots of documents (hence the query execution is slow) and WAH8DocIdSet is the one which has the best worst case here?
        Hide
        Paul Elschot added a comment -

        I simplified the benchmark somewhat and ran it on my 32 bit home machine.

        The relative results for EliasFanoDocIdSet are very similar to the ones above, so this benchmark is a good measuring tool for me. I'll use it to verify performance of an index once it is working, LUCENE-5109.

        Show
        Paul Elschot added a comment - I simplified the benchmark somewhat and ran it on my 32 bit home machine. The relative results for EliasFanoDocIdSet are very similar to the ones above, so this benchmark is a good measuring tool for me. I'll use it to verify performance of an index once it is working, LUCENE-5109 .
        Hide
        Adrien Grand added a comment -

        Well spotted. Maybe I did a mistake when moving the data from the benchmark output to the charts. I modified the program so that it outputs directly the input of the charts. See the updated charts at http://people.apache.org/~jpountz/doc_id_sets.html. I also modified it so that memory uses a log scale too.

        Show
        Adrien Grand added a comment - Well spotted. Maybe I did a mistake when moving the data from the benchmark output to the charts. I modified the program so that it outputs directly the input of the charts. See the updated charts at http://people.apache.org/~jpountz/doc_id_sets.html . I also modified it so that memory uses a log scale too.
        Hide
        Paul Elschot added a comment - - edited

        I had another look at the recent benchmark results and something does not seem in order there.

        At density -2 (1%), Elias-Fano is faster at advance(docID() +1) (2.45 times fixed) than at nextDoc() (1.81 times fixed), and I would expect that FixedBitSet would have an almost equal run times for advance(docId()+1) and nextDoc().

        The code for advance (advanceToValue in EliasFanoDecoder) is really more complex than the code for nextDoc (nextValue in EliasFanoDecoder) and the code at EliasFanoDocIdSet is so simple that it should not really influence things here.
        So for EliasFanoDocIdSet advance(docId() + 1) should normally be slower than nextDoc(), but the benchmark contradicts this.

        Could there be a mistake in the benchmark for these cases? Or is this within expected (JIT) tolerances?

        Show
        Paul Elschot added a comment - - edited I had another look at the recent benchmark results and something does not seem in order there. At density -2 (1%), Elias-Fano is faster at advance(docID() +1) (2.45 times fixed) than at nextDoc() (1.81 times fixed), and I would expect that FixedBitSet would have an almost equal run times for advance(docId()+1) and nextDoc(). The code for advance (advanceToValue in EliasFanoDecoder) is really more complex than the code for nextDoc (nextValue in EliasFanoDecoder) and the code at EliasFanoDocIdSet is so simple that it should not really influence things here. So for EliasFanoDocIdSet advance(docId() + 1) should normally be slower than nextDoc(), but the benchmark contradicts this. Could there be a mistake in the benchmark for these cases? Or is this within expected (JIT) tolerances?
        Hide
        Adrien Grand added a comment -

        Do WAH8 and PFOR already have an index?

        They do, but the index is naive: it is a plain binary search over a subset of the (docID,position) pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee O(log(cardinality)) advance performance.

        Block decoding might still be added to EliasFano, which should improve its nextDoc() performance

        The main use-case I see for these sets is to be used as filters. So I think advance() performance is more important?

        The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

        Well, the PFOR doc ID set is not tuned either. But I agree this is a good surprise for the Elias-Fano set. I mean even the WAH8 doc id set should be pretty fast and is still slower than the Elias-Fano set.

        Another surprise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then.

        I'm looking forward for the index.

        For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

        I guess so.

        Show
        Adrien Grand added a comment - Do WAH8 and PFOR already have an index? They do, but the index is naive: it is a plain binary search over a subset of the (docID,position) pairs contained in the set. With the first versions of these DocIdSets, I just wanted to guarantee O(log(cardinality)) advance performance. Block decoding might still be added to EliasFano, which should improve its nextDoc() performance The main use-case I see for these sets is to be used as filters. So I think advance() performance is more important? The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR. Well, the PFOR doc ID set is not tuned either. But I agree this is a good surprise for the Elias-Fano set. I mean even the WAH8 doc id set should be pretty fast and is still slower than the Elias-Fano set. Another surprise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then. I'm looking forward for the index. For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper? I guess so.
        Hide
        Paul Elschot added a comment - - edited

        I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html

        Thank you very much, plenty of dilemma's ahead. Do WAH8 and PFOR already have an index?
        With an index, each of these, including Elias-Fano, should have about constant access time when advancing far enough. What that constant time will be is still open.

        Block decoding might still be added to EliasFano, which should improve its nextDoc() performance, but I have no idea by how much. See also at LUCENE-2750 for Kamikaze PFOR.
        The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR.

        Another surpise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then.

        For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?

        Show
        Paul Elschot added a comment - - edited I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html Thank you very much, plenty of dilemma's ahead. Do WAH8 and PFOR already have an index? With an index, each of these, including Elias-Fano, should have about constant access time when advancing far enough. What that constant time will be is still open. Block decoding might still be added to EliasFano, which should improve its nextDoc() performance, but I have no idea by how much. See also at LUCENE-2750 for Kamikaze PFOR. The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor two worse than PFOR. Another surpise is that Elias-Fano is best at advance() among the compressed sets for some cases. That means that Long.bitCount() is doing well on the upper bits then. For bit densities > 1/2 there is clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?
        Hide
        Robert Muir added a comment -

        I was thinking about a heuristic like this...

        Show
        Robert Muir added a comment - I was thinking about a heuristic like this...
        Hide
        Robert Muir added a comment -

        Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example)

        +1: we should have better defaults. Ideally we would use DISI.cost() to estimate the sparsity.

        One problem is a lot of the costly filters that people want to cache have a crap cost() implementation.
        e.g. MultiTermQueryWrapperFilter could instead getAndSet() and return a DISI with an actual accurate cost().

        Or instead for now, we could also check firstDocID too...

        Show
        Robert Muir added a comment - Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example) +1: we should have better defaults. Ideally we would use DISI.cost() to estimate the sparsity. One problem is a lot of the costly filters that people want to cache have a crap cost() implementation. e.g. MultiTermQueryWrapperFilter could instead getAndSet() and return a DISI with an actual accurate cost(). Or instead for now, we could also check firstDocID too...
        Hide
        Adrien Grand added a comment -

        A quick note about alternative DocIdSet we now have. I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html Please note that EliasFanoDocIdSet is disadvantaged for advance() since it doesn't have an index yet, it will be interesting to run this benchmark again when it gets one.

        Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example)

        Show
        Adrien Grand added a comment - A quick note about alternative DocIdSet we now have. I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html Please note that EliasFanoDocIdSet is disadvantaged for advance() since it doesn't have an index yet, it will be interesting to run this benchmark again when it gets one. Maybe we could use these numbers to have better defaults in CWF? (and only use FixedBitSet for dense sets for example)
        Hide
        Adrien Grand added a comment -

        +1 The two new impls at least can be constructed from a DocIdSetIterator.

        Maybe we should have another protected method for what to do in case the doc ID set returned by the filter is already cacheable (by default it would return the set directly). Some filters return doc id sets which are already cacheable (TermsFilter for example) and it could be useful to make sure a CWF always returns the same impl (for instance block joins want a FixedBitSet).

        Show
        Adrien Grand added a comment - +1 The two new impls at least can be constructed from a DocIdSetIterator. Maybe we should have another protected method for what to do in case the doc ID set returned by the filter is already cacheable (by default it would return the set directly). Some filters return doc id sets which are already cacheable (TermsFilter for example) and it could be useful to make sure a CWF always returns the same impl (for instance block joins want a FixedBitSet).
        Hide
        Robert Muir added a comment -

        I suppose taking DocIDSet is more flexible, in case you can build it in a faster way.
        But it would be good to make this simpler (less boilerplate) without losing the flexibility.

        Maybe a bigger change is needed: we have quite a few impls now and CWF is messy:
        for example I think EMPTY_DOCIDSET should not exist here in CWF, but instead be DocIDSet.EMPTY...

        Anyway this is just to hopefully encourage discussion about how we can make it easier to use the new impls

        Show
        Robert Muir added a comment - I suppose taking DocIDSet is more flexible, in case you can build it in a faster way. But it would be good to make this simpler (less boilerplate) without losing the flexibility. Maybe a bigger change is needed: we have quite a few impls now and CWF is messy: for example I think EMPTY_DOCIDSET should not exist here in CWF, but instead be DocIDSet.EMPTY... Anyway this is just to hopefully encourage discussion about how we can make it easier to use the new impls

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development