Lucene - Core
  1. Lucene - Core
  2. LUCENE-1296

Allow use of compact DocIdSet in CachingWrapperFilter

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Extends CachingWrapperFilter with a protected method to determine the DocIdSet to be cached.

      1. cachedFilter20080529.patch
        7 kB
        Paul Elschot
      2. cachedFilter20080605.patch
        4 kB
        Paul Elschot
      3. LUCENE-1296.patch
        3 kB
        Michael McCandless
      4. LUCENE-1296.patch
        3 kB
        Paul Elschot
      5. LUCENE-1296b.patch
        4 kB
        Paul Elschot

        Activity

        Hide
        Paul Elschot added a comment - - edited

        The 20080529 patch patches CachingWrapperFilter and its test to add a choice of a compact filter to be cached, as well as some recently patched ( LUCENE-1187 ) contrib filter classes to remove the corresponding functionality there.

        Show
        Paul Elschot added a comment - - edited The 20080529 patch patches CachingWrapperFilter and its test to add a choice of a compact filter to be cached, as well as some recently patched ( LUCENE-1187 ) contrib filter classes to remove the corresponding functionality there.
        Hide
        Paul Elschot added a comment - - edited

        Once more, with licence and copyright assigned to ASF.

        Show
        Paul Elschot added a comment - - edited Once more, with licence and copyright assigned to ASF.
        Hide
        Paul Elschot added a comment -

        I tried to come up with a sensible performance test to determine a good criterium to choose between OpenBitSet and SortedVIntList as the DocIdSet supporting data structure to be cached.
        There is a criterium for this in the patch in docIdSetToCache() method of CachingWrapperFilter, but it's only based on byte size, and it favours SortedVIntList when it is defenitely more compact than OpenBitSet.

        The current criterium is to use (cardinality (=nr bits set in OpenBitSet) < maxDocs/9) as a test to prefer SortedVIntList over OpenBitSet for caching. The constant 9 might be replaced by a configuration parameter to allow easy performance experiments there. It could be that a larger value than 9 is turns out to be "optimal" in runtime.

        In some cases OpenBitSet can be faster on skipTo(int docNum) than SortedVIntList, even when SortedVIntList is more compact. As Filters can be expected to use skipTo() heavily, this could be important for performance.

        Even even though it might be possible to measure the skipTo() performance directly, the effect of the more compact cached data structure of SortedVIntList on garbage collection is (pretty close to) impossible to measure in a simple test case.

        Eks Dev had some interesting results there in the very early stages of LUCENE-584 (September 2006), so I wonder whether these results could be confirmed somehow using the patch here and the current trunk.

        Comments?

        Show
        Paul Elschot added a comment - I tried to come up with a sensible performance test to determine a good criterium to choose between OpenBitSet and SortedVIntList as the DocIdSet supporting data structure to be cached. There is a criterium for this in the patch in docIdSetToCache() method of CachingWrapperFilter, but it's only based on byte size, and it favours SortedVIntList when it is defenitely more compact than OpenBitSet. The current criterium is to use (cardinality (=nr bits set in OpenBitSet) < maxDocs/9) as a test to prefer SortedVIntList over OpenBitSet for caching. The constant 9 might be replaced by a configuration parameter to allow easy performance experiments there. It could be that a larger value than 9 is turns out to be "optimal" in runtime. In some cases OpenBitSet can be faster on skipTo(int docNum) than SortedVIntList, even when SortedVIntList is more compact. As Filters can be expected to use skipTo() heavily, this could be important for performance. Even even though it might be possible to measure the skipTo() performance directly, the effect of the more compact cached data structure of SortedVIntList on garbage collection is (pretty close to) impossible to measure in a simple test case. Eks Dev had some interesting results there in the very early stages of LUCENE-584 (September 2006), so I wonder whether these results could be confirmed somehow using the patch here and the current trunk. Comments?
        Hide
        Paul Elschot added a comment -

        For the record: the patch of 20080529 leaves some imports of SortedVIntList unused.

        Show
        Paul Elschot added a comment - For the record: the patch of 20080529 leaves some imports of SortedVIntList unused.
        Hide
        Paul Elschot added a comment -

        In the 20080605 patch the docIdSetToCache method simply returns its argument, which would normally be an OpenBitSet when using a Filter from the core. Anyone who wants to have another filter data structure cached can override this method.

        Show
        Paul Elschot added a comment - In the 20080605 patch the docIdSetToCache method simply returns its argument, which would normally be an OpenBitSet when using a Filter from the core. Anyone who wants to have another filter data structure cached can override this method.
        Hide
        Paul Elschot added a comment -

        Once more, with licence granted to ASF.

        Show
        Paul Elschot added a comment - Once more, with licence granted to ASF.
        Hide
        Paul Elschot added a comment -

        I think this is ready to commit.
        Even so, I'd like to repeat that it also removes the choice of another DocIdSet in the contrib search/BooleanFilter and misc/ChainedFilter.
        In the latest patch, the actual choice of another DocIdSet is left to subclasses of CachingWrapperFilter that are not implemented.

        One detail: for maximum flexibility, the 2nd argument to the protected method docIdSetToCache might be replaced by the indexreader. At the moment the 2nd arg is maxDoc of the indexreader, which is the only info I have needed so far to chose another DocIdSet.

        Show
        Paul Elschot added a comment - I think this is ready to commit. Even so, I'd like to repeat that it also removes the choice of another DocIdSet in the contrib search/BooleanFilter and misc/ChainedFilter. In the latest patch, the actual choice of another DocIdSet is left to subclasses of CachingWrapperFilter that are not implemented. One detail: for maximum flexibility, the 2nd argument to the protected method docIdSetToCache might be replaced by the indexreader. At the moment the 2nd arg is maxDoc of the indexreader, which is the only info I have needed so far to chose another DocIdSet.
        Hide
        Paul Elschot added a comment - - edited

        This didn't make it into 2.4.

        A side effect of that is that SortedVIntList will sometimes be used instead of an OpenBitSet in contrib search/BooleanFilter and misc/ChainedFilter. So far no problems have surfaced there, which is good news. I wonder if there was any performance improvement in practice because of this.

        Show
        Paul Elschot added a comment - - edited This didn't make it into 2.4. A side effect of that is that SortedVIntList will sometimes be used instead of an OpenBitSet in contrib search/BooleanFilter and misc/ChainedFilter. So far no problems have surfaced there, which is good news. I wonder if there was any performance improvement in practice because of this.
        Hide
        Michael McCandless added a comment -

        Sigh. Should have marked it as 2.4 fix version

        One detail: for maximum flexibility, the 2nd argument to the protected method docIdSetToCache might be replaced by the indexreader.

        +1

        Show
        Michael McCandless added a comment - Sigh. Should have marked it as 2.4 fix version One detail: for maximum flexibility, the 2nd argument to the protected method docIdSetToCache might be replaced by the indexreader. +1
        Hide
        Michael McCandless added a comment -

        I guess we should deprecate the finalResult methods in BooleanFilter & ChainedFilter, and suggest using CachingWrapperFilter instead?

        Also, why not put your logic to sometimes choose a SortedVIntList impl into CachingWrapperFilter.docIdSetToCache by default?

        Show
        Michael McCandless added a comment - I guess we should deprecate the finalResult methods in BooleanFilter & ChainedFilter, and suggest using CachingWrapperFilter instead? Also, why not put your logic to sometimes choose a SortedVIntList impl into CachingWrapperFilter.docIdSetToCache by default?
        Hide
        Paul Elschot added a comment -

        Also, why not put your logic to sometimes choose a SortedVIntList impl intoCachingWrapperFilter.docIdSetToCache by default?

        The main reason is that there are circumstances under which skipTo() is faster on an OpenBitSet than on a SortedVIntList. OpenBitSet allows random access, so it can start the skip from any point, but SortedVIntList can only start the skip from its current position.
        OTOH SortedVIntList does have the advantage of being smaller when the set is sparse, and this may bring garbage collection advantages.
        In all, not completely convincing either way.

        Show
        Paul Elschot added a comment - Also, why not put your logic to sometimes choose a SortedVIntList impl intoCachingWrapperFilter.docIdSetToCache by default? The main reason is that there are circumstances under which skipTo() is faster on an OpenBitSet than on a SortedVIntList. OpenBitSet allows random access, so it can start the skip from any point, but SortedVIntList can only start the skip from its current position. OTOH SortedVIntList does have the advantage of being smaller when the set is sparse, and this may bring garbage collection advantages. In all, not completely convincing either way.
        Hide
        Michael McCandless added a comment -

        Paul are you going to pull together another patch here?

        Show
        Michael McCandless added a comment - Paul are you going to pull together another patch here?
        Hide
        Paul Elschot added a comment -

        Adds a docIdSetToCache method to CachingWrapperFilter.
        Removes the choice of a compact underlying data structure from contrib filters.

        Show
        Paul Elschot added a comment - Adds a docIdSetToCache method to CachingWrapperFilter. Removes the choice of a compact underlying data structure from contrib filters.
        Hide
        Michael McCandless added a comment -

        It looks like the patch removed finalResult from contrib's ChainedFilter but not from contrib's BooleanFilter?

        Show
        Michael McCandless added a comment - It looks like the patch removed finalResult from contrib's ChainedFilter but not from contrib's BooleanFilter?
        Hide
        Paul Elschot added a comment -

        This time with finalResult() also removed from BooleanFilter.

        Show
        Paul Elschot added a comment - This time with finalResult() also removed from BooleanFilter.
        Hide
        Michael McCandless added a comment -

        Paul, I reverted the changes to ChainedFilter & BooleanFilter, and instead deprecated the new finalResult method. It's dangerous to just remove protected methods since on upgrading there will be no errors but, silently, the finalResult method will no longer be called. I think in 3.0 when we remove these methods, rather than simply removing them we should actually mark them final such that any subclasses still using them will see hard compilation errors.

        Can you look over the new patch?

        Show
        Michael McCandless added a comment - Paul, I reverted the changes to ChainedFilter & BooleanFilter, and instead deprecated the new finalResult method. It's dangerous to just remove protected methods since on upgrading there will be no errors but, silently, the finalResult method will no longer be called. I think in 3.0 when we remove these methods, rather than simply removing them we should actually mark them final such that any subclasses still using them will see hard compilation errors. Can you look over the new patch?
        Hide
        Paul Elschot added a comment -

        Indeed it is better to be conservative about released things as in today's patch.

        Show
        Paul Elschot added a comment - Indeed it is better to be conservative about released things as in today's patch.
        Hide
        Michael McCandless added a comment -

        Committed revision 722174.

        Thanks Paul!

        Show
        Michael McCandless added a comment - Committed revision 722174. Thanks Paul!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Paul Elschot
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development