Lucene - Core
  1. Lucene - Core
  2. LUCENE-1187

Things to be done now that Filter is independent from BitSet

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: core/search, modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      (Aside: where is the documentation on how to mark up text in jira comments?)

      The following things are left over after LUCENE-584 :

      For Lucene 3.0 Filter.bits() will have to be removed.

      There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the boolean behaviour of a Filter.

      I have not looked into Filter caching yet, but I suppose there will be some room for improvement there.
      Iirc the current core has moved to use OpenBitSetFilter and that is probably what is being cached.
      In some cases it might be better to cache a SortedVIntList instead.

      Boolean logic on DocIdSetIterator is already available for Scorers (that inherit from DocIdSetIterator) in the search package. This is currently implemented by ConjunctionScorer, DisjunctionSumScorer,
      ReqOptSumScorer and ReqExclScorer.
      Boolean logic on BitSets is available in contrib/misc and contrib/queries

      DisjunctionSumScorer calls score() on its subscorers before the score value actually needed.
      This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as a superclass of DisjunctionSumScorer.

      To fully implement non scoring queries a TermDocIdSetIterator will be needed, perhaps as a superclass of TermScorer.

      The javadocs in org.apache.lucene.search using matching vs non-zero score:
      I'll investigate this soon, and provide a patch when necessary.

      An early version of the patches of LUCENE-584 contained a class Matcher,
      that differs from the current DocIdSet in that Matcher has an explain() method.
      It remains to be seen whether such a Matcher could be useful between
      DocIdSet and Scorer.

      The semantics of scorer.skipTo(scorer.doc()) was discussed briefly.
      This was also discussed at another issue recently, so perhaps it is wortwhile to open a separate issue for this.

      Skipping on a SortedVIntList is done using linear search, this could be improved by adding multilevel skiplist info much like in the Lucene index for documents containing a term.

      One comment by me of 3 Dec 2008:

      A few complete (test) classes are deprecated, it might be good to add the target release for removal there.

      1. BooleanFilter20080325.patch
        17 kB
        Paul Elschot
      2. ChainedFilterAndCachingFilterTest.patch
        2 kB
        Mark Miller
      3. Contrib20080325.patch
        26 kB
        Paul Elschot
      4. Contrib20080326.patch
        27 kB
        Paul Elschot
      5. Contrib20080427.patch
        28 kB
        Paul Elschot
      6. javadocsZero2Match.patch
        2 kB
        Paul Elschot
      7. lucene-1187.patch
        48 kB
        Michael Busch
      8. lucene-1187.patch
        32 kB
        Michael Busch
      9. OpenBitSetDISI-20080322.patch
        2 kB
        Paul Elschot

        Activity

        Hide
        Paul Elschot added a comment -

        I did something wrong here, I wanted to review the text above before posting it.
        I'm sorry about that, I'll just continue here, when it gets too messy, another jira issue can easily be opened.

        Show
        Paul Elschot added a comment - I did something wrong here, I wanted to review the text above before posting it. I'm sorry about that, I'll just continue here, when it gets too messy, another jira issue can easily be opened.
        Hide
        Paul Elschot added a comment -

        Iirc the boolean logic on contrib/queries is defined in two places: ChainedFilter and BooleanFilter. Ideally these could me merged and their functions be implemented by the DocIdSetIterators underlying the current scorers used by BooleanScorer2 (Conjunction/Disjunction/ReqOpt/ReqExcl). See also the comments of Micheal Bush and Eks Dev at the end of Jan 2008 at LUCENE-584.

        That's it for the moment. Sorry for the mess, at least it should save others from reviewing all the comments at LUCENE-584. I hope I have not missed anything...

        Show
        Paul Elschot added a comment - Iirc the boolean logic on contrib/queries is defined in two places: ChainedFilter and BooleanFilter. Ideally these could me merged and their functions be implemented by the DocIdSetIterators underlying the current scorers used by BooleanScorer2 (Conjunction/Disjunction/ReqOpt/ReqExcl). See also the comments of Micheal Bush and Eks Dev at the end of Jan 2008 at LUCENE-584 . That's it for the moment. Sorry for the mess, at least it should save others from reviewing all the comments at LUCENE-584 . I hope I have not missed anything...
        Hide
        Eks Dev added a comment -

        Paul, I think there is one CHEKME in DisjunctionSumScorer I have stumbled upon recently when I realized
        (token1+ token2+) query works way faster than (token1 token2).setMinimumSholdMatch(2). It is not directly related to the LUCENE-584, but just as a reminder.

        also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer, :

            // If first-time skip distance is any predictor of
            // scorer sparseness, then we should always try to skip first on
            // those scorers.
            // Keep last scorer in it's last place (it will be the first
            // to be skipped on), but reverse all of the others so that
            // they will be skipped on in order of original high skip.
            int end=(scorers.length-1)-1;
            for (int i=0; i<(end>>1); i++) {
              Scorer tmp = scorers[i];
              scorers[i] = scorers[end-i];
              scorers[end-i] = tmp;
            }
        

        It has not been detected so far as it has only performance implications (I think?), and it sometimes works and sometimes not, depending on number of scorers:

        to see what I am talking about, try this "simulator":

          public static void main(String[] args) {
            int[] scorers = new int[7]; //3 and 7 do not work
           
            for (int i=0; i<scorers.length; i++) {
              scorers[i]=i;
            }
           
            System.out.println(Arrays.toString(scorers));
           
           
            int end=(scorers.length-1)-1;
            for (int i=0; i<(end>>1); i++) {
              int tmp = scorers[i];
              scorers[i] = scorers[end-i];
              scorers[end-i] = tmp;
            }
        
            System.out.println(Arrays.toString(scorers));
        
          }
        
        

        for 7 you get:
        [0, 1, 2, 3, 4, 5, 6]
        [5, 4, 2, 3, 1, 0, 6]

        instead of [5, 4, 3, 2, 1, 0, 6]

        and for 3
        [0, 1, 2]
        [0, 1, 2] (should be [1, 0, 2])

        Show
        Eks Dev added a comment - Paul, I think there is one CHEKME in DisjunctionSumScorer I have stumbled upon recently when I realized (token1+ token2+) query works way faster than (token1 token2).setMinimumSholdMatch(2). It is not directly related to the LUCENE-584 , but just as a reminder. also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer, : // If first-time skip distance is any predictor of // scorer sparseness, then we should always try to skip first on // those scorers. // Keep last scorer in it's last place (it will be the first // to be skipped on), but reverse all of the others so that // they will be skipped on in order of original high skip. int end=(scorers.length-1)-1; for ( int i=0; i<(end>>1); i++) { Scorer tmp = scorers[i]; scorers[i] = scorers[end-i]; scorers[end-i] = tmp; } It has not been detected so far as it has only performance implications (I think?), and it sometimes works and sometimes not, depending on number of scorers: to see what I am talking about, try this "simulator": public static void main( String [] args) { int [] scorers = new int [7]; //3 and 7 do not work for ( int i=0; i<scorers.length; i++) { scorers[i]=i; } System .out.println(Arrays.toString(scorers)); int end=(scorers.length-1)-1; for ( int i=0; i<(end>>1); i++) { int tmp = scorers[i]; scorers[i] = scorers[end-i]; scorers[end-i] = tmp; } System .out.println(Arrays.toString(scorers)); } for 7 you get: [0, 1, 2, 3, 4, 5, 6] [5, 4, 2, 3, 1, 0, 6] instead of [5, 4, 3, 2, 1, 0, 6] and for 3 [0, 1, 2] [0, 1, 2] (should be [1, 0, 2] )
        Hide
        Paul Elschot added a comment -

        Eks,

        As both issues you mention are not related to filters, could you open a new issue for each of them?
        For the first issue: iirc BooleanScorer2 will use a ConjunctionScorer in the case when all clauses are actually required in a disjunction, so normal usage via BooleanQuery should not have a performance problem there.
        The second issue is beyond me at the moment.

        Regards,
        Paul Elschot

        Show
        Paul Elschot added a comment - Eks, As both issues you mention are not related to filters, could you open a new issue for each of them? For the first issue: iirc BooleanScorer2 will use a ConjunctionScorer in the case when all clauses are actually required in a disjunction, so normal usage via BooleanQuery should not have a performance problem there. The second issue is beyond me at the moment. Regards, Paul Elschot
        Hide
        Yonik Seeley added a comment -

        also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer

        Yup, I introduced that feature + bug. I just committed a fix that makes the code do what the comments say (no correctness implications, just perhaps minor performance).

        Show
        Yonik Seeley added a comment - also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer Yup, I introduced that feature + bug. I just committed a fix that makes the code do what the comments say (no correctness implications, just perhaps minor performance).
        Hide
        Paul Elschot added a comment -

        Attached javadocsZero2Match.patch that replaces the last few occurrences 'zero scoring' in the java docs of org.apache.lucene.search by 'matching'.

        Show
        Paul Elschot added a comment - Attached javadocsZero2Match.patch that replaces the last few occurrences 'zero scoring' in the java docs of org.apache.lucene.search by 'matching'.
        Hide
        Mark Miller added a comment -

        Test that now fails with ChainedFilter.

        java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet
        	at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55)
        	at org.apache.lucene.misc.ChainedFilter.doChain(ChainedFilter.java:258)
        	at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:193)
        	at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:156)
        	at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49)
        	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:141)
        
        Show
        Mark Miller added a comment - Test that now fails with ChainedFilter. java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55) at org.apache.lucene.misc.ChainedFilter.doChain(ChainedFilter.java:258) at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:193) at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:156) at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:141)
        Hide
        Michael Busch added a comment -

        Test that now fails with ChainedFilter.

        The reason apparently is that the core moved from BitSets to OpenBitSets,
        whereas the contrib packages haven't.

        If we change the contrib packages to also use OpenBitSets, then this is
        still not completely backwards compatible. For example, if a user upgrades
        to 2.4, uses a ChainedFilter to combine a 2.4 core filter with their own
        custom Filter that is based on 2.3 and thus uses a BitSet, then it won't
        work. So a simple drop-in replacement with the new lucene jar would not
        be possible, the user would have to change their own filters.

        Maybe we should introduce a DocIdSetFactory in the core? For
        backwards compatibility a factory that produces BitSets can be used,
        for speed one that creates OpenBitSets. Thoughts?

        Show
        Michael Busch added a comment - Test that now fails with ChainedFilter. The reason apparently is that the core moved from BitSets to OpenBitSets, whereas the contrib packages haven't. If we change the contrib packages to also use OpenBitSets, then this is still not completely backwards compatible. For example, if a user upgrades to 2.4, uses a ChainedFilter to combine a 2.4 core filter with their own custom Filter that is based on 2.3 and thus uses a BitSet, then it won't work. So a simple drop-in replacement with the new lucene jar would not be possible, the user would have to change their own filters. Maybe we should introduce a DocIdSetFactory in the core? For backwards compatibility a factory that produces BitSets can be used, for speed one that creates OpenBitSets. Thoughts?
        Hide
        Eks Dev added a comment -

        Michael,
        I do not think we need to add Factory (for this particular reason), DocIdSet type should not be assumed as we could come up with smart ways to select optimal Filter representation depending on doc-id distribution, size...

        The only problem we have with is that contrib classes, ChainedFilter and BooleanFilter assume BitSet.
        And the solution for this would be to add just a few methods to the DocIdSet that are able to do AND/OR/NOT on DocIdSet[] using DocIdSetIterator()
        e.g.
        DocIdSet or(DocIdSet[], int minimumShouldMatch);
        DocIdSet or(DocIdSet[]);

        Optimized code for these basic operations already exists, can be copied from Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by just simply stripping-off scoring part.

        with these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter to work with DocIdSet (and that works on all implementations of Fileter/DocIdSet) is 10 minutes job... than, if needed this implementation can be optimized to cover type specific cases. Imo, BoolenFilter is better bet, we do not need both of them.

        Unfortunately I do not have time to play with it next 3-4 weeks, but should be no more than 2 days work (remember, we have difficult part already done in Scorers). Having so much code duplication is not something really good, but we can then later "merge" these somehow.

        Show
        Eks Dev added a comment - Michael, I do not think we need to add Factory (for this particular reason), DocIdSet type should not be assumed as we could come up with smart ways to select optimal Filter representation depending on doc-id distribution, size... The only problem we have with is that contrib classes, ChainedFilter and BooleanFilter assume BitSet. And the solution for this would be to add just a few methods to the DocIdSet that are able to do AND/OR/NOT on DocIdSet[] using DocIdSetIterator() e.g. DocIdSet or(DocIdSet[], int minimumShouldMatch); DocIdSet or(DocIdSet[]); Optimized code for these basic operations already exists , can be copied from Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by just simply stripping-off scoring part. with these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter to work with DocIdSet (and that works on all implementations of Fileter/DocIdSet) is 10 minutes job... than, if needed this implementation can be optimized to cover type specific cases. Imo, BoolenFilter is better bet, we do not need both of them. Unfortunately I do not have time to play with it next 3-4 weeks, but should be no more than 2 days work (remember, we have difficult part already done in Scorers). Having so much code duplication is not something really good, but we can then later "merge" these somehow.
        Hide
        Paul Elschot added a comment -

        I started adding getDocIdSet() to BooleanFilter of contrib/queries. When trying to collect the interim results into an OpenBitSet I soon needed OpenBitSet.conjunction(DocIdSetIterator), as well as similar disjunction() and exclusion() methods.

        Would it be ok to add such methods to Lucene's OpenBitSet, or would it be preferable to subclass OpenBitSet for this? At first sight I prefer subclassing, but I'd like to hear some opinions on this before going further.

        Show
        Paul Elschot added a comment - I started adding getDocIdSet() to BooleanFilter of contrib/queries. When trying to collect the interim results into an OpenBitSet I soon needed OpenBitSet.conjunction(DocIdSetIterator), as well as similar disjunction() and exclusion() methods. Would it be ok to add such methods to Lucene's OpenBitSet, or would it be preferable to subclass OpenBitSet for this? At first sight I prefer subclassing, but I'd like to hear some opinions on this before going further.
        Hide
        Paul Elschot added a comment -

        The OpenBitSetDISI-20080322.patch illustrates my previous question.
        DISI means DocIdSetIterator, for want of a better name.
        The patch compiles, but is untested.

        Show
        Paul Elschot added a comment - The OpenBitSetDISI-20080322.patch illustrates my previous question. DISI means DocIdSetIterator, for want of a better name. The patch compiles, but is untested.
        Hide
        Paul Elschot added a comment -

        For the record: the and() method in the patch misses a last statement: clear(index, size());

        Show
        Paul Elschot added a comment - For the record: the and() method in the patch misses a last statement: clear(index, size());
        Hide
        Paul Elschot added a comment -

        BooleanFilter20080325.patch contains a non BitSet version of BooleanFilter. The contrib/queries tests pass, and it includes a finished version of OpenBitSetDISI of a few days ago.
        I've also changed the indentation of all of BooleanFilter and added some minor refactorings. This makes the patch itself somewhat less easy to read, but I couldn't leave the indendation in different styles.

        Show
        Paul Elschot added a comment - BooleanFilter20080325.patch contains a non BitSet version of BooleanFilter. The contrib/queries tests pass, and it includes a finished version of OpenBitSetDISI of a few days ago. I've also changed the indentation of all of BooleanFilter and added some minor refactorings. This makes the patch itself somewhat less easy to read, but I couldn't leave the indendation in different styles.
        Hide
        Paul Elschot added a comment -

        All tests pass, except the one for ChainedFilter provided here by Mark Miller. Stay tuned.

        Show
        Paul Elschot added a comment - All tests pass, except the one for ChainedFilter provided here by Mark Miller. Stay tuned.
        Hide
        Paul Elschot added a comment -

        The Contrib20080325 also includes Mark Miller's test patch, and a non BitSet version of ChainedFilter. It passes the tests contrib/miscellaneous, no time left to run all tests.

        Contrib20080325.patch should supersede all patches currently here, except the javadoc patch.

        There are no type checks yet to use the inter OpenBitSet boolean operations directly, but at least it should work. Remember to profile before adding such optimisations, for sparse OpenBitSets this could well be competitive. Well, I'd hope so.

        Show
        Paul Elschot added a comment - The Contrib20080325 also includes Mark Miller's test patch, and a non BitSet version of ChainedFilter. It passes the tests contrib/miscellaneous, no time left to run all tests. Contrib20080325.patch should supersede all patches currently here, except the javadoc patch. There are no type checks yet to use the inter OpenBitSet boolean operations directly, but at least it should work. Remember to profile before adding such optimisations, for sparse OpenBitSets this could well be competitive. Well, I'd hope so.
        Hide
        Paul Elschot added a comment -

        Contrib20080326.patch: supersedes the 20080325 version. Generally the same as yesterday, some extensions:

        • fix a possible synchronisation issue by using a local int[1] array instead of an object int attribute,
        • return a SortedVIntList when it is definitely smaller than an OpenBitSet, the method doing this is protected.
        • all constructors in OpenBitSetDISI now also take a initial size argument
          (still called maxSize, perhaps better renamed to initialSize).

        Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList.

        ChainedFilter still has the 1.1 ASL, it's probably time to upgrade it, but I did not change it in the patch.

        Show
        Paul Elschot added a comment - Contrib20080326.patch: supersedes the 20080325 version. Generally the same as yesterday, some extensions: fix a possible synchronisation issue by using a local int [1] array instead of an object int attribute, return a SortedVIntList when it is definitely smaller than an OpenBitSet, the method doing this is protected. all constructors in OpenBitSetDISI now also take a initial size argument (still called maxSize, perhaps better renamed to initialSize). Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList. ChainedFilter still has the 1.1 ASL, it's probably time to upgrade it, but I did not change it in the patch.
        Hide
        Michael Busch added a comment -

        Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.

        Show
        Michael Busch added a comment - Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.
        Hide
        Paul Elschot added a comment -

        One thing: I added the max size parameter to the OpenBitSetDISI ctor rather late, so there is probably some room to use more of the fast... bit access methods of OpenBitSet.

        Show
        Paul Elschot added a comment - One thing: I added the max size parameter to the OpenBitSetDISI ctor rather late, so there is probably some room to use more of the fast... bit access methods of OpenBitSet.
        Hide
        Paul Elschot added a comment -

        Contrib20080427.patch is the same as the previous one from March, except for OpenBitSetDISI: added javadocs there and use fast... bit access methods consistently.

        Show
        Paul Elschot added a comment - Contrib20080427.patch is the same as the previous one from March, except for OpenBitSetDISI: added javadocs there and use fast... bit access methods consistently.
        Hide
        Mark Harwood added a comment -

        Paul,
        Good work.
        Just tried the patch and ran some pre and post-patch benchmarks.

        I wanted to measure the overhead of :
        the new OpenBitSetDISI.inPlaceOr(DocIdSetIterator)
        vs
        the previous scheme of BitSet.or(BitSet).

        My test was on the biggest index I have here which was 3 million Wikipedia docs. I had 2 cached TermFilters on very popular terms (500k docs in each) and was measuring the cost of combining these as 2 "shoulds" in a BooleanFilter.
        The expectation was the new scheme would add some overhead in extra method calls.

        The average cost of iterating across BooleanFilter.getDocIdSet() was:

        old BitSet scheme: 78 milliseconds
        new DISI scheme: 156 milliseconds.

        To address this I tried adding this optimisation into BooleanFilter...

        DocIdSet dis = ((Filter)shouldFilters.get).getDocIdSet(reader);
        if(dis instanceof OpenBitSet)

        { res.or((OpenBitSet) dis); // go-faster method }

        else

        { res.inPlaceOr(getDISI(shouldFilters, i, reader)); //your patch code }

        Before I could benchmark this I had to amend TermsFilter to use OpenBitSet rather than plain old BitSet

        avg speed of your patch with OpenBitSet-enabled TermFilter : 100 milliseconds
        avg speed of your patch with OpenBitSet-enabled TermFilter and above optimisation : 70 milliseconds

        I'll try and post a proper patch when I get more time to look at this...

        Cheers,
        Mark

        Show
        Mark Harwood added a comment - Paul, Good work. Just tried the patch and ran some pre and post-patch benchmarks. I wanted to measure the overhead of : the new OpenBitSetDISI.inPlaceOr(DocIdSetIterator) vs the previous scheme of BitSet.or(BitSet). My test was on the biggest index I have here which was 3 million Wikipedia docs. I had 2 cached TermFilters on very popular terms (500k docs in each) and was measuring the cost of combining these as 2 "shoulds" in a BooleanFilter. The expectation was the new scheme would add some overhead in extra method calls. The average cost of iterating across BooleanFilter.getDocIdSet() was: old BitSet scheme: 78 milliseconds new DISI scheme: 156 milliseconds. To address this I tried adding this optimisation into BooleanFilter... DocIdSet dis = ((Filter)shouldFilters.get ).getDocIdSet(reader); if(dis instanceof OpenBitSet) { res.or((OpenBitSet) dis); // go-faster method } else { res.inPlaceOr(getDISI(shouldFilters, i, reader)); //your patch code } Before I could benchmark this I had to amend TermsFilter to use OpenBitSet rather than plain old BitSet avg speed of your patch with OpenBitSet-enabled TermFilter : 100 milliseconds avg speed of your patch with OpenBitSet-enabled TermFilter and above optimisation : 70 milliseconds I'll try and post a proper patch when I get more time to look at this... Cheers, Mark
        Hide
        Paul Elschot added a comment -

        That sounds like the overhead of the DocIdSetIterator is never more than directly using an OpenBitSet in the dense cases that you tested so far (1 in 6 is more than 1 bit per byte).

        That means that a DocIdSetIterator should have acceptable performance in the sparse case when it uses a SortedVIntList underneath. Could you share some test results for sparse cases as well? I'd expect it to outperform OpenBitSet even at CPU time when it is sparse enough.

        Show
        Paul Elschot added a comment - That sounds like the overhead of the DocIdSetIterator is never more than directly using an OpenBitSet in the dense cases that you tested so far (1 in 6 is more than 1 bit per byte). That means that a DocIdSetIterator should have acceptable performance in the sparse case when it uses a SortedVIntList underneath. Could you share some test results for sparse cases as well? I'd expect it to outperform OpenBitSet even at CPU time when it is sparse enough.
        Hide
        Michael Busch added a comment -

        Good benchmarking, Mark.

        I'm actually wondering about the performance of BooleanFilter, when used
        with a mix of Filters where some use OpenBitSets and others use BitSets.

        This will happen when users upgrade to Lucene 2.4 (core filters use
        OpenBitSets now) and keep using their own custom Filters that use
        BitSets.

        Your patch, Paul, makes this combination possible, and thus guarantees
        backwards-compatibility of BooleanFilter and ChainedFilter, which is
        great! I'm just wondering if those user might encounter bad performance
        surprises?

        Show
        Michael Busch added a comment - Good benchmarking, Mark. I'm actually wondering about the performance of BooleanFilter, when used with a mix of Filters where some use OpenBitSets and others use BitSets. This will happen when users upgrade to Lucene 2.4 (core filters use OpenBitSets now) and keep using their own custom Filters that use BitSets. Your patch, Paul, makes this combination possible, and thus guarantees backwards-compatibility of BooleanFilter and ChainedFilter, which is great! I'm just wondering if those user might encounter bad performance surprises?
        Hide
        Paul Elschot added a comment -

        I would not expect bad performance problems mixing OpenBitSet and BitSet for Filters using this patch, although some performance may be lost. However, if a problem surfaces, the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter.
        A direct implementation of the bit manipulation operations on an OpenBitSet from a BitSet would probably be faster (taking the DocIdSetIterator out of the loop), but at the moment I see no good reason to implement that.

        Show
        Paul Elschot added a comment - I would not expect bad performance problems mixing OpenBitSet and BitSet for Filters using this patch, although some performance may be lost. However, if a problem surfaces, the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter. A direct implementation of the bit manipulation operations on an OpenBitSet from a BitSet would probably be faster (taking the DocIdSetIterator out of the loop), but at the moment I see no good reason to implement that.
        Hide
        Michael Busch added a comment -

        the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter.

        I agree.

        Show
        Michael Busch added a comment - the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter. I agree.
        Hide
        Michael Busch added a comment -

        The patches look good to me, Paul!

        I'm attaching a new file that contains both your patches Contrib20080427.patch
        and javadocsZero2Match.patch. I also added the optimizations for OpenBitSets
        that Mark suggested to BooleanFilter and ChainedFilter.
        And I added the check (disi.doc() < size()) to OpenBitSetDISI.inPlaceOr() and
        OpenBitSetDISI.inPlaceXor().

        All unit tests pass, however I think they don't cover all code paths now. We
        should test both the ChainedFilter and the BooleanFilter on OpenBitSet-Filters
        only as well as on combinations of different filters.

        Paul, maybe you can help me with adding those tests? Then I would go ahead
        and commit this patch soon. Otherwise I probably won't have time before next
        week.

        Show
        Michael Busch added a comment - The patches look good to me, Paul! I'm attaching a new file that contains both your patches Contrib20080427.patch and javadocsZero2Match.patch. I also added the optimizations for OpenBitSets that Mark suggested to BooleanFilter and ChainedFilter. And I added the check (disi.doc() < size()) to OpenBitSetDISI.inPlaceOr() and OpenBitSetDISI.inPlaceXor(). All unit tests pass, however I think they don't cover all code paths now. We should test both the ChainedFilter and the BooleanFilter on OpenBitSet-Filters only as well as on combinations of different filters. Paul, maybe you can help me with adding those tests? Then I would go ahead and commit this patch soon. Otherwise I probably won't have time before next week.
        Hide
        Paul Elschot added a comment -

        With the size tests added in OpenBitSetDISI, the javadocs of the changed methods could also be relaxed.

        The filter tests in the contrib modules misc and queries look fairly complete to me in their current state. Did I overlook anything there? I don't have a coverage test tool here.

        Show
        Paul Elschot added a comment - With the size tests added in OpenBitSetDISI, the javadocs of the changed methods could also be relaxed. The filter tests in the contrib modules misc and queries look fairly complete to me in their current state. Did I overlook anything there? I don't have a coverage test tool here.
        Hide
        Michael Busch added a comment -

        I added a test helper class called OldBitSetFilterWrapper that helps to
        test compatibility with old filters based on BitSets.

        I changed ChainedFilterTest and BooleanFilterTest to run all tests on
        new (OpenBitSet) and old (BitSet) filters to test both different code paths
        that we have for OpenBitSet and DocIdSetIterator.

        I verified with a code coverage tool that now all those paths are covered
        and all tests pass.

        Would be nice if you could quickly review the patch, Paul. If you're ok
        with it then I'll commit it tomorrow.

        Show
        Michael Busch added a comment - I added a test helper class called OldBitSetFilterWrapper that helps to test compatibility with old filters based on BitSets. I changed ChainedFilterTest and BooleanFilterTest to run all tests on new (OpenBitSet) and old (BitSet) filters to test both different code paths that we have for OpenBitSet and DocIdSetIterator. I verified with a code coverage tool that now all those paths are covered and all tests pass. Would be nice if you could quickly review the patch, Paul. If you're ok with it then I'll commit it tomorrow.
        Hide
        Paul Elschot added a comment -

        Thanks for the backward compatibility additions in the tests.

        One nice little detail: the patch contains this in OldBitSetFilterWrapper:

        +      BitSet bits = new BitSet(reader.maxDoc());
        +      DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
        ...
        

        but I expected:

        BitSet bits = filter.bits(reader); // use deprecated method
        

        On old filters both will versions will work, but the alternative makes it explicit that the filter must be an 'old' one.
        Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0.

        Show
        Paul Elschot added a comment - Thanks for the backward compatibility additions in the tests. One nice little detail: the patch contains this in OldBitSetFilterWrapper: + BitSet bits = new BitSet(reader.maxDoc()); + DocIdSetIterator it = filter.getDocIdSet(reader).iterator(); ... but I expected: BitSet bits = filter.bits(reader); // use deprecated method On old filters both will versions will work, but the alternative makes it explicit that the filter must be an 'old' one. Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0.
        Hide
        Michael Busch added a comment -

        Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0.

        Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself.

        Show
        Michael Busch added a comment - Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0. Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself.
        Hide
        Michael Busch added a comment -

        Committed with mentioned changes to OldBitSetFilterWrapper.

        Thanks Paul!

        Show
        Michael Busch added a comment - Committed with mentioned changes to OldBitSetFilterWrapper. Thanks Paul!
        Hide
        Paul Elschot added a comment -

        I missed TermsFilter initially, so I had another look there. It could use sth like this:

           /** Provide a SortedVIntList when it is definitely smaller than an OpenBitSet */
           protected DocIdSet finalResult(OpenBitSetDISI result, int maxDocs) {
               return (result.cardinality() < (maxDocs / 9))
                     ? (DocIdSet) new SortedVIntList(result)
                     : (DocIdSet) result;
           }
        

        But that would leave three copies this finalResult method in the patch, which is just beyond my refactoring tolerance level.
        Perhaps this method could move to a static method in the o.a.l.search.DocIdSet class under a better name, sth like defaultDocIdSet(), or into a new helper class o.a.l.util.DefaultDocIdSet, to prepare for the availability of better implementations in the future.

        Show
        Paul Elschot added a comment - I missed TermsFilter initially, so I had another look there. It could use sth like this: /** Provide a SortedVIntList when it is definitely smaller than an OpenBitSet */ protected DocIdSet finalResult(OpenBitSetDISI result, int maxDocs) { return (result.cardinality() < (maxDocs / 9)) ? (DocIdSet) new SortedVIntList(result) : (DocIdSet) result; } But that would leave three copies this finalResult method in the patch, which is just beyond my refactoring tolerance level. Perhaps this method could move to a static method in the o.a.l.search.DocIdSet class under a better name, sth like defaultDocIdSet(), or into a new helper class o.a.l.util.DefaultDocIdSet, to prepare for the availability of better implementations in the future.
        Hide
        Paul Elschot added a comment -

        And in that case, the first argument could also be changed from OpenBitSetDISI to OpenBitSet.

        Show
        Paul Elschot added a comment - And in that case, the first argument could also be changed from OpenBitSetDISI to OpenBitSet.
        Hide
        Michael Busch added a comment -

        Do we actually know about the performance of SortedVIntList?

        I'm a little worried, because it doesn't have a skip list.

        Show
        Michael Busch added a comment - Do we actually know about the performance of SortedVIntList? I'm a little worried, because it doesn't have a skip list.
        Hide
        Paul Elschot added a comment -

        OpenBitSet does not have a skip list either, so I'd expect SortedVIntList to be faster when the underlying set is sparse enough.

        As I missed the commit, I'll provide a patch for my latest comments in the next few days.

        Show
        Paul Elschot added a comment - OpenBitSet does not have a skip list either, so I'd expect SortedVIntList to be faster when the underlying set is sparse enough. As I missed the commit, I'll provide a patch for my latest comments in the next few days.
        Hide
        Michael Busch added a comment -

        I'll provide a patch for my latest comments in the next few days

        Sounds good!

        Show
        Michael Busch added a comment - I'll provide a patch for my latest comments in the next few days Sounds good!
        Hide
        Paul Elschot added a comment -

        While considering DefaultDocIdSet as a class, I thought that perhaps a better way would be to add a method to class Filter that takes the usual DocIdSet and provides the DocIdSet that should be used for caching, for example in CachingWrapperFilter. Sth like this:

        public class Filter {
          ... the abstract bits() deprecated method ... ;
        
          public DocIdSet getDocIdSet(IndexReader reader) {
            // unchanged implementation for now. to become abstract later.
          }
        
          public DocIdSet getDocIdSetForCache(IndexReader reader) {
            // Use a default implementation here that provides a tradeoff for caching,
            // fairly compact when possible, but still fast.
            // For the moment this could be close to the code of the finalResult() method
            // mentioned above:
            DocIdSet result = getDocIdSet(reader);
           
            if (!(result instanceof SortedVIntList)) 
                        and (result.cardinality() < (reader.maxDoc() / 9))) {
              return  new SortedVIntList(result);
            }
            return result;
          }
        

        (One minor problem with this is that DocIdSet does not have a cardinality() method
        and that SortedVIntList does not have a constructor for a DocIdSet.)

        The question is: how about adding such a getDocIdSetForCache() method to Filter?
        Or is there a better place for this functionality, for example in CachingWrapperFilter?

        Show
        Paul Elschot added a comment - While considering DefaultDocIdSet as a class, I thought that perhaps a better way would be to add a method to class Filter that takes the usual DocIdSet and provides the DocIdSet that should be used for caching, for example in CachingWrapperFilter. Sth like this: public class Filter { ... the abstract bits() deprecated method ... ; public DocIdSet getDocIdSet(IndexReader reader) { // unchanged implementation for now. to become abstract later. } public DocIdSet getDocIdSetForCache(IndexReader reader) { // Use a default implementation here that provides a tradeoff for caching, // fairly compact when possible, but still fast. // For the moment this could be close to the code of the finalResult() method // mentioned above: DocIdSet result = getDocIdSet(reader); if (!(result instanceof SortedVIntList)) and (result.cardinality() < (reader.maxDoc() / 9))) { return new SortedVIntList(result); } return result; } (One minor problem with this is that DocIdSet does not have a cardinality() method and that SortedVIntList does not have a constructor for a DocIdSet.) The question is: how about adding such a getDocIdSetForCache() method to Filter? Or is there a better place for this functionality, for example in CachingWrapperFilter?
        Hide
        Paul Elschot added a comment -

        As this has had some time so settle, I think the cache should decide what it wants to store.
        That means I'm in favour of changing CachingWrapperFilter to let it decide which DocIdSet implementation to cache, sth. like this:

        protected DocIdSet docIdSetToCache(DocIdSet dis) { ... }
        

        where dis is the result of getDocIdSet(reader) on the wrapped Filter.
        At the same time the protected finalResult() methods in the contrib code could be removed.

        Comments?

        Show
        Paul Elschot added a comment - As this has had some time so settle, I think the cache should decide what it wants to store. That means I'm in favour of changing CachingWrapperFilter to let it decide which DocIdSet implementation to cache, sth. like this: protected DocIdSet docIdSetToCache(DocIdSet dis) { ... } where dis is the result of getDocIdSet(reader) on the wrapped Filter. At the same time the protected finalResult() methods in the contrib code could be removed. Comments?
        Hide
        Michael Busch added a comment -

        Sounds good to me Paul.

        Could you open a separate issue and attach a patch?

        Show
        Michael Busch added a comment - Sounds good to me Paul. Could you open a separate issue and attach a patch?
        Hide
        Paul Elschot added a comment -

        After the commit here, I'm opening a new issue for filter caching.

        Show
        Paul Elschot added a comment - After the commit here, I'm opening a new issue for filter caching.
        Hide
        Paul Elschot added a comment -

        Just to be complete, the new issue for filter caching is LUCENE-1296

        Show
        Paul Elschot added a comment - Just to be complete, the new issue for filter caching is LUCENE-1296

          People

          • Assignee:
            Michael Busch
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development