|
Iirc the boolean logic on contrib/queries is defined in two places: ChainedFilter and BooleanFilter. Ideally these could me merged and their functions be implemented by the DocIdSetIterators underlying the current scorers used by BooleanScorer2 (Conjunction/Disjunction/ReqOpt/ReqExcl). See also the comments of Micheal Bush and Eks Dev at the end of Jan 2008 at
That's it for the moment. Sorry for the mess, at least it should save others from reviewing all the comments at Paul, I think there is one CHEKME in DisjunctionSumScorer I have stumbled upon recently when I realized
(token1+ token2+) query works way faster than (token1 token2).setMinimumSholdMatch(2). It is not directly related to the also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer, : // If first-time skip distance is any predictor of // scorer sparseness, then we should always try to skip first on // those scorers. // Keep last scorer in it's last place (it will be the first // to be skipped on), but reverse all of the others so that // they will be skipped on in order of original high skip. int end=(scorers.length-1)-1; for (int i=0; i<(end>>1); i++) { Scorer tmp = scorers[i]; scorers[i] = scorers[end-i]; scorers[end-i] = tmp; } It has not been detected so far as it has only performance implications (I think?), and it sometimes works and sometimes not, depending on number of scorers: to see what I am talking about, try this "simulator": public static void main(String[] args) { int[] scorers = new int[7]; //3 and 7 do not work for (int i=0; i<scorers.length; i++) { scorers[i]=i; } System.out.println(Arrays.toString(scorers)); int end=(scorers.length-1)-1; for (int i=0; i<(end>>1); i++) { int tmp = scorers[i]; scorers[i] = scorers[end-i]; scorers[end-i] = tmp; } System.out.println(Arrays.toString(scorers)); } for 7 you get: instead of [5, 4, 3, 2, 1, 0, 6] and for 3 Eks,
As both issues you mention are not related to filters, could you open a new issue for each of them? Regards,
Yup, I introduced that feature + bug. I just committed a fix that makes the code do what the comments say (no correctness implications, just perhaps minor performance). Attached javadocsZero2Match.patch that replaces the last few occurrences 'zero scoring' in the java docs of org.apache.lucene.search by 'matching'.
Test that now fails with ChainedFilter.
java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet
at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55)
at org.apache.lucene.misc.ChainedFilter.doChain(ChainedFilter.java:258)
at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:193)
at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:156)
at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:141)
The reason apparently is that the core moved from BitSets to OpenBitSets, If we change the contrib packages to also use OpenBitSets, then this is Maybe we should introduce a DocIdSetFactory in the core? For Michael,
I do not think we need to add Factory (for this particular reason), DocIdSet type should not be assumed as we could come up with smart ways to select optimal Filter representation depending on doc-id distribution, size... The only problem we have with is that contrib classes, ChainedFilter and BooleanFilter assume BitSet. Optimized code for these basic operations already exists, can be copied from Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by just simply stripping-off scoring part. with these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter to work with DocIdSet (and that works on all implementations of Fileter/DocIdSet) is 10 minutes job... than, if needed this implementation can be optimized to cover type specific cases. Imo, BoolenFilter is better bet, we do not need both of them. Unfortunately I do not have time to play with it next 3-4 weeks, but should be no more than 2 days work (remember, we have difficult part already done in Scorers). Having so much code duplication is not something really good, but we can then later "merge" these somehow. I started adding getDocIdSet() to BooleanFilter of contrib/queries. When trying to collect the interim results into an OpenBitSet I soon needed OpenBitSet.conjunction(DocIdSetIterator), as well as similar disjunction() and exclusion() methods.
Would it be ok to add such methods to Lucene's OpenBitSet, or would it be preferable to subclass OpenBitSet for this? At first sight I prefer subclassing, but I'd like to hear some opinions on this before going further. The OpenBitSetDISI-20080322.patch illustrates my previous question.
DISI means DocIdSetIterator, for want of a better name. The patch compiles, but is untested. For the record: the and() method in the patch misses a last statement: clear(index, size());
BooleanFilter20080325.patch contains a non BitSet version of BooleanFilter. The contrib/queries tests pass, and it includes a finished version of OpenBitSetDISI of a few days ago.
I've also changed the indentation of all of BooleanFilter and added some minor refactorings. This makes the patch itself somewhat less easy to read, but I couldn't leave the indendation in different styles. All tests pass, except the one for ChainedFilter provided here by Mark Miller. Stay tuned.
The Contrib20080325 also includes Mark Miller's test patch, and a non BitSet version of ChainedFilter. It passes the tests contrib/miscellaneous, no time left to run all tests.
Contrib20080325.patch should supersede all patches currently here, except the javadoc patch. There are no type checks yet to use the inter OpenBitSet boolean operations directly, but at least it should work. Remember to profile before adding such optimisations, for sparse OpenBitSets this could well be competitive. Well, I'd hope so. Contrib20080326.patch: supersedes the 20080325 version. Generally the same as yesterday, some extensions:
Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList. ChainedFilter still has the 1.1 ASL, it's probably time to upgrade it, but I did not change it in the patch. Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.
One thing: I added the max size parameter to the OpenBitSetDISI ctor rather late, so there is probably some room to use more of the fast... bit access methods of OpenBitSet.
Contrib20080427.patch is the same as the previous one from March, except for OpenBitSetDISI: added javadocs there and use fast... bit access methods consistently.
Paul,
Good work. Just tried the patch and ran some pre and post-patch benchmarks. I wanted to measure the overhead of : My test was on the biggest index I have here which was 3 million Wikipedia docs. I had 2 cached TermFilters on very popular terms (500k docs in each) and was measuring the cost of combining these as 2 "shoulds" in a BooleanFilter. The average cost of iterating across BooleanFilter.getDocIdSet() was: old BitSet scheme: 78 milliseconds To address this I tried adding this optimisation into BooleanFilter... DocIdSet dis = ((Filter)shouldFilters.get else { res.inPlaceOr(getDISI(shouldFilters, i, reader)); //your patch code }Before I could benchmark this I had to amend TermsFilter to use OpenBitSet rather than plain old BitSet avg speed of your patch with OpenBitSet-enabled TermFilter : 100 milliseconds I'll try and post a proper patch when I get more time to look at this... Cheers, That sounds like the overhead of the DocIdSetIterator is never more than directly using an OpenBitSet in the dense cases that you tested so far (1 in 6 is more than 1 bit per byte).
That means that a DocIdSetIterator should have acceptable performance in the sparse case when it uses a SortedVIntList underneath. Could you share some test results for sparse cases as well? I'd expect it to outperform OpenBitSet even at CPU time when it is sparse enough. Good benchmarking, Mark.
I'm actually wondering about the performance of BooleanFilter, when used This will happen when users upgrade to Lucene 2.4 (core filters use Your patch, Paul, makes this combination possible, and thus guarantees I would not expect bad performance problems mixing OpenBitSet and BitSet for Filters using this patch, although some performance may be lost. However, if a problem surfaces, the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter.
A direct implementation of the bit manipulation operations on an OpenBitSet from a BitSet would probably be faster (taking the DocIdSetIterator out of the loop), but at the moment I see no good reason to implement that.
I agree. The patches look good to me, Paul!
I'm attaching a new file that contains both your patches Contrib20080427.patch All unit tests pass, however I think they don't cover all code paths now. We Paul, maybe you can help me with adding those tests? Then I would go ahead With the size tests added in OpenBitSetDISI, the javadocs of the changed methods could also be relaxed.
The filter tests in the contrib modules misc and queries look fairly complete to me in their current state. Did I overlook anything there? I don't have a coverage test tool here. I added a test helper class called OldBitSetFilterWrapper that helps to
test compatibility with old filters based on BitSets. I changed ChainedFilterTest and BooleanFilterTest to run all tests on I verified with a code coverage tool that now all those paths are covered Would be nice if you could quickly review the patch, Paul. If you're ok Thanks for the backward compatibility additions in the tests.
One nice little detail: the patch contains this in OldBitSetFilterWrapper: + BitSet bits = new BitSet(reader.maxDoc());
+ DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
...
but I expected: BitSet bits = filter.bits(reader); // use deprecated method
On old filters both will versions will work, but the alternative makes it explicit that the filter must be an 'old' one.
Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself. Committed with mentioned changes to OldBitSetFilterWrapper.
Thanks Paul! I missed TermsFilter initially, so I had another look there. It could use sth like this:
/** Provide a SortedVIntList when it is definitely smaller than an OpenBitSet */ protected DocIdSet finalResult(OpenBitSetDISI result, int maxDocs) { return (result.cardinality() < (maxDocs / 9)) ? (DocIdSet) new SortedVIntList(result) : (DocIdSet) result; } But that would leave three copies this finalResult method in the patch, which is just beyond my refactoring tolerance level. And in that case, the first argument could also be changed from OpenBitSetDISI to OpenBitSet.
Do we actually know about the performance of SortedVIntList?
I'm a little worried, because it doesn't have a skip list. OpenBitSet does not have a skip list either, so I'd expect SortedVIntList to be faster when the underlying set is sparse enough.
As I missed the commit, I'll provide a patch for my latest comments in the next few days.
Sounds good! While considering DefaultDocIdSet as a class, I thought that perhaps a better way would be to add a method to class Filter that takes the usual DocIdSet and provides the DocIdSet that should be used for caching, for example in CachingWrapperFilter. Sth like this:
public class Filter { ... the abstract bits() deprecated method ... ; public DocIdSet getDocIdSet(IndexReader reader) { // unchanged implementation for now. to become abstract later. } public DocIdSet getDocIdSetForCache(IndexReader reader) { // Use a default implementation here that provides a tradeoff for caching, // fairly compact when possible, but still fast. // For the moment this could be close to the code of the finalResult() method // mentioned above: DocIdSet result = getDocIdSet(reader); if (!(result instanceof SortedVIntList)) and (result.cardinality() < (reader.maxDoc() / 9))) { return new SortedVIntList(result); } return result; } (One minor problem with this is that DocIdSet does not have a cardinality() method The question is: how about adding such a getDocIdSetForCache() method to Filter? As this has had some time so settle, I think the cache should decide what it wants to store.
That means I'm in favour of changing CachingWrapperFilter to let it decide which DocIdSet implementation to cache, sth. like this: protected DocIdSet docIdSetToCache(DocIdSet dis) { ... }
where dis is the result of getDocIdSet(reader) on the wrapped Filter. Comments? Sounds good to me Paul.
Could you open a separate issue and attach a patch? After the commit here, I'm opening a new issue for filter caching.
Just to be complete, the new issue for filter caching is
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I'm sorry about that, I'll just continue here, when it gets too messy, another jira issue can easily be opened.