|
thanks, this looks interesting.
Regards, As the title of this issue is as accurate as it gets, I'm attaching a series of patches and additions here that make Scorer a subclass of Matcher, while Matcher takes the current role of the BitSet in Filter.
All patches against trunk revision 417299. javadocs of HitCollector.java to use 'matching' instead of 'non-zero score'.
This is actually independent of the Matcher/Scorer change. javadocs of Searcher.java to use 'matching' instead of 'non-zero score',
and to describe the Filter effect more accurately. This is independent of the Matcher/Scorer change. MatchCollector.java with collect(int) method for org.apache.lucene.search.
Matcher.java, including a match(MatchCollector) method, for org.apache.lucene.search.
patch to Scorer.java to subclass Matcher.
patch to Filter to add getMatcher() and to deprecate getBits() in favour of getMatcher().
Includes commented test code to test IndexSearcher using BitsMatcher. Patch to IndexSearcher.java to prefer getMatcher() over getBits() on Filter.
Also add method IndexSearcher.match(Query, MatchCollector). A Matcher constructed from a BitSet for org.apache.lucene.util.
SortedVIntList.java for org.apache.lucene.util superseding the one in
TestSortedVIntList.java, superseding the one in
I hope I got all the attachments right, please holler in case something does not patch or compile cleanly.
Some questions/remarks:
Regards, I've started to improve the javadocs of almost all code posted here, so it's probably not worthwhile to commit this as it is now.
I don't expect changes to the java code in the short term. Regards, Patches against trunk revision 417683, current.
Compared to previous patches/files, there are only javadoc updates, and the javadocs of Searchable are also patched. Added some javadocs to BitsMatcher.
Added Matcher constructor to SortedVIntList, and extended the test for this. Perhaps the BitsMatcher class is better implemented as a static method in a new class MatcherUtils.createMatcher(BitSet).
Similar methods could be added for OpenBitSet, SortedVIntList, int[] and whatever data structure comes around for implementing Filter.getMatcher(IndexReader). When Matcher is a superclass of Scorer, TermScorer already implements a Matcher for TermDocs. As requested on java-dev, Matcher20060830.patch is the whole thing as a single patch, relative to srv/java/org/apache/lucene in the trunk, revision 438598 of 30 August 2006.
This does not contain the FIXME the earlier posted Filter-20060628.patch . Regards, Thanks Paul,
I like the Matcher/Scorer relation. It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right? When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs to be removed, or a check done in IndexSearcher.search() to disable skipping on the scorer if it's in use. I wonder what the performance impact is... for a dense search with a dense bitset filter, it looks like quite a bit of overhead is added (two calls in order to get the next doc, use of nextSetBit() instead of get(), checking "exhausted" each time and checking for -1 to set exhausted). I suppose one can always drop back to using a HitCollector for special cases though. Matcher20060830b.patch corrects 2 mistakes in Matcher20060830.patch:
Searchable.java was present twice, and TestSortedVIntList was not present. Thanks eks for pointing out the mistake. This patch was generated from the trunk directory. Regards, Yonik, as to you questions:
> It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right? Right. Only the previous Filter-20060628.patch contains some commented FIXME code to actually introduce a BitsMatcher in each case where a BitSet is used. >When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs Iirc the patch still supports the 1.4 BooleanScorer when a BitSet is returned by Filter. I'd have to have a look at the patched IndexSearcher to be sure though. > I wonder what the performance impact is... for a dense search with a dense bitset BitsMatcher could also work without the "exhausted" flag, but then an infinite loop Regards, using the latest Matcher20060830.patch
ant said "BUILD SUCCESSFUL" I will see how it works on some real life cases using our 50Mio collection and report back what our standard app level tests have to say (we have standardized collection /requsts/hits so bad things should pop -up quickly). Need a day or two for this. thanks for this work. Hi Paul,
for me, this patch did not cause any incompatibility issues. All our tests passed without noticing any difference to the previous trunk version. No performance changes as well ( we use HitCollector only, so Yoniks comment does not apply here). Tests are application level, and make index hot (6hrs searches with test batch of requests with known responses), 50Mio not artificial docs, real requests... Early this week we will try to implement our first Matchers and see how they behave > No performance changes as well.
It's good to hear that. As mentioned earlier, this is groundwork only. Which comment of Yonik related to HitCollector do you mean? > Early this week we will try to implement our first Matchers and see how they behave BitsMatcher and SortedVIntList could start that. Paul,
What is exact semantics of skipTo(int) in Matcher?
Is this intended behavior, "skip forward until you reach end, and then, you are at the end Here are some Matcher implementations,
-DenseOpenBitsMatcher - Using solr BitSetIterator (for skipTo() to work, one method in BitSetIterator should become public) Also attached one simple test (just basic fuctionality) that also contains one dummy relative performance test Perf. test simply iterates over different Matcher implementations and measures ellapsed time (not including Matcher creation, pure forward scan to the end) for different set bit densities. imho, this code is not sufficiantly tested nor commented, needs an hour or two. As expected, Yonik made this ButSetIterator really fast. What was surprise for me was OpenBitSet nextSetBit() comparing bad to the BitSet (or I made some dummy mistake somewhere?) Paul,
What is next now, we did on our app enough experiments and are now sure that this patch causes no incompatibilities. We also tried to replace our filters with OpenBitSet and VInt matchers and results there are more than good, our app showed crazy 30% speed-up!!! Hard to identify where from exactly, but I suspect VInt matcher in case of not too dense BitVectors increased our Filter Cache utilization significantly. I would propose to commit this patch before we go further with something that would actually utilize Matcher. Just to avoid creating monster patch on patch ... This is ground work, and now using Matcher will be pure poetry, I see a lot of places we could see beter life by using use Matchers, ConstantScoringQuery, PreffixFilter, ChainedFilter (becomes obsolete now)... actually replace all uses of BitSet with OpenBitSet (or a bit smarter with SortedIntList. VInt...)... Do you, or someone else see now things to be done before commiting this? > Do you, or someone else see now things to be done before commiting this?
Yes. In the steps listed here: Paul Elschot. In the inheritance from Matcher to Scorer there is an asymmetry
in this patch. Matcher provides a default implementation for Matcher.explain() With matcher implementations quite a few other implementation Regards, I wrote:
> One could add an abstract Scorer.explain() to catch these, or by mistake. The good news is that the patch leaves the Corrected javadoc refs to use IndexInput and IndexOutput.
I have just resolved some minor local conflicts on the updated copyrights of four java files.
Please holler when a fresh patch is needed. As 2.1 is out, here is a new patch to try and revive this.
This replaces the pevious Matcheryyyymmdd.patch one of 2006 It's been a while since i looked at this issue, but it's come up in discussion recently so i took another glance...
Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses it to decide whether or not to iterator over the (non null) Matcher, or over the BitSet from Filter.bits. is there any reason that logic can't be put in getMatcher, so that if subclasses of Filter don't override the getMatcher method it will call bits and then return a Matcher that iterates over the set Bits? (this is the roll-out approach i advocated a while back when discussing this on email, excecept that at the time Matcher was refered to as SearchFilter: http://www.nabble.com/RE%3A-Filter-p2605271.html Thinking about it now, we could even change Filter.bits so it's no longer abstract ... it could have an implementation that would call getMatcher, and iterate over all of the matched docs setting bits on a BitSet that is then returned ... the class would still be abstract, and the class javadocs would make it clear that subclasses must override at least one of the methods ... legacy Filters will work fine because they'll already have a bits method, and people writing new Filters will see that bits is deprecated, so they'll just write a getMatcher method and be done. This appears to be the same approach taken with Analyzer.tokenStream back in 1.4.3... http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Analyzer.html Hoss,
>Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses Two reasons:
There is also > we could even change Filter.bits so it's no longer abstract ... it could have I must say that creating a BitSet from a Matcher never occurred to me. Paul:
Applied the patch, applied cleanly, run ant test -> BUILD SUCCESSFUL I'm primarily interested in using this in order to get matches, but avoid scoring. From what I can tell, I'd just need to switch to using the new match(Query, MatchCollector) method in IndexSearcher. However, I need Sort and TopFieldDocs, and I don't see a match method with those. Is there a reason why such a match method is not in the patch? Otis:
> However, I need Sort and TopFieldDocs, and I don't see a match method with those. A bit silly perhaps, but what sort criterion would like to have used when no score() value is available? I don't know the sorting code, but it might be possible to use a field value for sorting. > BitsMatcher could also work without the "exhausted" flag, but then an infinite loop
> might occur when trying to continue after the first time next() or skipTo() returned false. > Continuing after false was returned in these cases is a bug, however an infinite loop > can be difficult to debug. I'd rather be on the safe side of that with the exhausted flag and wait for an actual > profile to show the performance problem. We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed. > We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed.
I fully agree. The "exhausted" flag is not much more than a matter of taste. In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my working copy. > In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my
> working copy. But I don't know yet how I would resolve that conflict Ah, that's a bit different. You mean there are cases that are non-trivial to fix where next() is called after false is returned? Perhaps I did something wrong with the benchmark, but I didn't get any speed-up when using searcher.match(Query, MatchCollector) vs. searcher.search(Query, HitCollector).
Here are the benchmark numbers (50000 queries with each), HitCollector first, MatchCollector second: HITCOLLECTOR: [java] ------------> Report Sum By (any) Name (11 about 41 out of 41) [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41) MATCHCOLLECTOR: [java] ------------> Report Sum By (any) Name (11 about 41 out of 41) [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41) This is what I did for the benchmark. I used Doron's handy conf/benchmark. $ diff conf/micro-standard.alg conf/matcher-micro-standard.alg Then I added 2 new Tasks for benchamrking the Matcher (searcher.search(Query, MatchCollector)) and modified the ReadTask to call searcher.search(Query, HitCollector) instead of the method to get Hits. I commented out all search results traversal and doc retrieval, as I didn't care to measure that. The benchmark does not search with filters. Is any speedup still expected? (why?)
I applied the patch on current trunk and ran the benchmark - it shows that when all queries use the same reader, Match is faster while when each query opens its own reader bitset is faster. Is this an expected result? Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem SrchMtchSamRdr_5000 - 10 5000 642.2 77.85 12,331,866 16,408,576 SrchBitsSamRdr_5000 - - - - 10 - - - 5000 - - 586.9 - - 85.20 - 9,515,875 - 12,009,472 SrchMtchNewRdr_500 - 10 500 134.7 37.11 13,376,113 17,171,660 This test is using all Reuters documents and the searches rounds are repeated 10 times. The Match tasks were not included so I wrote them. The updated bench-diff.txt attached contains these task classes and the algorithm. (When you use this, note that once the index is created you can comment the first part - the "Populate" part - and then only rerun the querying part.) One line was cut out - here are the four lines again
Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem Doron, thanks for jumping on this!
1. I thought I'd see better performance with the Matcher because it skips scoring. While Paul's patch does make changes to the Filtering code, I'm more focused on HitCollector vs. MatchCollector performance here. Am I missing something here? If scoring is skipped, we should see at least some speed improvement, and your results show that. 2. You said you did see MatchCollector was faster than HitCollector. Hmmm, weird, not in my 4 runs: Matcher: I'll try it again on a different computer. My previous runs were on a Mac with OSX. 3. My bench-diff.txt did include Match tasks: $ grep Match bench-diff.txt | grep class ... but I didn't svn add them, so I produced the "diff" by simply cat-ing the new tasks to bench-diff.txt . So if you used my bench-diff.txt as a patch, it wouldn't have worked. Not a big deal, just clarifying. ...right, your diff-txt had the Match tasks - I missed that - checked it, it is exactly what I did, so we're ok here.
When you rerun, you may want to use my alg - to compare the two approaches in one run. You can run this by something like: Also, to get cleaner results, add the line: I tried like this twice, and got inconsistent results: When the bitset searches preceded the match searches: When the match searches preceded the bitset searches: My conclusion from this is that the speed-up, if exists, is minor, at least for the setup of this test. There are only 15 unique queries in this test - also printed in the log - are these the queries you would expect to save in? I didn't follow this issue very closely so I don't know where the saving is expected here. Both SearchTask and MatchTask now do nothing in collect, so no difference at the actual collect() call. Also, Scorer.score(HitCollector) and Matcher.match(MatchCollector) are very similar: I think there is a potential gain for large boolean OR queries, because score() would have to call next() on all TermScorers and collect/sum their scores, while match() could use skipTo(last+1) because any match encountered is a match and there is no need to sum the individual scores for the same doc by other scorers. However as far as I can tell, current match() implementation does not take advantage of this, but I may be overlooking something? > When you rerun, you may want to use my alg - to compare the two approaches in one run.
This is more dangerous though. GC from one method's garbage can penalize the 2nd methods performance. The hotspot effect can be minimized by running the test multiple times in the same JVM instance and discarding the first runs, but it's not so easy for GC. Instead of discarding the first run, the approach I usually take is to run 3-4 times and pick the minimum. You can then run several of these "sets" and average over the minimum of each. GC is still an issues, though. It is hard to get around when it is a mark&sweep collector (reference counting is much friendlier in this regard)
> > When you rerun, you may want to use my alg - to compare the two approaches in one run.
> This is more dangerous though. Agree. I was trying to get rid of this by splitting each round to 3: - gc(), warm(), work() - when work() and warm() are the same, just that warm()'s stats are disregarded. Still switching the order of "by match" and "by bits" yield different results. Sometimes we would like not to disregard GC - in particular if one approach is creating more (or more complex) garbage than another approach. Perhaps we should look at two measures: best & avg/sum (2nd ignoring first run, for hotspot). Doron: just to address your question from Apr/7 - I expect/hope to see an improvement in performance because of this difference:
hc.collect(doc(), score()); the delta being the cost of the score() call that does the scoring. If I understand things correctly, that means that what grant described at the bottom of http://lucene.apache.org/java/docs/scoring.html > No Scorer, no BooleanScorer(2), no ConjunctionScorer...
Thanks, I was reading "score" instead of "score()"... But there is a scorer in the process, it is used for next()-ing to matched docs. So most of the work - preparing to be able to compute the scores - was done already. The scorer doc queue is created and populated. Not calling score() is saving the (final) looping on the scorers for aggregating their scores, multiplying by coord factor, etc. I assume this is why only a small speed up is seen. Ahhhh. I'll look at the patch again tomorrow and follow what you said. All this time I was under the impression that one of the points or at least side-effects of the Matcher was that scoring was skipped, which would be perfect where matches are ordered by anything other than relevance.
DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the matching scorers in the ScorerDocQueue during next(), in order to accumulate an aggregate score. The MatchCollector can't save you from that.
That could be improved in a DisjunctionMatcher.
With a bit of bookkeeping DisjunctionSumScorer could also delay calling score() on the subscorers but the bookkeeping would affect performance for the normal case. For the usual queries the score() call will never have much of a performance impact. To determine which documents match the query, the index need to be accessed, Ah, too bad.
Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel boxes, Intel boxes with Linux on them actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and various floating point calculations, we'd see better performance, esp. on Niagara boxes. Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with. By fastest cache I meant the L1 cache of the processor. The size is normally in tens of kilobytes.
An array lookup hitting that cache takes about as much time as a floating point addition. During a query search the use of a.o. the term frequencies, the proximity data, and the document weights normally cause an L1 cache miss. I would expect that by not doing the score value computations, only the cache misses for document weights can be saved. I'm a little behind on following this issue, but if i can attempt to sum up the recent discussion about performance...
"Migrating towards a "Matcher" API may allow some types of Queries to be faster in situations where clients can use a MatchCollector instead of a HitCollector, but this won't be a silver bullet performance win for all Query classes – just those where some of the score calculations is (or can be) isolated to the score method (as opposed to skipTO or next)" I think it's important to remember the motivation of this issue wasn't to improve the speed performance of non-scoring searchers, it was to decouple the concept of "Filtering" results away from needing to populate a (potentially large) BitSet when the logic neccessary for Filtering can easily be expressed in terms of a doc iterator (aka: a Matcher) – opening up the possibility of memory performance improvements. A second benefit that has arisen as the issue evolved, has been the API generalization of the "Matcher" concept to be a super class of Scorer for simpler APIs moving forward. Right. I was under the wrong impression that the Matcher also happens to avoid scoring. However, now that we've all looked at this patch (still applies cleanly and unit tests all pass), and nobody had any criticisms, I think we should commit it, say this Friday.
As I'm in the performance squeezing mode, I'll go look at I'm still behind on following this issue, but Otis: if you are interested in moving forward with this, you might consider trying the cahnges i proposed in my "15/Mar/07 11:06 AM" Comment...
https://issues.apache.org/jira/browse/LUCENE-584#action_12481263 ...I think it would keep IndexSearcher a little cleaner, and make it easier for people to migrate existing Filter's gradually (without requiring extra work for people writing new "Matcher" style Filters from scratch) With 2.2 out, and
The patch still applies cleanly, and I'd like to start working on a skipping extension of SortedVIntList, much like the latest index format for document lists. This DefaultMatcher2007072.patch adds a default Matcher to be used in Filter instead of the BitSet . It contains static methods that create a default Matcher from a BitSet and from an OpenBitSet. The patch also add OpenBitSet to org.apache.lucene.util; it was taken from a recent solr revision.
In this way the deprecation of Filter.bits(IndexReader) can be done by replacing implementations of that method by Filter.getMatcher(IndexReader) and adding the above default Matcher in the return statement: The idea is to have this hook available so that a sensible default Matcher is easily available, that can also be adapted to use better Matcher implementations when these become available. I have begun introducing the default matcher in my working copy of the core, but as expected, that turns out to be quite a bit of work. There is some code in contrib where a Filter is assumed to have BitSet available:
contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java When Filter is going to move from BitSet to Matcher, these will have to be reimplemented. The whole thing in three patches:
The Matcher-ground patch is the groundwork, which should be very similar to the earlier groundwork patch. The Matcher-default patch provides a default implementation, the same as the one I posted earlier today. Among others, It includes OpenBitSet from solr in org.apache.lucene.util . The Matcher-core uses the default implementation inside the rest of the lucene core and test code. It replaces Filter.bits() methods with Filter.getMatcher() methods in the subclasses of Filter. All core tests pass with these patches aplied., except the one below. [junit] Testsuite: org.apache.lucene.search.TestRemoteCachingWrapperFilter Finally, contrib may not even compile with the patches applied. Hi Paul,
Not sure if I'm missing something but I think this patch may not work for scenarios other than the simple option of a single filter being used on a search. A Matcher does not have the same utility as a BitSet because using a BitSet you can: 1) iterate across it using multiple threads. I think these differences become important in the following scenarios : In CachingWrapperFilter I don't think you can cache Matchers instead of bitsets - because Matchers don't have features 1 and 4 BooleanFilter and ChainedFilter in contrib don't work with Matchers because there is no support for 3) Is there something obvious I've missed? Cheers Have a look at BitSetMatcher in the -default patch. It is constructed from a BitSet, and it has a method getMatcher() that returns a Matcher that acts as a searching iterator over the BitSet.
So that is 1) to 4), at least potentially. A clone() method is currently not implemented iirc, but each call to getMatcher() will return a new iterator over the underlying BitSet. And when guaranteed non modifyability is needed, a constructor can take a copy of the given document set, in whatever form. The point of Matcher is that it allows other implementations than BitSet, like OpenBitSet and SortedVIntList. Both have the properties that you are looking for. SortedVIntList can I'd like to have a skip list version of SortedVIntList, too. This would be slightly larger than SortedVIntList, but more efficient on skipTo(). But the first thing that is necessary is to have Filter independent from BitSet. The real pain with that is going to be the code that currently implements Filters The default implementation will probably need to be improved from its current Cheers, I forgot to mention that boolean logic on Matchers is already in present in BooleanScorer2.
This is because each Scorer is a Matcher. Thanks for the reply, Paul.
I saw BitSetMatcher etc and appreciate the motivation behind the design for alternative implementations . What concerns me with the Matcher API in general is that Matchers have non-threadsafe safe state (i.e. the current position required to support next() )and as such aren't safely cachable in the same way as BitSets. I see the searcher code uses the safer skipTo() rather than next() but there's still the "if(exhausted)" thread safety problem to worry about which is why I raised points 1 and 4. Additionally, combining Bitsets using Booolean logic is one method call whereas combining heterogenous Matchers using Boolean logic requires iteration across them and therefore potentially many method calls (point 3). I haven't benchmarked this but I imagine it to be significantly slower? Cheers Mark,
The exhausted flag is only in the iterator/Matcher, not in the underlying set data structure. One can use as many iterators as necessary, for example one per thread, and then there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.) You wrote: I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations. That can still be done, but one needs to get to the BitSets for example by caching them outside the Filters and constructing the resulting BitSetMatcher for the combined Filter on the fly. An alternative would be to have a BooleanQuery.add(Matcher, Occur), where the occurrence can only be required or prohibited. Then there is no need to construct any resulting filter because the boolean logic will be executed during the search. This might even be more efficient than combining the full BitSets ahead of the search. And with many large BitSets cache memory savings from more compact implementations can also be helpful. Mark,
An easy way to keep things like BooleanFilter working could be to That would only involve some deprecation warnings in BitsFilter for the period I would not even mind cooking this up as patch to contrib. Thoughts? A different take in the patches of 20070730.
In this version class Filter has only one method: Class BitSetFilter is added as a subclass of Filter, and it has the familiar In the ..core.. patch, and in the ..contrib.. patches (to follow), most uses of Filter simply replaced by BitSetFilter. This turned out to be an easy way of dealing This change to Filter and its replacement by BitSetFilter could well be taking The ..default.. patch contains a default implementation of a Matcher from a BitSet, and it has OpenBitSet and friends from solr, as well as SortedVIntList as posted earlier. Some 20070730 patches to contrib using BitSetFilter.
The contrib-misc and contrib-queries patches are reasonbly good, their tests pass and replacing Filter by BitSetFilter is right for them. However, I'm not happy with the contrib-xml patch to the xml-query parser. Uploading the patches again, this time with the ASF license.
Some more remarks on the 20070730 patches.
To recap, this introduces Matcher as a superclass of Scorer to take the role that BitSet currently has in Filter. The total number of java files changed/added by these patches is 47, so some extra care will be needed. What approach should be taken for the API change to Filter (see above, 2 comments up)? I'd like to get all test cases to pass again. TestRemoteCachingWrapperFilter still does not pass, and For xml-query-parser in contrib I'd like to know in which direction to proceed (see 1 comment up). The ..default.. patch has taken OpenBitSet and friends from solr to have a default implementation. Once these issues have been resolved far enough, I would recommend to introduce this shortly after a release so there is some time to let things settle. Hi Paul,
Many thanks for your responses. Sorry for the delay in communications - just got back from 2 weeks holiday and slowly picking my way through this patch. You said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)" Did you mean BitSetFilter.getMatcher()? BitSetMatcher has no getMatcher method. If so, doesn't my original thread safety issue still stand? - CachingWrapperFilter is caching Matchers (not Filters which are factories for matchers). The existing approach of adding a <CachedFilter> tag around my XML-based query templates offers a major speed up in my applications and I don't see this supported in this patch currently which gives me some concern. This existing caching technique is based on the use of CachingWrapperFilter. The proposed framework seems to be missing a means of caching reusable, threadsafe Matchers in a type-independent fashion. One solution (which I think you may be suggesting with the "getMatcher" comment) is to cache Filter objects and use Filter.getMatcher(reader) as a factory method for thread-specific, single-use Matchers but this would suggest that any caching then becomes an implied responsibility/overhead of each Filter implementation. Not too great. CachingWrapperFilter is an example of a better design where the caching policy has been implemented in a single class and it can be used to decorate any Filter implementation (RangeFilter etc) with the required caching behaviour. Unfortunately with this proposed patch there is no way that any such single caching policy can work with any Filter because Matcher is not reusable/cachable. Time to remove any thread-specific state from Matcher? Cheers Some further thought on the roles/responsibilities of the various components:
Given a blank sheet of paper (a luxury we may not have) the minimum requirements I would have could be met with the following: interface DocIdSetFactory interface DocIdSet interface DocIdSetIterator In the existing proposal it feels like DocIdSet and DocIdSetIterator are rolled into one in the form of the Matcher which complicates/prevents caching strategies. Cheers Mark,
I said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)" A Matcher is intended to be used in a single thread, so I don't expect thread safety problems. The problem for the XML parser is that with this patch, the implementing data structure of a Filter becomes There is indeed no caching of filters in this patch. As for DocIdSet, as long as this provides a Matcher as an iterator, it can be used to I don't think this patch complicates the implementation of caching strategies. I did consider defining Matcher as an interface, but I preferred not to do that because Hi Paul,
Not sure we've reached a common understanding here yet. You said "That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two. " I too made a mistake in my earlier comments. DocIdSetIterator does NOT have "probably one implementation". There would be an implementation for each different type of DocIdSet (Bitset/OpenBitSet/VIntList). You said "some Filters do not need a cache. For example: TermFilter". I'm not sure why that has been singled out as not worthy of caching. I have certain terms (e.g. gender:male) where the TermDocs is very large (50% of all docs in the index!) so multiple calls to TermDocs for term "gender:male" (if that is what you are suggesting) is highly undesirable. These are typically handled in the XMLQueryParser using syntax like this: You said: "CachingWrapperFilter could then become a cache for BitSetFilter. " If this is undesirable (my guess is "yes") then the proposal in my previous comment is a solution which allows for caching of any/all types of the new sets (openBitSet,BitSet,SortedVIntList etc) Regardless of my choice of class names or decisions over interfaces vs abstract classes do you not at least agree the need for 3 types of functionality: 1) A factory for instantiating sets of document ids matching a particular set of criteria (which can be costly to call). While the factory is not expected to implement a caching strategy it is expected to implement hashcode/equals simply to aid any caching services which would need this help to identify previously instantiated sets which share the same criteria as ant new requests (This service I identified as my "DocIdSetFactory" and TermsFilter/RangeFilter would be example implementations). Each type of functionality can have different implementations so the functionality must be defined using an interface or abstract class. Cheers Mark,
I think we are one the same line, it's just that I don't want to go that far now. And that also makes it a starting point for caching of different data structures for Filters. The DefaultMatcher class tries to do some compression by using a SortedVIntList when that is smaller than a BitSet, and that is about as far as I'd like to go now. Proost, OK, I appreciate caching may not be a top priority in this proposal but I have live systems in production using XMLQueryParser and which use the existing core facilities for caching. As it stands this proposal breaks this functionality (see "FIXME" in contrib's CachedFilterBuilder and my concerns over use of unthreadsafe Matcher in the core class CachingWrapperFilter)
I am obviously concerned by this and keen to help shape a solution which preserves the existing capabilities while adding your new functionality. I'm not sure I share your view that support for caching can be treated as a separate issue to be dealt with at a later date. There are a larger number of changes proposed in this patch and if the design does not at least consider future caching issues now, I suspect much will have to be reworked later. The change I can envisage most clearly is expressed in my concern that the DocIdSet and DocIdSetIterator services I outlined are being combined in Matcher as it stands now and these functions will have to be separated. Cheers I, unfortunately, haven't had the time to read through everything in the latest patches, but catching up on my jira mail one of Paul's comments jumped out at me, so i wanted to make sure it's completley clear: this latest set of patches completely breaks backwards compatibility for any clients who have Filter subclasses, or methods that take a Filter as a param, since the Filter class now has an abstract getMatcher method and no longer supports an abstract BitSet method – presumably the expectation being that all client code should have a search/replace done from Filter=>BitSetFilter
which begs the question: why not eliminate BitSetFilter and move it's getMatcher impl to the Filter class? (if the concern is just that there be a "higher level" class in which both methods are abstract, why not insert a parent with some new name above the Filter class?) For the record: it really bothers me that the old attachments got deleted ... the inability to refresh my memory by looking at the older patches and compare them with the current patches is extremely frustrating This set of patches indeed break backward compatibility with the current Filter class.
That was done to show the ideal end situation, and to make sure that the patched code is indeed there. To get backward compatibility I'd prefer to temporarily copy the functionality from BitSetFilter into the Filter class, while still leaving BitSetFilter as it is: public class Filter { /** this method will become abstract once the bits() method is removed from Filter: */ The main difference between the current set of patches and the removed patches is indicated Another way to decouple from BitSet would be to keep introduce a new superclass of Filter that only has an abstract getMatcher() method, and to add an implementation of that method in the current Filter class.
That would boil down to the current patch with two classes renamed: Filter -> new class with abstract getMatcher() method. BitSetFilter -> Filter. This would avoid all backward compatibility issues, except for the unlikely case in which a getMatcher() method is already implemented in an existing subclass of Filter. For the lucene core and some of the contrib, this would mean that it would move to this new superclass of Filter. Again, I don't expect backward compatibility issues there. Does anyone see any problems with this approach? This time (20070905), as indicated in the previous post, a set of patches that add MatchFilter as the new superclass of Filter. Backward compatibility is quite good, no changes at all are necessary in contrib.
In the 1ground patch, the current core API is moved from Filter to MatchFilter. Since Filter is a subclass of MatchFilter, I do not expect backward compatibility issues with this, but it is a quite extensive API change. In the 2default patch, some support for MatchFilter caching was added in classes DefaultMatcher and MatcherProvider. OpenBitSet and some support for that was added from solr here, but OpenBitSet is not used (yet). SortedVIntList is also added, and this is used for caching in CachingWrapperFilter as below. In the 3core patch, this caching support is used in CachingWrapperFilter. See also java-dev of yesterday and today for a fixed thread safety problem there. I also added the Apache Licence to all new files. All tests pass with the patches applied, core and contrib. These 3 patches modify 35 source code files, so please tread carefully. They were generated against revision 573048. I will remove my previous patch set in a week or so. I don't think
public interface AbstractBitSet is according to standards. It should just be public interface BitSet possibly public interface IBitSet if coming from the Windows world. Since it is in a different package, there is no collision with the standard BitSet class. The posted patch proposes to use this class to determine which documents should be filtered:
public abstract class Matcher { This class is then used as a superclass of org.apache.lucene.search.Scorer. As the current patch set is large, I've been pondering how to do this in a series of smaller patches that can each be applied by itself. This is possible in the following way:
1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently. Step 1 can be reasonably done before a new a release. After that three (almost) independent paths can be followed: My question is: shall I go ahead and provide a patch for step 1? At the moment I'm refining BooleanScorer2. to use Matcher. This is for the case of multiple prohibited clauses, and also to allow the use of required and prohibited Matchers to allow adding filtering clauses to BooleanQuery. This Matcher-20071008-1ground.patch replaces the previous version because in between there was a conclict with the javadocs of Scorer for document ordering.
In today's version, Scorer is unchanged, except for the superclass Matcher, and Matcher reuses the javadocs of Scorer as much as possible. Attached once more, this time with licence granted to ASF.
Resolved a local conflict in the javadocs of HitCollector.
Resolved a local conflict in the javadocs of HitCollector. This time with licence granted to ASF.
Paul, I like the iterative plan you suggested. I started reviewing the
The patch is backwards compatible, except for current subclasses of Filter already have a getMatcher method. The fact that no changes are needed to contrib confirms the compatibility.
I have made no performance tests on BitSetMatcher for two reasons. DefaultMatcher should be in the ...2default... patch. So far, I never tried to use these patches on their own, I have only split them for a better overview. Splitting the combined patches to iterate would need a different split, as you found out. It might even be necessary to split within a single class, but I'll gladly do that. OK, here's a patch that compiles cleanly on current trunk and all tests
pass. It includes:
Would you be up for providing testcases? As I said I haven't fully reviewed the patch, but I'm planning to do With the full patch applied, the following test cases use a BitSetMatcher:
TestQueryParser so I don't think it is necessary to provide seperate test cases. Yes you're right, I ran the tests w/ code coverage analysis enabled, and the
BitSetMatcher is fully covered. Good!
I think that custom Searcher or Searchable implementations won't compile anymore? @@ -86,13 +86,14 @@
* <p>Called by {@link Hits}.
*
* <p>Applications should usually call {@link Searcher#search(Query)} or
- * {@link Searcher#search(Query,Filter)} instead.
+ * {@link Searcher#search(Query,MatchFilter)} instead.
* @throws BooleanQuery.TooManyClauses
*/
- TopDocs search(Weight weight, Filter filter, int n) throws IOException;
+ TopDocs search(Weight weight, MatchFilter filter, int n) throws IOException;
I tried implementing a Searchable, and indeed ran into compilation errors.
So, backward compatibility is indeed not complete. Also, Searchable is an interface, so it should not be changed. There are two ways out of this: Do a name change on MatcherFilter/Filter -> Filter/BitSetFilter. I'd prefer to add an interface (or abstract class?) like Searchable that uses MatchFilter for those implementers that want to take advantage of MatchFilter. Why do we actually need the new MatchFilter class at all?
Filter is an abstract class, not an interface. So I think we could simply add the new method getMatcher() like you already did in your patch: /** * @return A Matcher constructed from the provided BitSet. * @see DefaultMatcher#defaultMatcher(BitSet) */ public Matcher getMatcher(IndexReader reader) throws IOException { return new BitSetMatcher(bits(reader)); } This shouldn't break existing Filter implementations? For example, OpenBitSetFilter would like this:
class OpenBitSetFilter /* ... */ {
OpenBitSet bits(reader) { ... }
Matcher getMatcher(reader) { ... }
}
Since the only thing needed by an IndexSearcher would be the Matcher, What about adding the getMatcher() method to Filter and
deprecating bits(IndexReader)? Then when we release 3.0 we can remove bits() and the only method in Filter will be getMatcher(). Then this patch should be backwards compatible I had not thought about deprecating yet, but it should work nicely.
I suppose you want to add a class BitSetFilter (subclass of Filter) as the preferred alternative to the deprecated method? Initially Filter and BitSetFilter would be very similar, except that Filter.bits() would be deprecated. Later, after removal of Filter.bits(), Filter.getMatcher() would be declared abstract. I tried to do something pretty close to this for contrib/xml-query-parser, but I could not make that work, OK, here's a new version of the patch.
Changes:
All testcases pass and believe this should be fully backwards compatible. In Lucene 3.0 we should make the following changes:
A few remarks on the lucene-584-take2 patch:
In the @deprecated javadoc at Filter.bits() a reference to BitSetFilter could be added. While Filter.bits() is still deprecated, one could also use the BitSet in IndexSearcher A few complete (test) classes are deprecated, it might be good to add the target release For the rest this patch looks good to me. Did you also run ant test-contrib ? To go back to post #1 on this topic:
"Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint." Given the motivation to move to more memory efficient structures why is the only attempt at caching dedicated exclusively to caching the very structures we were trying to move away from?..... "I deprecated also CachingWrapperFilter and RemoteCachingWrapperFilter and added corresponding CachingBitSetFilter and RemoteCachingBitSetFilter" Does this suggest we are to have type-specific CachingXxxxxFilters and RemoteCachingXxxxxFilters created for every new filter type? Why not provide a single caching mechanism that works for all those other, new, more memory-efficient structures? I beleive the reason this hasn't been done is due to the issue I highlighted earlier - the cachable artefacts (what I chose to call "DocIdSet" here: action_12518642 ) are not modelled in a way which promotes re-use. That's why we would end up needing a specialised caching implementations for each type. If we are to move forward from the existing Lucene implementation it's important to note the change:
...looks like an Anti-pattern to me. Worse, this ties the choice of datastructure to the type of Filter that produces it. Why can't my RangeFilter be free to create a SortedVIntList or a BitSet depending on the sparseness of matches for a particular set of criteria? I'm not saying "lets just stick with Bitsets", just consider caching more in the design. Post action_12518642 lays out how this could be modelled with the introduction of DocIdSet and DocIdSetIterator as separate responsibilities (whereas Matcher currently combines them both). Cheers Mark, in the latest Matcher....-2default.patch there is the org.apache.lucene.MatcherProvider interface with this javadoc:
/** To be used in a cache to implement caching for a MatchFilter. */ This interface has only one method: public Matcher getMatcher(); There is also a cache for filters in the Matcher....3core.patch in the class CachingWrapperFilter . Would those be a good starting point? I'm getting lost as to which patches we're considering here. I was looking at lucene-584-take2 patch.
MatcherProvider in the earlier patch does look like something that will help with caching. >>Would those be a good starting point? Overall I feel uncomfortable with a lot of the classnames. I think the use of "Matcher" says more about what you want to do with the class in this particular case rather than what it does generally. I have other uses in mind for these classes that are outside of filtering search results. For me, these classes can be thought of much more simply as utility classes in the same mould as the java Collections API. Fundamentally, they are efficient implementations of sets/lists of integers with support for iterators. The whole thing would be a lot cleaner if classes were named around this scheme. In case there is a better name than Matcher for a Scorer without a score() method (and maybe without an explain() method), I'm all ears. Names are important, and at this point they can still be changed very easily.
For Matcher I'd rather have a method to estimate the number of matching docs than a size() method. This estimate would be useful in implementing conjunctions, as the Matchers with the lowest estimates could be used first. However, this is another issue. For the data structures (bitset/openbitset/sorted VintList/) I would suggest one of these: IntSet, IntegerSet or IntegerSequence as names for the common interface.
I did a quick Google for IntegerSet and you are in the number one spot, Paul // A cachable, immutable, sorted, threadsafe collection of ints. // A single-use thread-unsafe iterator. If detailed explanations of hits are required these should really sit with the source not the result- i.e. with the Filters. They contain all the match criteria used to populate IntegerSets and can be thought of more generically as IntegerSetBuilder. //Contains criteria to create a set of matching documents. MUST implement hashcode and equals based on this criteria to enable use as cache keys for IntegerSets. A single CachingIntegerSetBuilder class would be able to take ANY IntegerSetBuilder as a source, cache ANY type of IntegerSet they produced and defer back to the original IntegerSetBuilder for a full and thorough explanation of a match even when the match occurred on a cached IntegerSet, if required. class CachingIntegerSetBuilder implements IntegerSetBuilder The reason for introducing IntegerSetBuilder as a more generic name than "Filter" is IntegerSets have uses outside of filtering e.g. to do category counts or clustering. In these use cases they don't actually perform anything to do with filtering. It may actually be better named DocIdSetBuilder given that it is tied to Lucene's IndexReader and therefore limited to producing sets of document ids. For the moment a DocId is an int, but that might change to long sooner than we think. So DocIdSet... would be a better name than IntegerSet..., and it's better to use an abstract superclass than an interface:
abstract class DocIdSetIterator { boolean next(); boolean skipTo(int next); int doc(); } // and the rest is in the patch, except the superclass for Matcher: abstract class Matcher extends DocIdSetIterator { Explanation explain(int doc); } abstract class Scorer extends Matcher { float score(); ... } Would this DocIdSetIterator be close enough to the IntegerSetIterator? OK, I created a new patch based on the recent feedback.
This patch introduces two new abstract classes: /**
* A DocIdSet contains a set of doc ids. Implementing classes must provide
* a {@link DocIdSetIterator} to access the set.
*/
public abstract class DocIdSet {
public abstract DocIdSetIterator iterator();
}
/**
* This abstract class defines methods to iterate over a set of
* non-decreasing doc ids.
*/
public abstract class DocIdSetIterator {
public abstract int doc();
public abstract boolean next() throws IOException;
public abstract boolean skipTo(int target) throws IOException;
}
Additional changes:
I haven't changed the contrib Filter implementations yet, they still use All core & contrib tests pass. This patch should also be fully backwards On the take3 patch of 10 Jan 2008:
SortedVIntList extends DocIdSet: nice, thanks. PrefixGenerator is used but not defined in the patch, so it will not compile. Nevertheless, with all tests passing, I think this is a good way to Minor concerns: There is neither a BitSetFilter nor an OpenBitSetFilter in the patch. Some javadoc changes did not make it into the take3 patch, I'll check these later. FilteredQuery.explain(): When a document does not pass the Filter For the future: About adding a Filter as a clause to BooleanScorer, and adding
Thanks for the review!
Not sure I understand what you mean. PrefixGenerator is (and was)
I think that it should be straightforward for users having filters that use
Oh, which ones?
Yeah, it used to work this way, that's why I didn't change it for backwards-
Well, let's address this with a different issue after this one is committed. As for PrefixGenerator:
in my (up to date) trunk directory, this command: find . -name 'PrefixGenerator' only gave this result: ./build/classes/java/org/apache/lucene/search/PrefixGenerator.class and that disappeared after ant clean. It seems that the source class was removed from the trunk.
BitSetFilter would inherit from Filter, and have an abstract bits() method, not deprecated. OpenBitSetFilter similar, and that could be used in a few places in the patch iirc. The javadoc changes I meant came with Matcher and use 'match' consistently for documents
As I said, PrefixGenerator is defined in PrefixFilter.java.
OK I added a BitSetFilter. The rest of the patch is identical to take3. I thought that I copied all your javadoc changes that still applied (after the removal of Matcher, MatchFilter, etc.) over into this patch. But well, if you think I missed any important ones let me know or feel free to update the patch! I'm sorry about my PrefixGenerator remarks, I did not read your answer accurately.
On the take4 patch of 11 Jan 2008: I have started in a fresh trunk checkout that passed all tests. I remember having problems with moving contrib/xml-queryparser from Filter For me, this was the main reason to make this move: Less important for now: The test classes extend TestCase, but iirc there is also a LuceneTestCase for this. On the take4 patch ant javadocs-core gives this: I tried to move contrib from Filter.bits() to BitSetFilter.bits().
The ContribQueries20080111.patch does that with contrib/queries, For contrib/miscellaneous the changes needed to ChainedFilter are more involved: To make the tests pass, I had to make RangeFilter and QueryFilter subclasses of BitSetFilter, and to remove the final keyword from BitSetFilter.getDocIdSet(). Stepping back a bit, I think ChainedFilter might better move to OpenBitSetFilter. The conclusion is that I see no real problems with the take4 patch to move contrib it looks like ChainedFilter could become obsolete if Filter/DocSetIdIterator gets added as a Clause to the BooleanQuery? I am thinking along the lines: ChainedFilter evaluates boolean expression of docId-s, that is exactly what BooleanQuery does plus "a bit" more (scoring)...
The function is indeed the same, but ChainedFilter works directly on BitSets and BooleanQuery works on input Scorers/DocIdSetIterators and outputs collected docids (and score values). Working directly on (Open)BitSets is normally faster, so ChainedFilter can have a good use. That leaves the question on what to do with ChainedFilter here. Any ideas? hmm, in order to have fast and/or operations we need to know the type of the underlaying object in Filter, and sometimes we must use iterators (e.g. case where one Filter/DocSetId is int list and another Hash bit set ). I guess, knowing type of DocIdSet is the trick to pool.
Default implementation of ChainedFilter (there is also BooleanFilter somewhere in contrib, I like it more) should be using iterator (like scorers), and at a few key points checking if(first instance of SomeInstanceOfDocIdSet && second SomeInstanceOfDocIdSet) first.doFastOR/AND(second); something in that direction looks reasonable to me for ChainedFilter I moved Filter "forward" by removing the deprecated bits() method
and declaring the getDocIdSet() method abstract. To get ant test-core passing with this it was necessary to The change to Filter is not included in the patch. Upload once more, this time with licence.
I think I understand now which problems you had when you wanted to
change BooleanFilter and xml-query-parser to use the new Filter APIs. BooleanFilter is optimized to utilize BitSets for performing boolean Now we can introduce BitSetFilter as you suggested and what I did in One solution would be to add a getBitSet() method to DocIdBitSet. Then BooleanFilter could do something like this: DocIdSet docIdSet = filter.getDocIdSet(); if (docIdSet instanceof DocIdBitSet) { BitSet bits = ((DocIdBitSet) docIdSet).getBitSet(); ... // existing code } else { throw new UnsupportedOperationException("BooleanFilter only supports Filters that use DocIdBitSet."); } But then, changing the core filters to use OpenBitSets instead of So if we took this approach we would have to wait until 3.0 to move Any feedback about this is very welcome. I'll try to further think Michal, would this work?
1. providing default implementation for basic methods that is using skipping iterator(always there), so it works by default for all implementations, something along the lines: /**
public DocIdSet and(DocIdSet){ public DocIdSet or(DocIdSet){ // default implementation using iterator; } } 2. And then we optimize particular cases, e.g public class DocIdBitSet extends DocIdSet{ public DocIdSetIterator iterator(){ //this is easy... } public DocIdSet and(DocIdSet dis){ } So it works always, and it works fast if need be, one instanceof check does not hurt there. Did I miss something obvious? I indeed recall having an problem with remote filter caching. At the time I thought it was related to serialization but I could not resolve it that way. Never mind, it does not matter anymore.
BooleanFilter and ChainedFilter have the same issue here. As they provide just about the same functionality, could they perhaps be merged? The solution using DocIdSet.and() and DocIdSet.or() looks good to me, but it will require some form of collector for the results, much like HitCollector.collect(doc, score) now and MatchCollector.collect(doc) in the Matcher...patch. I'd like these boolean operations on DocIdSets to be general enough for use in Scorers, for example for the conjunctions in ConjunctionScorer, PhraseScorer and in the two NearSpans. But that is another issue. Hi Paul,
Just eyeballed the code but not had a chance to patch and run it yet. I was wondering about the return type for skipTo() after looking at these types of calls: You could save a method invocation in cases like this if skipTo() returned the next doc id rather than a boolean. Returning a -1 would be the equivalent of what used to be "false". Cheers
To change the signature of skipTo() would be an API change, because with this patch Scorer extends DocIdSetIterator. -Michael OK here's a new version of the patch.
It's based on the take4 patch with the following changes:
Comments:
All core & contrib tests pass. I'd like to commit this in a couple of days if nobody objects. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
there is some advanced things you are probably interested in.
see:
"some utilities for a compact sparse filter"
LUCENE-328Also interesting:
SOLR-15 OpenBitSet - ASF JIRA
complete solr solution for Filters is one cool thing! a bit awkward bridge to lucene due to BitSet in Filter, but this is due to be resolved...