|
thanks, this looks interesting.
Regards, As the title of this issue is as accurate as it gets, I'm attaching a series of patches and additions here that make Scorer a subclass of Matcher, while Matcher takes the current role of the BitSet in Filter.
All patches against trunk revision 417299. javadocs of HitCollector.java to use 'matching' instead of 'non-zero score'.
This is actually independent of the Matcher/Scorer change.
Paul Elschot made changes - 27/Jun/06 04:47 AM
javadocs of Searcher.java to use 'matching' instead of 'non-zero score',
and to describe the Filter effect more accurately. This is independent of the Matcher/Scorer change.
Paul Elschot made changes - 27/Jun/06 04:49 AM
MatchCollector.java with collect(int) method for org.apache.lucene.search.
Paul Elschot made changes - 27/Jun/06 04:52 AM
Matcher.java, including a match(MatchCollector) method, for org.apache.lucene.search.
Paul Elschot made changes - 27/Jun/06 04:53 AM
patch to Scorer.java to subclass Matcher.
Paul Elschot made changes - 27/Jun/06 04:55 AM
patch to Filter to add getMatcher() and to deprecate getBits() in favour of getMatcher().
Includes commented test code to test IndexSearcher using BitsMatcher.
Paul Elschot made changes - 27/Jun/06 04:58 AM
Patch to IndexSearcher.java to prefer getMatcher() over getBits() on Filter.
Also add method IndexSearcher.match(Query, MatchCollector).
Paul Elschot made changes - 27/Jun/06 05:00 AM
A Matcher constructed from a BitSet for org.apache.lucene.util.
Paul Elschot made changes - 27/Jun/06 05:02 AM
SortedVIntList.java for org.apache.lucene.util superseding the one in
Paul Elschot made changes - 27/Jun/06 05:05 AM
TestSortedVIntList.java, superseding the one in
Paul Elschot made changes - 27/Jun/06 05:07 AM
Paul Elschot made changes - 27/Jun/06 05:09 AM
I hope I got all the attachments right, please holler in case something does not patch or compile cleanly.
Some questions/remarks:
Regards, I've started to improve the javadocs of almost all code posted here, so it's probably not worthwhile to commit this as it is now.
I don't expect changes to the java code in the short term. Regards, Patches against trunk revision 417683, current.
Compared to previous patches/files, there are only javadoc updates, and the javadocs of Searchable are also patched.
Paul Elschot made changes - 28/Jun/06 02:19 PM
Paul Elschot made changes - 28/Jun/06 02:20 PM
Paul Elschot made changes - 28/Jun/06 02:22 PM
Paul Elschot made changes - 28/Jun/06 02:23 PM
Paul Elschot made changes - 28/Jun/06 02:23 PM
Paul Elschot made changes - 28/Jun/06 02:23 PM
Paul Elschot made changes - 28/Jun/06 02:24 PM
Paul Elschot made changes - 28/Jun/06 02:24 PM
Paul Elschot made changes - 28/Jun/06 02:25 PM
Paul Elschot made changes - 28/Jun/06 02:25 PM
Paul Elschot made changes - 28/Jun/06 02:25 PM
Added some javadocs to BitsMatcher.
Added Matcher constructor to SortedVIntList, and extended the test for this.
Paul Elschot made changes - 29/Jun/06 04:08 AM
Paul Elschot made changes - 29/Jun/06 04:08 AM
Paul Elschot made changes - 29/Jun/06 04:08 AM
Paul Elschot made changes - 29/Jun/06 04:09 AM
Perhaps the BitsMatcher class is better implemented as a static method in a new class MatcherUtils.createMatcher(BitSet).
Similar methods could be added for OpenBitSet, SortedVIntList, int[] and whatever data structure comes around for implementing Filter.getMatcher(IndexReader). When Matcher is a superclass of Scorer, TermScorer already implements a Matcher for TermDocs. As requested on java-dev, Matcher20060830.patch is the whole thing as a single patch, relative to srv/java/org/apache/lucene in the trunk, revision 438598 of 30 August 2006.
This does not contain the FIXME the earlier posted Filter-20060628.patch . Regards,
Paul Elschot made changes - 30/Aug/06 07:31 PM
Thanks Paul,
I like the Matcher/Scorer relation. It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right? When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs to be removed, or a check done in IndexSearcher.search() to disable skipping on the scorer if it's in use. I wonder what the performance impact is... for a dense search with a dense bitset filter, it looks like quite a bit of overhead is added (two calls in order to get the next doc, use of nextSetBit() instead of get(), checking "exhausted" each time and checking for -1 to set exhausted). I suppose one can always drop back to using a HitCollector for special cases though. Matcher20060830b.patch corrects 2 mistakes in Matcher20060830.patch:
Searchable.java was present twice, and TestSortedVIntList was not present. Thanks eks for pointing out the mistake. This patch was generated from the trunk directory. Regards,
Paul Elschot made changes - 30/Aug/06 08:52 PM
Yonik, as to you questions:
> It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right? Right. Only the previous Filter-20060628.patch contains some commented FIXME code to actually introduce a BitsMatcher in each case where a BitSet is used. >When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs Iirc the patch still supports the 1.4 BooleanScorer when a BitSet is returned by Filter. I'd have to have a look at the patched IndexSearcher to be sure though. > I wonder what the performance impact is... for a dense search with a dense bitset BitsMatcher could also work without the "exhausted" flag, but then an infinite loop Regards, using the latest Matcher20060830.patch
ant said "BUILD SUCCESSFUL" I will see how it works on some real life cases using our 50Mio collection and report back what our standard app level tests have to say (we have standardized collection /requsts/hits so bad things should pop -up quickly). Need a day or two for this. thanks for this work.
Paul Elschot made changes - 30/Aug/06 09:32 PM
Hi Paul,
for me, this patch did not cause any incompatibility issues. All our tests passed without noticing any difference to the previous trunk version. No performance changes as well ( we use HitCollector only, so Yoniks comment does not apply here). Tests are application level, and make index hot (6hrs searches with test batch of requests with known responses), 50Mio not artificial docs, real requests... Early this week we will try to implement our first Matchers and see how they behave > No performance changes as well.
It's good to hear that. As mentioned earlier, this is groundwork only. Which comment of Yonik related to HitCollector do you mean? > Early this week we will try to implement our first Matchers and see how they behave BitsMatcher and SortedVIntList could start that. Paul,
What is exact semantics of skipTo(int) in Matcher?
Is this intended behavior, "skip forward until you reach end, and then, you are at the end Here are some Matcher implementations,
-DenseOpenBitsMatcher - Using solr BitSetIterator (for skipTo() to work, one method in BitSetIterator should become public) Also attached one simple test (just basic fuctionality) that also contains one dummy relative performance test Perf. test simply iterates over different Matcher implementations and measures ellapsed time (not including Matcher creation, pure forward scan to the end) for different set bit densities. imho, this code is not sufficiantly tested nor commented, needs an hour or two. As expected, Yonik made this ButSetIterator really fast. What was surprise for me was OpenBitSet nextSetBit() comparing bad to the BitSet (or I made some dummy mistake somewhere?)
Eks Dev made changes - 04/Sep/06 09:02 PM
Paul,
What is next now, we did on our app enough experiments and are now sure that this patch causes no incompatibilities. We also tried to replace our filters with OpenBitSet and VInt matchers and results there are more than good, our app showed crazy 30% speed-up!!! Hard to identify where from exactly, but I suspect VInt matcher in case of not too dense BitVectors increased our Filter Cache utilization significantly. I would propose to commit this patch before we go further with something that would actually utilize Matcher. Just to avoid creating monster patch on patch ... This is ground work, and now using Matcher will be pure poetry, I see a lot of places we could see beter life by using use Matchers, ConstantScoringQuery, PreffixFilter, ChainedFilter (becomes obsolete now)... actually replace all uses of BitSet with OpenBitSet (or a bit smarter with SortedIntList. VInt...)... Do you, or someone else see now things to be done before commiting this? > Do you, or someone else see now things to be done before commiting this?
Yes. In the steps listed here: Paul Elschot. In the inheritance from Matcher to Scorer there is an asymmetry
in this patch. Matcher provides a default implementation for Matcher.explain() With matcher implementations quite a few other implementation Regards, I wrote:
> One could add an abstract Scorer.explain() to catch these, or by mistake. The good news is that the patch leaves the
Paul Elschot made changes - 01/Nov/06 07:59 PM
Corrected javadoc refs to use IndexInput and IndexOutput.
Paul Elschot made changes - 01/Nov/06 08:01 PM
I have just resolved some minor local conflicts on the updated copyrights of four java files.
Please holler when a fresh patch is needed. As 2.1 is out, here is a new patch to try and revive this.
This replaces the pevious Matcheryyyymmdd.patch one of 2006
Paul Elschot made changes - 26/Feb/07 08:03 PM
Paul Elschot made changes - 26/Feb/07 08:04 PM
It's been a while since i looked at this issue, but it's come up in discussion recently so i took another glance...
Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses it to decide whether or not to iterator over the (non null) Matcher, or over the BitSet from Filter.bits. is there any reason that logic can't be put in getMatcher, so that if subclasses of Filter don't override the getMatcher method it will call bits and then return a Matcher that iterates over the set Bits? (this is the roll-out approach i advocated a while back when discussing this on email, excecept that at the time Matcher was refered to as SearchFilter: http://www.nabble.com/RE%3A-Filter-p2605271.html Thinking about it now, we could even change Filter.bits so it's no longer abstract ... it could have an implementation that would call getMatcher, and iterate over all of the matched docs setting bits on a BitSet that is then returned ... the class would still be abstract, and the class javadocs would make it clear that subclasses must override at least one of the methods ... legacy Filters will work fine because they'll already have a bits method, and people writing new Filters will see that bits is deprecated, so they'll just write a getMatcher method and be done. This appears to be the same approach taken with Analyzer.tokenStream back in 1.4.3... http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Analyzer.html Hoss,
>Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses Two reasons:
There is also > we could even change Filter.bits so it's no longer abstract ... it could have I must say that creating a BitSet from a Matcher never occurred to me. Paul:
Applied the patch, applied cleanly, run ant test -> BUILD SUCCESSFUL I'm primarily interested in using this in order to get matches, but avoid scoring. From what I can tell, I'd just need to switch to using the new match(Query, MatchCollector) method in IndexSearcher. However, I need Sort and TopFieldDocs, and I don't see a match method with those. Is there a reason why such a match method is not in the patch? Otis:
> However, I need Sort and TopFieldDocs, and I don't see a match method with those. A bit silly perhaps, but what sort criterion would like to have used when no score() value is available? I don't know the sorting code, but it might be possible to use a field value for sorting. > BitsMatcher could also work without the "exhausted" flag, but then an infinite loop
> might occur when trying to continue after the first time next() or skipTo() returned false. > Continuing after false was returned in these cases is a bug, however an infinite loop > can be difficult to debug. I'd rather be on the safe side of that with the exhausted flag and wait for an actual > profile to show the performance problem. We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed. > We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed.
I fully agree. The "exhausted" flag is not much more than a matter of taste. In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my working copy. > In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my
> working copy. But I don't know yet how I would resolve that conflict Ah, that's a bit different. You mean there are cases that are non-trivial to fix where next() is called after false is returned? Perhaps I did something wrong with the benchmark, but I didn't get any speed-up when using searcher.match(Query, MatchCollector) vs. searcher.search(Query, HitCollector).
Here are the benchmark numbers (50000 queries with each), HitCollector first, MatchCollector second: HITCOLLECTOR: [java] ------------> Report Sum By (any) Name (11 about 41 out of 41) [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41) MATCHCOLLECTOR: [java] ------------> Report Sum By (any) Name (11 about 41 out of 41) [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41) This is what I did for the benchmark. I used Doron's handy conf/benchmark. $ diff conf/micro-standard.alg conf/matcher-micro-standard.alg Then I added 2 new Tasks for benchamrking the Matcher (searcher.search(Query, MatchCollector)) and modified the ReadTask to call searcher.search(Query, HitCollector) instead of the method to get Hits. I commented out all search results traversal and doc retrieval, as I didn't care to measure that.
Otis Gospodnetic made changes - 05/Apr/07 09:33 PM
The benchmark does not search with filters. Is any speedup still expected? (why?)
I applied the patch on current trunk and ran the benchmark - it shows that when all queries use the same reader, Match is faster while when each query opens its own reader bitset is faster. Is this an expected result? Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem SrchMtchSamRdr_5000 - 10 5000 642.2 77.85 12,331,866 16,408,576 SrchBitsSamRdr_5000 - - - - 10 - - - 5000 - - 586.9 - - 85.20 - 9,515,875 - 12,009,472 SrchMtchNewRdr_500 - 10 500 134.7 37.11 13,376,113 17,171,660 This test is using all Reuters documents and the searches rounds are repeated 10 times. The Match tasks were not included so I wrote them. The updated bench-diff.txt attached contains these task classes and the algorithm. (When you use this, note that once the index is created you can comment the first part - the "Populate" part - and then only rerun the querying part.)
Doron Cohen made changes - 08/Apr/07 03:59 AM
One line was cut out - here are the four lines again
Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem Doron, thanks for jumping on this!
1. I thought I'd see better performance with the Matcher because it skips scoring. While Paul's patch does make changes to the Filtering code, I'm more focused on HitCollector vs. MatchCollector performance here. Am I missing something here? If scoring is skipped, we should see at least some speed improvement, and your results show that. 2. You said you did see MatchCollector was faster than HitCollector. Hmmm, weird, not in my 4 runs: Matcher: I'll try it again on a different computer. My previous runs were on a Mac with OSX. 3. My bench-diff.txt did include Match tasks: $ grep Match bench-diff.txt | grep class ... but I didn't svn add them, so I produced the "diff" by simply cat-ing the new tasks to bench-diff.txt . So if you used my bench-diff.txt as a patch, it wouldn't have worked. Not a big deal, just clarifying. ...right, your diff-txt had the Match tasks - I missed that - checked it, it is exactly what I did, so we're ok here.
When you rerun, you may want to use my alg - to compare the two approaches in one run. You can run this by something like: Also, to get cleaner results, add the line: I tried like this twice, and got inconsistent results: When the bitset searches preceded the match searches: When the match searches preceded the bitset searches: My conclusion from this is that the speed-up, if exists, is minor, at least for the setup of this test. There are only 15 unique queries in this test - also printed in the log - are these the queries you would expect to save in? I didn't follow this issue very closely so I don't know where the saving is expected here. Both SearchTask and MatchTask now do nothing in collect, so no difference at the actual collect() call. Also, Scorer.score(HitCollector) and Matcher.match(MatchCollector) are very similar: I think there is a potential gain for large boolean OR queries, because score() would have to call next() on all TermScorers and collect/sum their scores, while match() could use skipTo(last+1) because any match encountered is a match and there is no need to sum the individual scores for the same doc by other scorers. However as far as I can tell, current match() implementation does not take advantage of this, but I may be overlooking something? > When you rerun, you may want to use my alg - to compare the two approaches in one run.
This is more dangerous though. GC from one method's garbage can penalize the 2nd methods performance. The hotspot effect can be minimized by running the test multiple times in the same JVM instance and discarding the first runs, but it's not so easy for GC. Instead of discarding the first run, the approach I usually take is to run 3-4 times and pick the minimum. You can then run several of these "sets" and average over the minimum of each. GC is still an issues, though. It is hard to get around when it is a mark&sweep collector (reference counting is much friendlier in this regard)
> > When you rerun, you may want to use my alg - to compare the two approaches in one run.
> This is more dangerous though. Agree. I was trying to get rid of this by splitting each round to 3: - gc(), warm(), work() - when work() and warm() are the same, just that warm()'s stats are disregarded. Still switching the order of "by match" and "by bits" yield different results. Sometimes we would like not to disregard GC - in particular if one approach is creating more (or more complex) garbage than another approach. Perhaps we should look at two measures: best & avg/sum (2nd ignoring first run, for hotspot). Doron: just to address your question from Apr/7 - I expect/hope to see an improvement in performance because of this difference:
hc.collect(doc(), score()); the delta being the cost of the score() call that does the scoring. If I understand things correctly, that means that what grant described at the bottom of http://lucene.apache.org/java/docs/scoring.html > No Scorer, no BooleanScorer(2), no ConjunctionScorer...
Thanks, I was reading "score" instead of "score()"... But there is a scorer in the process, it is used for next()-ing to matched docs. So most of the work - preparing to be able to compute the scores - was done already. The scorer doc queue is created and populated. Not calling score() is saving the (final) looping on the scorers for aggregating their scores, multiplying by coord factor, etc. I assume this is why only a small speed up is seen. Ahhhh. I'll look at the patch again tomorrow and follow what you said. All this time I was under the impression that one of the points or at least side-effects of the Matcher was that scoring was skipped, which would be perfect where matches are ordered by anything other than relevance.
DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the matching scorers in the ScorerDocQueue during next(), in order to accumulate an aggregate score. The MatchCollector can't save you from that.
That could be improved in a DisjunctionMatcher.
With a bit of bookkeeping DisjunctionSumScorer could also delay calling score() on the subscorers but the bookkeeping would affect performance for the normal case. For the usual queries the score() call will never have much of a performance impact. To determine which documents match the query, the index need to be accessed, Ah, too bad.
Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel boxes, Intel boxes with Linux on them actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and various floating point calculations, we'd see better performance, esp. on Niagara boxes. Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with. By fastest cache I meant the L1 cache of the processor. The size is normally in tens of kilobytes.
An array lookup hitting that cache takes about as much time as a floating point addition. During a query search the use of a.o. the term frequencies, the proximity data, and the document weights normally cause an L1 cache miss. I would expect that by not doing the score value computations, only the cache misses for document weights can be saved. I'm a little behind on following this issue, but if i can attempt to sum up the recent discussion about performance...
"Migrating towards a "Matcher" API may allow some types of Queries to be faster in situations where clients can use a MatchCollector instead of a HitCollector, but this won't be a silver bullet performance win for all Query classes – just those where some of the score calculations is (or can be) isolated to the score method (as opposed to skipTO or next)" I think it's important to remember the motivation of this issue wasn't to improve the speed performance of non-scoring searchers, it was to decouple the concept of "Filtering" results away from needing to populate a (potentially large) BitSet when the logic neccessary for Filtering can easily be expressed in terms of a doc iterator (aka: a Matcher) – opening up the possibility of memory performance improvements. A second benefit that has arisen as the issue evolved, has been the API generalization of the "Matcher" concept to be a super class of Scorer for simpler APIs moving forward. Right. I was under the wrong impression that the Matcher also happens to avoid scoring. However, now that we've all looked at this patch (still applies cleanly and unit tests all pass), and nobody had any criticisms, I think we should commit it, say this Friday.
As I'm in the performance squeezing mode, I'll go look at I'm still behind on following this issue, but Otis: if you are interested in moving forward with this, you might consider trying the cahnges i proposed in my "15/Mar/07 11:06 AM" Comment...
https://issues.apache.org/jira/browse/LUCENE-584#action_12481263 ...I think it would keep IndexSearcher a little cleaner, and make it easier for people to migrate existing Filter's gradually (without requiring extra work for people writing new "Matcher" style Filters from scratch) With 2.2 out, and
The patch still applies cleanly, and I'd like to start working on a skipping extension of SortedVIntList, much like the latest index format for document lists. This DefaultMatcher2007072.patch adds a default Matcher to be used in Filter instead of the BitSet . It contains static methods that create a default Matcher from a BitSet and from an OpenBitSet. The patch also add OpenBitSet to org.apache.lucene.util; it was taken from a recent solr revision.
In this way the deprecation of Filter.bits(IndexReader) can be done by replacing implementations of that method by Filter.getMatcher(IndexReader) and adding the above default Matcher in the return statement: The idea is to have this hook available so that a sensible default Matcher is easily available, that can also be adapted to use better Matcher implementations when these become available. I have begun introducing the default matcher in my working copy of the core, but as expected, that turns out to be quite a bit of work.
Paul Elschot made changes - 25/Jul/07 10:21 AM
There is some code in contrib where a Filter is assumed to have BitSet available:
contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java When Filter is going to move from BitSet to Matcher, these will have to be reimplemented. The whole thing in three patches:
The Matcher-ground patch is the groundwork, which should be very similar to the earlier groundwork patch. The Matcher-default patch provides a default implementation, the same as the one I posted earlier today. Among others, It includes OpenBitSet from solr in org.apache.lucene.util . The Matcher-core uses the default implementation inside the rest of the lucene core and test code. It replaces Filter.bits() methods with Filter.getMatcher() methods in the subclasses of Filter. All core tests pass with these patches aplied., except the one below. [junit] Testsuite: org.apache.lucene.search.TestRemoteCachingWrapperFilter Finally, contrib may not even compile with the patches applied.
Paul Elschot made changes - 25/Jul/07 06:12 PM
Paul Elschot made changes - 25/Jul/07 06:13 PM
Paul Elschot made changes - 25/Jul/07 06:14 PM
Paul Elschot made changes - 25/Jul/07 06:15 PM
Paul Elschot made changes - 25/Jul/07 06:15 PM
Paul Elschot made changes - 25/Jul/07 06:15 PM
Paul Elschot made changes - 25/Jul/07 06:15 PM
Paul Elschot made changes - 25/Jul/07 06:15 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Paul Elschot made changes - 25/Jul/07 06:16 PM
Hi Paul,
Not sure if I'm missing something but I think this patch may not work for scenarios other than the simple option of a single filter being used on a search. A Matcher does not have the same utility as a BitSet because using a BitSet you can: 1) iterate across it using multiple threads. I think these differences become important in the following scenarios : In CachingWrapperFilter I don't think you can cache Matchers instead of bitsets - because Matchers don't have features 1 and 4 BooleanFilter and ChainedFilter in contrib don't work with Matchers because there is no support for 3) Is there something obvious I've missed? Cheers Have a look at BitSetMatcher in the -default patch. It is constructed from a BitSet, and it has a method getMatcher() that returns a Matcher that acts as a searching iterator over the BitSet.
So that is 1) to 4), at least potentially. A clone() method is currently not implemented iirc, but each call to getMatcher() will return a new iterator over the underlying BitSet. And when guaranteed non modifyability is needed, a constructor can take a copy of the given document set, in whatever form. The point of Matcher is that it allows other implementations than BitSet, like OpenBitSet and SortedVIntList. Both have the properties that you are looking for. SortedVIntList can I'd like to have a skip list version of SortedVIntList, too. This would be slightly larger than SortedVIntList, but more efficient on skipTo(). But the first thing that is necessary is to have Filter independent from BitSet. The real pain with that is going to be the code that currently implements Filters The default implementation will probably need to be improved from its current Cheers, I forgot to mention that boolean logic on Matchers is already in present in BooleanScorer2.
This is because each Scorer is a Matcher. Thanks for the reply, Paul.
I saw BitSetMatcher etc and appreciate the motivation behind the design for alternative implementations . What concerns me with the Matcher API in general is that Matchers have non-threadsafe safe state (i.e. the current position required to support next() )and as such aren't safely cachable in the same way as BitSets. I see the searcher code uses the safer skipTo() rather than next() but there's still the "if(exhausted)" thread safety problem to worry about which is why I raised points 1 and 4. Additionally, combining Bitsets using Booolean logic is one method call whereas combining heterogenous Matchers using Boolean logic requires iteration across them and therefore potentially many method calls (point 3). I haven't benchmarked this but I imagine it to be significantly slower? Cheers Mark,
The exhausted flag is only in the iterator/Matcher, not in the underlying set data structure. One can use as many iterators as necessary, for example one per thread, and then there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.) You wrote: I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations. That can still be done, but one needs to get to the BitSets for example by caching them outside the Filters and constructing the resulting BitSetMatcher for the combined Filter on the fly. An alternative would be to have a BooleanQuery.add(Matcher, Occur), where the occurrence can only be required or prohibited. Then there is no need to construct any resulting filter because the boolean logic will be executed during the search. This might even be more efficient than combining the full BitSets ahead of the search. And with many large BitSets cache memory savings from more compact implementations can also be helpful. Mark,
An easy way to keep things like BooleanFilter working could be to That would only involve some deprecation warnings in BitsFilter for the period I would not even mind cooking this up as patch to contrib. Thoughts? A different take in the patches of 20070730.
In this version class Filter has only one method: Class BitSetFilter is added as a subclass of Filter, and it has the familiar In the ..core.. patch, and in the ..contrib.. patches (to follow), most uses of Filter simply replaced by BitSetFilter. This turned out to be an easy way of dealing This change to Filter and its replacement by BitSetFilter could well be taking The ..default.. patch contains a default implementation of a Matcher from a BitSet, and it has OpenBitSet and friends from solr, as well as SortedVIntList as posted earlier.
Paul Elschot made changes - 30/Jul/07 07:15 PM
Paul Elschot made changes - 30/Jul/07 07:16 PM
Paul Elschot made changes - 30/Jul/07 07:16 PM
Paul Elschot made changes - 30/Jul/07 07:17 PM
Some 20070730 patches to contrib using BitSetFilter.
The contrib-misc and contrib-queries patches are reasonbly good, their tests pass and replacing Filter by BitSetFilter is right for them. However, I'm not happy with the contrib-xml patch to the xml-query parser.
Paul Elschot made changes - 30/Jul/07 07:24 PM
Uploading the patches again, this time with the ASF license.
Paul Elschot made changes - 30/Jul/07 07:27 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:28 PM
Paul Elschot made changes - 30/Jul/07 07:29 PM
Some more remarks on the 20070730 patches.
To recap, this introduces Matcher as a superclass of Scorer to take the role that BitSet currently has in Filter. The total number of java files changed/added by these patches is 47, so some extra care will be needed. What approach should be taken for the API change to Filter (see above, 2 comments up)? I'd like to get all test cases to pass again. TestRemoteCachingWrapperFilter still does not pass, and For xml-query-parser in contrib I'd like to know in which direction to proceed (see 1 comment up). The ..default.. patch has taken OpenBitSet and friends from solr to have a default implementation. Once these issues have been resolved far enough, I would recommend to introduce this shortly after a release so there is some time to let things settle. Hi Paul,
Many thanks for your responses. Sorry for the delay in communications - just got back from 2 weeks holiday and slowly picking my way through this patch. You said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)" Did you mean BitSetFilter.getMatcher()? BitSetMatcher has no getMatcher method. If so, doesn't my original thread safety issue still stand? - CachingWrapperFilter is caching Matchers (not Filters which are factories for matchers). The existing approach of adding a <CachedFilter> tag around my XML-based query templates offers a major speed up in my applications and I don't see this supported in this patch currently which gives me some concern. This existing caching technique is based on the use of CachingWrapperFilter. The proposed framework seems to be missing a means of caching reusable, threadsafe Matchers in a type-independent fashion. One solution (which I think you may be suggesting with the "getMatcher" comment) is to cache Filter objects and use Filter.getMatcher(reader) as a factory method for thread-specific, single-use Matchers but this would suggest that any caching then becomes an implied responsibility/overhead of each Filter implementation. Not too great. CachingWrapperFilter is an example of a better design where the caching policy has been implemented in a single class and it can be used to decorate any Filter implementation (RangeFilter etc) with the required caching behaviour. Unfortunately with this proposed patch there is no way that any such single caching policy can work with any Filter because Matcher is not reusable/cachable. Time to remove any thread-specific state from Matcher? Cheers Some further thought on the roles/responsibilities of the various components:
Given a blank sheet of paper (a luxury we may not have) the minimum requirements I would have could be met with the following: interface DocIdSetFactory interface DocIdSet interface DocIdSetIterator In the existing proposal it feels like DocIdSet and DocIdSetIterator are rolled into one in the form of the Matcher which complicates/prevents caching strategies. Cheers Mark,
I said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)" A Matcher is intended to be used in a single thread, so I don't expect thread safety problems. The problem for the XML parser is that with this patch, the implementing data structure of a Filter becomes There is indeed no caching of filters in this patch. As for DocIdSet, as long as this provides a Matcher as an iterator, it can be used to I don't think this patch complicates the implementation of caching strategies. I did consider defining Matcher as an interface, but I preferred not to do that because Hi Paul,
Not sure we've reached a common understanding here yet. You said "That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two. " I too made a mistake in my earlier comments. DocIdSetIterator does NOT have "probably one implementation". There would be an implementation for each different type of DocIdSet (Bitset/OpenBitSet/VIntList). You said "some Filters do not need a cache. For example: TermFilter". I'm not sure why that has been singled out as not worthy of caching. I have certain terms (e.g. gender:male) where the TermDocs is very large (50% of all docs in the index!) so multiple calls to TermDocs for term "gender:male" (if that is what you are suggesting) is highly undesirable. These are typically handled in the XMLQueryParser using syntax like this: You said: "CachingWrapperFilter could then become a cache for BitSetFilter. " If this is undesirable (my guess is "yes") then the proposal in my previous comment is a solution which allows for caching of any/all types of the new sets (openBitSet,BitSet,SortedVIntList etc) Regardless of my choice of class names or decisions over interfaces vs abstract classes do you not at least agree the need for 3 types of functionality: 1) A factory for instantiating sets of document ids matching a particular set of criteria (which can be costly to call). While the factory is not expected to implement a caching strategy it is expected to implement hashcode/equals simply to aid any caching services which would need this help to identify previously instantiated sets which share the same criteria as ant new requests (This service I identified as my "DocIdSetFactory" and TermsFilter/RangeFilter would be example implementations). Each type of functionality can have different implementations so the functionality must be defined using an interface or abstract class. Cheers Mark,
I think we are one the same line, it's just that I don't want to go that far now. And that also makes it a starting point for caching of different data structures for Filters. The DefaultMatcher class tries to do some compression by using a SortedVIntList when that is smaller than a BitSet, and that is about as far as I'd like to go now. Proost, OK, I appreciate caching may not be a top priority in this proposal but I have live systems in production using XMLQueryParser and which use the existing core facilities for caching. As it stands this proposal breaks this functionality (see "FIXME" in contrib's CachedFilterBuilder and my concerns over use of unthreadsafe Matcher in the core class CachingWrapperFilter)
I am obviously concerned by this and keen to help shape a solution which preserves the existing capabilities while adding your new functionality. I'm not sure I share your view that support for caching can be treated as a separate issue to be dealt with at a later date. There are a larger number of changes proposed in this patch and if the design does not at least consider future caching issues now, I suspect much will have to be reworked later. The change I can envisage most clearly is expressed in my concern that the DocIdSet and DocIdSetIterator services I outlined are being combined in Matcher as it stands now and these functions will have to be separated. Cheers I, unfortunately, haven't had the time to read through everything in the latest patches, but catching up on my jira mail one of Paul's comments jumped out at me, so i wanted to make sure it's completley clear: this latest set of patches completely breaks backwards compatibility for any clients who have Filter subclasses, or methods that take a Filter as a param, since the Filter class now has an abstract getMatcher method and no longer supports an abstract BitSet method – presumably the expectation being that all client code should have a search/replace done from Filter=>BitSetFilter
which begs the question: why not eliminate BitSetFilter and move it's getMatcher impl to the Filter class? (if the concern is just that there be a "higher level" class in which both methods are abstract, why not insert a parent with some new name above the Filter class?) For the record: it really bothers me that the old attachments got deleted ... the inability to refresh my memory by looking at the older patches and compare them with the current patches is extremely frustrating This set of patches indeed break backward compatibility with the current Filter class.
That was done to show the ideal end situation, and to make sure that the patched code is indeed there. To get backward compatibility I'd prefer to temporarily copy the functionality from BitSetFilter into the Filter class, while still leaving BitSetFilter as it is: public class Filter { /** this method will become abstract once the bits() method is removed from Filter: */ The main difference between the current set of patches and the removed patches is indicated Another way to decouple from BitSet would be to keep introduce a new superclass of Filter that only has an abstract getMatcher() method, and to add an implementation of that method in the current Filter class.
That would boil down to the current patch with two classes renamed: Filter -> new class with abstract getMatcher() method. BitSetFilter -> Filter. This would avoid all backward compatibility issues, except for the unlikely case in which a getMatcher() method is already implemented in an existing subclass of Filter. For the lucene core and some of the contrib, this would mean that it would move to this new superclass of Filter. Again, I don't expect backward compatibility issues there. Does anyone see any problems with this approach? This time (20070905), as indicated in the previous post, a set of patches that add MatchFilter as the new superclass of Filter. Backward compatibility is quite good, no changes at all are necessary in contrib.
In the 1ground patch, the current core API is moved from Filter to MatchFilter. Since Filter is a subclass of MatchFilter, I do not expect backward compatibility issues with this, but it is a quite extensive API change. In the 2default patch, some support for MatchFilter caching was added in classes DefaultMatcher and MatcherProvider. OpenBitSet and some support for that was added from solr here, but OpenBitSet is not used (yet). SortedVIntList is also added, and this is used for caching in CachingWrapperFilter as below. In the 3core patch, this caching support is used in CachingWrapperFilter. See also java-dev of yesterday and today for a fixed thread safety problem there. I also added the Apache Licence to all new files. All tests pass with the patches applied, core and contrib. These 3 patches modify 35 source code files, so please tread carefully. They were generated against revision 573048. I will remove my previous patch set in a week or so.
Paul Elschot made changes - 05/Sep/07 09:27 PM
Paul Elschot made changes - 18/Sep/07 03:27 PM
Paul Elschot made changes - 18/Sep/07 03:27 PM
Paul Elschot made changes - 18/Sep/07 03:28 PM
Paul Elschot made changes - 18/Sep/07 03:28 PM
Paul Elschot made changes - 18/Sep/07 03:28 PM
Paul Elschot made changes - 18/Sep/07 03:28 PM
I don't think
public interface AbstractBitSet is according to standards. It should just be public interface BitSet possibly public interface IBitSet if coming from the Windows world. Since it is in a different package, there is no collision with the standard BitSet class. The posted patch proposes to use this class to determine which documents should be filtered:
public abstract class Matcher { This class is then used as a superclass of org.apache.lucene.search.Scorer. As the current patch set is large, I've been pondering how to do this in a series of smaller patches that can each be applied by itself. This is possible in the following way:
1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently. Step 1 can be reasonably done before a new a release. After that three (almost) independent paths can be followed: My question is: shall I go ahead and provide a patch for step 1? At the moment I'm refining BooleanScorer2. to use Matcher. This is for the case of multiple prohibited clauses, and also to allow the use of required and prohibited Matchers to allow adding filtering clauses to BooleanQuery. This Matcher-20071008-1ground.patch replaces the previous version because in between there was a conclict with the javadocs of Scorer for document ordering.
In today's version, Scorer is unchanged, except for the superclass Matcher, and Matcher reuses the javadocs of Scorer as much as possible.
Paul Elschot made changes - 08/Oct/07 06:13 PM
Paul Elschot made changes - 08/Oct/07 06:13 PM
Attached once more, this time with licence granted to ASF.
Paul Elschot made changes - 08/Oct/07 06:15 PM
Resolved a local conflict in the javadocs of HitCollector.
Paul Elschot made changes - 22/Nov/07 06:31 PM
Paul Elschot made changes - 22/Nov/07 06:31 PM
Paul Elschot made changes - 22/Nov/07 06:32 PM
Resolved a local conflict in the javadocs of HitCollector. This time with licence granted to ASF.
Paul Elschot made changes - 22/Nov/07 06:33 PM
Paul Elschot made changes - 22/Nov/07 06:34 PM
Paul, I like the iterative plan you suggested. I started reviewing the
The patch is backwards compatible, except for current subclasses of Filter already have a getMatcher method. The fact that no changes are needed to contrib confirms the compatibility.
I have made no performance tests on BitSetMatcher for two reasons. DefaultMatcher should be in the ...2default... patch. So far, I never tried to use these patches on their own, I have only split them for a better overview. Splitting the combined patches to iterate would need a different split, as you found out. It might even be necessary to split within a single class, but I'll gladly do that.
Michael Busch made changes - 28/Nov/07 09:15 AM
OK, here's a patch that compiles cleanly on current trunk and all tests
pass. It includes:
Would you be up for providing testcases? As I said I haven't fully reviewed the patch, but I'm planning to do
Michael Busch made changes - 28/Nov/07 09:25 AM
With the full patch applied, the following test cases use a BitSetMatcher:
TestQueryParser so I don't think it is necessary to provide seperate test cases. Yes you're right, I ran the tests w/ code coverage analysis enabled, and the
BitSetMatcher is fully covered. Good!
I think that custom Searcher or Searchable implementations won't compile anymore? @@ -86,13 +86,14 @@
* <p>Called by {@link Hits}.
*
* <p>Applications should usually call {@link Searcher#search(Query)} or
- * {@link Searcher#search(Query,Filter)} instead.
+ * {@link Searcher#search(Query,MatchFilter)} instead.
* @throws BooleanQuery.TooManyClauses
*/
- TopDocs search(Weight weight, Filter filter, int n) throws IOException;
+ TopDocs search(Weight weight, MatchFilter filter, int n) throws IOException;
I tried implementing a Searchable, and indeed ran into compilation errors.
So, backward compatibility is indeed not complete. Also, Searchable is an interface, so it should not be changed. There are two ways out of this: Do a name change on MatcherFilter/Filter -> Filter/BitSetFilter. I'd prefer to add an interface (or abstract class?) like Searchable that uses MatchFilter for those implementers that want to take advantage of MatchFilter. Why do we actually need the new MatchFilter class at all?
Filter is an abstract class, not an interface. So I think we could simply add the new method getMatcher() like you already did in your patch: /** * @return A Matcher constructed from the provided BitSet. * @see DefaultMatcher#defaultMatcher(BitSet) */ public Matcher getMatcher(IndexReader reader) throws IOException { return new BitSetMatcher(bits(reader)); } This shouldn't break existing Filter implementations? For example, OpenBitSetFilter would like this:
class OpenBitSetFilter /* ... */ {
OpenBitSet bits(reader) { ... }
Matcher getMatcher(reader) { ... }
}
Since the only thing needed by an IndexSearcher would be the Matcher, What about adding the getMatcher() method to Filter and
deprecating bits(IndexReader)? Then when we release 3.0 we can remove bits() and the only method in Filter will be getMatcher(). Then this patch should be backwards compatible I had not thought about deprecating yet, but it should work nicely.
I suppose you want to add a class BitSetFilter (subclass of Filter) as the preferred alternative to the deprecated method? Initially Filter and BitSetFilter would be very similar, except that Filter.bits() would be deprecated. Later, after removal of Filter.bits(), Filter.getMatcher() would be declared abstract. I tried to do something pretty close to this for contrib/xml-query-parser, but I could not make that work, OK, here's a new version of the patch.
Changes:
All testcases pass and believe this should be fully backwards compatible. In Lucene 3.0 we should make the following changes:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
there is some advanced things you are probably interested in.
see:
"some utilities for a compact sparse filter"
LUCENE-328Also interesting:
SOLR-15 OpenBitSet - ASF JIRA
complete solr solution for Filters is one cool thing! a bit awkward bridge to lucene due to BitSet in Filter, but this is due to be resolved...