Issue Details (XML | Word | Printable)

Key: LUCENE-1187
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Michael Busch
Reporter: Paul Elschot
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Things to be done now that Filter is independent from BitSet

Created: 23/Feb/08 10:33 AM   Updated: 03/Jun/08 07:08 PM
Return to search
Component/s: contrib/*, Search
Affects Version/s: None
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works BooleanFilter20080325.patch 2008-03-25 07:48 PM Paul Elschot 17 kB
Text File Licensed for inclusion in ASF works ChainedFilterAndCachingFilterTest.patch 2008-03-13 10:54 PM Mark Miller 2 kB
Text File Licensed for inclusion in ASF works Contrib20080325.patch 2008-03-25 10:31 PM Paul Elschot 26 kB
Text File Licensed for inclusion in ASF works Contrib20080326.patch 2008-03-26 06:23 PM Paul Elschot 27 kB
Text File Licensed for inclusion in ASF works Contrib20080427.patch 2008-04-27 07:19 PM Paul Elschot 28 kB
Text File Licensed for inclusion in ASF works javadocsZero2Match.patch 2008-02-26 09:30 PM Paul Elschot 2 kB
Text File Licensed for inclusion in ASF works lucene-1187.patch 2008-05-22 10:57 PM Michael Busch 48 kB
Text File Licensed for inclusion in ASF works lucene-1187.patch 2008-05-22 08:15 AM Michael Busch 32 kB
Text File Licensed for inclusion in ASF works OpenBitSetDISI-20080322.patch 2008-03-22 09:44 PM Paul Elschot 2 kB

Lucene Fields: Patch Available, New
Resolution Date: 23/May/08 07:27 PM


 Description  « Hide
(Aside: where is the documentation on how to mark up text in jira comments?)

The following things are left over after LUCENE-584 :

For Lucene 3.0 Filter.bits() will have to be removed.

There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the boolean behaviour of a Filter.

I have not looked into Filter caching yet, but I suppose there will be some room for improvement there.
Iirc the current core has moved to use OpenBitSetFilter and that is probably what is being cached.
In some cases it might be better to cache a SortedVIntList instead.

Boolean logic on DocIdSetIterator is already available for Scorers (that inherit from DocIdSetIterator) in the search package. This is currently implemented by ConjunctionScorer, DisjunctionSumScorer,
ReqOptSumScorer and ReqExclScorer.
Boolean logic on BitSets is available in contrib/misc and contrib/queries

DisjunctionSumScorer calls score() on its subscorers before the score value actually needed.
This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as a superclass of DisjunctionSumScorer.

To fully implement non scoring queries a TermDocIdSetIterator will be needed, perhaps as a superclass of TermScorer.

The javadocs in org.apache.lucene.search using matching vs non-zero score:
I'll investigate this soon, and provide a patch when necessary.

An early version of the patches of LUCENE-584 contained a class Matcher,
that differs from the current DocIdSet in that Matcher has an explain() method.
It remains to be seen whether such a Matcher could be useful between
DocIdSet and Scorer.

The semantics of scorer.skipTo(scorer.doc()) was discussed briefly.
This was also discussed at another issue recently, so perhaps it is wortwhile to open a separate issue for this.

Skipping on a SortedVIntList is done using linear search, this could be improved by adding multilevel skiplist info much like in the Lucene index for documents containing a term.

One comment by me of 3 Dec 2008:

A few complete (test) classes are deprecated, it might be good to add the target release for removal there.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Paul Elschot added a comment - 23/Feb/08 10:36 AM
I did something wrong here, I wanted to review the text above before posting it.
I'm sorry about that, I'll just continue here, when it gets too messy, another jira issue can easily be opened.

Paul Elschot added a comment - 23/Feb/08 10:43 AM
Iirc the boolean logic on contrib/queries is defined in two places: ChainedFilter and BooleanFilter. Ideally these could me merged and their functions be implemented by the DocIdSetIterators underlying the current scorers used by BooleanScorer2 (Conjunction/Disjunction/ReqOpt/ReqExcl). See also the comments of Micheal Bush and Eks Dev at the end of Jan 2008 at LUCENE-584.

That's it for the moment. Sorry for the mess, at least it should save others from reviewing all the comments at LUCENE-584. I hope I have not missed anything...


Eks Dev added a comment - 24/Feb/08 08:16 PM
Paul, I think there is one CHEKME in DisjunctionSumScorer I have stumbled upon recently when I realized
(token1+ token2+) query works way faster than (token1 token2).setMinimumSholdMatch(2). It is not directly related to the LUCENE-584, but just as a reminder.

also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer, :

// If first-time skip distance is any predictor of
    // scorer sparseness, then we should always try to skip first on
    // those scorers.
    // Keep last scorer in it's last place (it will be the first
    // to be skipped on), but reverse all of the others so that
    // they will be skipped on in order of original high skip.
    int end=(scorers.length-1)-1;
    for (int i=0; i<(end>>1); i++) {
      Scorer tmp = scorers[i];
      scorers[i] = scorers[end-i];
      scorers[end-i] = tmp;
    }

It has not been detected so far as it has only performance implications (I think?), and it sometimes works and sometimes not, depending on number of scorers:

to see what I am talking about, try this "simulator":

public static void main(String[] args) {
    int[] scorers = new int[7]; //3 and 7 do not work
   
    for (int i=0; i<scorers.length; i++) {
      scorers[i]=i;
    }
   
    System.out.println(Arrays.toString(scorers));
   
   
    int end=(scorers.length-1)-1;
    for (int i=0; i<(end>>1); i++) {
      int tmp = scorers[i];
      scorers[i] = scorers[end-i];
      scorers[end-i] = tmp;
    }

    System.out.println(Arrays.toString(scorers));

  }

for 7 you get:
[0, 1, 2, 3, 4, 5, 6]
[5, 4, 2, 3, 1, 0, 6]

instead of [5, 4, 3, 2, 1, 0, 6]

and for 3
[0, 1, 2]
[0, 1, 2] (should be [1, 0, 2])


Paul Elschot added a comment - 24/Feb/08 08:42 PM
Eks,

As both issues you mention are not related to filters, could you open a new issue for each of them?
For the first issue: iirc BooleanScorer2 will use a ConjunctionScorer in the case when all clauses are actually required in a disjunction, so normal usage via BooleanQuery should not have a performance problem there.
The second issue is beyond me at the moment.

Regards,
Paul Elschot


Yonik Seeley added a comment - 24/Feb/08 10:46 PM

also I think there is a hard_to_detect_small_maybe_performance_bug in ConjuctionScorer

Yup, I introduced that feature + bug. I just committed a fix that makes the code do what the comments say (no correctness implications, just perhaps minor performance).


Paul Elschot added a comment - 26/Feb/08 09:30 PM
Attached javadocsZero2Match.patch that replaces the last few occurrences 'zero scoring' in the java docs of org.apache.lucene.search by 'matching'.

Mark Miller added a comment - 13/Mar/08 10:54 PM
Test that now fails with ChainedFilter.
java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet
	at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55)
	at org.apache.lucene.misc.ChainedFilter.doChain(ChainedFilter.java:258)
	at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:193)
	at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:156)
	at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:141)

Michael Busch added a comment - 14/Mar/08 07:04 AM

Test that now fails with ChainedFilter.

The reason apparently is that the core moved from BitSets to OpenBitSets,
whereas the contrib packages haven't.

If we change the contrib packages to also use OpenBitSets, then this is
still not completely backwards compatible. For example, if a user upgrades
to 2.4, uses a ChainedFilter to combine a 2.4 core filter with their own
custom Filter that is based on 2.3 and thus uses a BitSet, then it won't
work. So a simple drop-in replacement with the new lucene jar would not
be possible, the user would have to change their own filters.

Maybe we should introduce a DocIdSetFactory in the core? For
backwards compatibility a factory that produces BitSets can be used,
for speed one that creates OpenBitSets. Thoughts?


Eks Dev added a comment - 14/Mar/08 09:06 AM
Michael,
I do not think we need to add Factory (for this particular reason), DocIdSet type should not be assumed as we could come up with smart ways to select optimal Filter representation depending on doc-id distribution, size...

The only problem we have with is that contrib classes, ChainedFilter and BooleanFilter assume BitSet.
And the solution for this would be to add just a few methods to the DocIdSet that are able to do AND/OR/NOT on DocIdSet[] using DocIdSetIterator()
e.g.
DocIdSet or(DocIdSet[], int minimumShouldMatch);
DocIdSet or(DocIdSet[]);

Optimized code for these basic operations already exists, can be copied from Conjunction/Disjunction/ReqOpt/ReqExcl Scorer classes by just simply stripping-off scoring part.

with these utility methods in DocIdSet, rewriting ChainedFilter/BooleanFilter to work with DocIdSet (and that works on all implementations of Fileter/DocIdSet) is 10 minutes job... than, if needed this implementation can be optimized to cover type specific cases. Imo, BoolenFilter is better bet, we do not need both of them.

Unfortunately I do not have time to play with it next 3-4 weeks, but should be no more than 2 days work (remember, we have difficult part already done in Scorers). Having so much code duplication is not something really good, but we can then later "merge" these somehow.


Paul Elschot added a comment - 22/Mar/08 05:42 PM
I started adding getDocIdSet() to BooleanFilter of contrib/queries. When trying to collect the interim results into an OpenBitSet I soon needed OpenBitSet.conjunction(DocIdSetIterator), as well as similar disjunction() and exclusion() methods.

Would it be ok to add such methods to Lucene's OpenBitSet, or would it be preferable to subclass OpenBitSet for this? At first sight I prefer subclassing, but I'd like to hear some opinions on this before going further.


Paul Elschot added a comment - 22/Mar/08 09:44 PM
The OpenBitSetDISI-20080322.patch illustrates my previous question.
DISI means DocIdSetIterator, for want of a better name.
The patch compiles, but is untested.

Paul Elschot added a comment - 22/Mar/08 10:00 PM
For the record: the and() method in the patch misses a last statement: clear(index, size());

Paul Elschot added a comment - 25/Mar/08 07:48 PM
BooleanFilter20080325.patch contains a non BitSet version of BooleanFilter. The contrib/queries tests pass, and it includes a finished version of OpenBitSetDISI of a few days ago.
I've also changed the indentation of all of BooleanFilter and added some minor refactorings. This makes the patch itself somewhat less easy to read, but I couldn't leave the indendation in different styles.

Paul Elschot added a comment - 25/Mar/08 09:35 PM
All tests pass, except the one for ChainedFilter provided here by Mark Miller. Stay tuned.

Paul Elschot added a comment - 25/Mar/08 10:31 PM
The Contrib20080325 also includes Mark Miller's test patch, and a non BitSet version of ChainedFilter. It passes the tests contrib/miscellaneous, no time left to run all tests.

Contrib20080325.patch should supersede all patches currently here, except the javadoc patch.

There are no type checks yet to use the inter OpenBitSet boolean operations directly, but at least it should work. Remember to profile before adding such optimisations, for sparse OpenBitSets this could well be competitive. Well, I'd hope so.


Paul Elschot added a comment - 26/Mar/08 06:23 PM
Contrib20080326.patch: supersedes the 20080325 version. Generally the same as yesterday, some extensions:
  • fix a possible synchronisation issue by using a local int[1] array instead of an object int attribute,
  • return a SortedVIntList when it is definitely smaller than an OpenBitSet, the method doing this is protected.
  • all constructors in OpenBitSetDISI now also take a initial size argument
    (still called maxSize, perhaps better renamed to initialSize).

Both ChainedFilter and BooleanFilter should work normally, except perhaps using less memory because of the SortedVIntList.

ChainedFilter still has the 1.1 ASL, it's probably time to upgrade it, but I did not change it in the patch.


Michael Busch added a comment - 26/Mar/08 08:02 PM
Thanks for your patches, Paul. I'll be traveling the next days, but I'll try to look at the patches next week.

Paul Elschot added a comment - 29/Mar/08 08:36 AM
One thing: I added the max size parameter to the OpenBitSetDISI ctor rather late, so there is probably some room to use more of the fast... bit access methods of OpenBitSet.

Paul Elschot added a comment - 27/Apr/08 07:19 PM
Contrib20080427.patch is the same as the previous one from March, except for OpenBitSetDISI: added javadocs there and use fast... bit access methods consistently.

Mark Harwood added a comment - 13/May/08 10:16 PM
Paul,
Good work.
Just tried the patch and ran some pre and post-patch benchmarks.

I wanted to measure the overhead of :
the new OpenBitSetDISI.inPlaceOr(DocIdSetIterator)
vs
the previous scheme of BitSet.or(BitSet).

My test was on the biggest index I have here which was 3 million Wikipedia docs. I had 2 cached TermFilters on very popular terms (500k docs in each) and was measuring the cost of combining these as 2 "shoulds" in a BooleanFilter.
The expectation was the new scheme would add some overhead in extra method calls.

The average cost of iterating across BooleanFilter.getDocIdSet() was:

old BitSet scheme: 78 milliseconds
new DISI scheme: 156 milliseconds.

To address this I tried adding this optimisation into BooleanFilter...

DocIdSet dis = ((Filter)shouldFilters.get).getDocIdSet(reader);
if(dis instanceof OpenBitSet)

{ res.or((OpenBitSet) dis); // go-faster method }

else

{ res.inPlaceOr(getDISI(shouldFilters, i, reader)); //your patch code }

Before I could benchmark this I had to amend TermsFilter to use OpenBitSet rather than plain old BitSet

avg speed of your patch with OpenBitSet-enabled TermFilter : 100 milliseconds
avg speed of your patch with OpenBitSet-enabled TermFilter and above optimisation : 70 milliseconds

I'll try and post a proper patch when I get more time to look at this...

Cheers,
Mark


Paul Elschot added a comment - 13/May/08 10:54 PM
That sounds like the overhead of the DocIdSetIterator is never more than directly using an OpenBitSet in the dense cases that you tested so far (1 in 6 is more than 1 bit per byte).

That means that a DocIdSetIterator should have acceptable performance in the sparse case when it uses a SortedVIntList underneath. Could you share some test results for sparse cases as well? I'd expect it to outperform OpenBitSet even at CPU time when it is sparse enough.


Michael Busch added a comment - 22/May/08 06:30 AM
Good benchmarking, Mark.

I'm actually wondering about the performance of BooleanFilter, when used
with a mix of Filters where some use OpenBitSets and others use BitSets.

This will happen when users upgrade to Lucene 2.4 (core filters use
OpenBitSets now) and keep using their own custom Filters that use
BitSets.

Your patch, Paul, makes this combination possible, and thus guarantees
backwards-compatibility of BooleanFilter and ChainedFilter, which is
great! I'm just wondering if those user might encounter bad performance
surprises?


Paul Elschot added a comment - 22/May/08 07:08 AM
I would not expect bad performance problems mixing OpenBitSet and BitSet for Filters using this patch, although some performance may be lost. However, if a problem surfaces, the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter.
A direct implementation of the bit manipulation operations on an OpenBitSet from a BitSet would probably be faster (taking the DocIdSetIterator out of the loop), but at the moment I see no good reason to implement that.

Michael Busch added a comment - 22/May/08 07:16 AM

the solution is to upgrade the existing Filter from BitSet to OpenBitSet, which amounts to the expected work after deprecation of BitSet in Filter.

I agree.


Michael Busch added a comment - 22/May/08 08:15 AM
The patches look good to me, Paul!

I'm attaching a new file that contains both your patches Contrib20080427.patch
and javadocsZero2Match.patch. I also added the optimizations for OpenBitSets
that Mark suggested to BooleanFilter and ChainedFilter.
And I added the check (disi.doc() < size()) to OpenBitSetDISI.inPlaceOr() and
OpenBitSetDISI.inPlaceXor().

All unit tests pass, however I think they don't cover all code paths now. We
should test both the ChainedFilter and the BooleanFilter on OpenBitSet-Filters
only as well as on combinations of different filters.

Paul, maybe you can help me with adding those tests? Then I would go ahead
and commit this patch soon. Otherwise I probably won't have time before next
week.


Paul Elschot added a comment - 22/May/08 03:53 PM
With the size tests added in OpenBitSetDISI, the javadocs of the changed methods could also be relaxed.

The filter tests in the contrib modules misc and queries look fairly complete to me in their current state. Did I overlook anything there? I don't have a coverage test tool here.


Michael Busch added a comment - 22/May/08 10:57 PM
I added a test helper class called OldBitSetFilterWrapper that helps to
test compatibility with old filters based on BitSets.

I changed ChainedFilterTest and BooleanFilterTest to run all tests on
new (OpenBitSet) and old (BitSet) filters to test both different code paths
that we have for OpenBitSet and DocIdSetIterator.

I verified with a code coverage tool that now all those paths are covered
and all tests pass.

Would be nice if you could quickly review the patch, Paul. If you're ok
with it then I'll commit it tomorrow.


Paul Elschot added a comment - 23/May/08 07:33 AM
Thanks for the backward compatibility additions in the tests.

One nice little detail: the patch contains this in OldBitSetFilterWrapper:

+      BitSet bits = new BitSet(reader.maxDoc());
+      DocIdSetIterator it = filter.getDocIdSet(reader).iterator();
...

but I expected:

BitSet bits = filter.bits(reader); // use deprecated method

On old filters both will versions will work, but the alternative makes it explicit that the filter must be an 'old' one.
Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0.


Michael Busch added a comment - 23/May/08 07:44 AM

Using the deprecated method would have the advantage that it (the whole wrapper class in fact) would have to be removed in 3.0.

Thanks for reviewing! You're right, I will change it to use the deprecated method and also deprecate the wrapper class itself.


Michael Busch added a comment - 23/May/08 07:27 PM
Committed with mentioned changes to OldBitSetFilterWrapper.

Thanks Paul!


Paul Elschot added a comment - 23/May/08 07:45 PM
I missed TermsFilter initially, so I had another look there. It could use sth like this:
/** Provide a SortedVIntList when it is definitely smaller than an OpenBitSet */
   protected DocIdSet finalResult(OpenBitSetDISI result, int maxDocs) {
       return (result.cardinality() < (maxDocs / 9))
             ? (DocIdSet) new SortedVIntList(result)
             : (DocIdSet) result;
   }

But that would leave three copies this finalResult method in the patch, which is just beyond my refactoring tolerance level.
Perhaps this method could move to a static method in the o.a.l.search.DocIdSet class under a better name, sth like defaultDocIdSet(), or into a new helper class o.a.l.util.DefaultDocIdSet, to prepare for the availability of better implementations in the future.


Paul Elschot added a comment - 23/May/08 07:47 PM
And in that case, the first argument could also be changed from OpenBitSetDISI to OpenBitSet.

Michael Busch added a comment - 23/May/08 08:11 PM
Do we actually know about the performance of SortedVIntList?

I'm a little worried, because it doesn't have a skip list.


Paul Elschot added a comment - 23/May/08 08:15 PM
OpenBitSet does not have a skip list either, so I'd expect SortedVIntList to be faster when the underlying set is sparse enough.

As I missed the commit, I'll provide a patch for my latest comments in the next few days.


Michael Busch added a comment - 23/May/08 08:29 PM

I'll provide a patch for my latest comments in the next few days

Sounds good!


Paul Elschot added a comment - 25/May/08 06:56 PM
While considering DefaultDocIdSet as a class, I thought that perhaps a better way would be to add a method to class Filter that takes the usual DocIdSet and provides the DocIdSet that should be used for caching, for example in CachingWrapperFilter. Sth like this:
public class Filter {
  ... the abstract bits() deprecated method ... ;

  public DocIdSet getDocIdSet(IndexReader reader) {
    // unchanged implementation for now. to become abstract later.
  }

  public DocIdSet getDocIdSetForCache(IndexReader reader) {
    // Use a default implementation here that provides a tradeoff for caching,
    // fairly compact when possible, but still fast.
    // For the moment this could be close to the code of the finalResult() method
    // mentioned above:
    DocIdSet result = getDocIdSet(reader);
   
    if (!(result instanceof SortedVIntList)) 
                and (result.cardinality() < (reader.maxDoc() / 9))) {
      return  new SortedVIntList(result);
    }
    return result;
  }

(One minor problem with this is that DocIdSet does not have a cardinality() method
and that SortedVIntList does not have a constructor for a DocIdSet.)

The question is: how about adding such a getDocIdSetForCache() method to Filter?
Or is there a better place for this functionality, for example in CachingWrapperFilter?


Paul Elschot added a comment - 29/May/08 03:22 PM
As this has had some time so settle, I think the cache should decide what it wants to store.
That means I'm in favour of changing CachingWrapperFilter to let it decide which DocIdSet implementation to cache, sth. like this:
protected DocIdSet docIdSetToCache(DocIdSet dis) { ... }

where dis is the result of getDocIdSet(reader) on the wrapped Filter.
At the same time the protected finalResult() methods in the contrib code could be removed.

Comments?


Michael Busch added a comment - 29/May/08 04:45 PM
Sounds good to me Paul.

Could you open a separate issue and attach a patch?


Paul Elschot added a comment - 29/May/08 08:11 PM
After the commit here, I'm opening a new issue for filter caching.

Paul Elschot added a comment - 03/Jun/08 07:08 PM
Just to be complete, the new issue for filter caching is LUCENE-1296