Issue Details (XML | Word | Printable)

Key: LUCENE-328
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Duplicate
Priority: Minor Minor
Assignee: Unassigned
Reporter: Paul Elschot
Votes: 5
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Some utilities for a compact sparse filter

Created: 04/Jan/05 12:38 AM   Updated: 28/Jun/06 01:38 AM
Return to search
Component/s: Search
Affects Version/s: CVS Nightly - Specify date in submission
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File AndDocNrSkipper.java 2005-06-21 07:52 PM Mark Harwood 2 kB
Java Source File AndDocNrSkipper.java 2005-06-21 06:21 PM Mark Harwood 2 kB
Java Source File BitSetSortedIntList.java 2005-06-21 06:20 PM Mark Harwood 1 kB
Java Source File DocNrSkipper.java 2005-02-09 04:43 AM Paul Elschot 1 kB
Java Source File DocNrSkipper.java 2005-01-17 02:00 AM Paul Elschot 1 kB
Java Source File Licensed for inclusion in ASF works IntArraySortedIntList.java 2005-11-22 07:07 AM Mark Harwood 3 kB
Java Source File IntArraySortedIntList.java 2005-06-21 06:19 PM Mark Harwood 3 kB
Java Source File OrDocNrSkipper.java 2005-06-21 07:53 PM Mark Harwood 2 kB
Java Source File OrDocNrSkipper.java 2005-06-21 06:20 PM Mark Harwood 2 kB
Text File Licensed for inclusion in ASF works SkipFilter1.patch 2006-05-15 11:17 PM Paul Elschot 4 kB
Java Source File SortedVIntList.java 2005-02-09 04:44 AM Paul Elschot 4 kB
Java Source File SortedVIntList.java 2005-01-17 02:04 AM Paul Elschot 4 kB
Java Source File SortedVIntList.java 2005-01-04 12:40 AM Paul Elschot 4 kB
Java Source File TestDocNrSkippers.java 2005-06-21 07:53 PM Mark Harwood 6 kB
Java Source File TestDocNrSkippers.java 2005-06-21 06:21 PM Mark Harwood 6 kB
Java Source File TestSortedVIntList.java 2005-02-09 04:46 AM Paul Elschot 4 kB
Java Source File TestSortedVIntList.java 2005-01-17 02:06 AM Paul Elschot 4 kB
Java Source File TestSortedVIntList.java 2005-01-06 03:21 AM Paul Elschot 4 kB
Environment:
Operating System: other
Platform: Other
Issue Links:
Reference
 

Bugzilla Id: 32921
Resolution Date: 28/Jun/06 01:38 AM


 Description  « Hide
Two files are attached that might form the basis for an alternative
filter implementation that is more memory efficient than one bit
per doc when less than about 1/8 of the docs pass through the filter.

The document numbers are stored in RAM as VInt's from the Lucene index
format. These VInt's encode the difference between two successive
document numbers, much like a PositionDelta in the Positions:
http://jakarta.apache.org/lucene/docs/fileformats.html

The getByteSize() method can be used to verify the compression
once a SortedVIntList is constructed.
The precise conditions under which this is more memory efficient than
one bit per document are not easy to specify in advance.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Paul Elschot added a comment - 04/Jan/05 12:40 AM
Created an attachment (id=13878)
SortedVIntList.java

Paul Elschot added a comment - 04/Jan/05 12:41 AM
Created an attachment (id=13879)
intIterator.java

Mark Harwood added a comment - 05/Jan/05 07:25 PM
(From update of attachment 13878)
public SortedVIntList(BitSet bits) loops indefinitely. A "+1" is needed on the
paramater to the last "nextSetBit" call.

Mark Harwood


Paul Elschot added a comment - 05/Jan/05 10:34 PM
Mark,

Thanks. I did not test BitSet constructor.
I'll finish the test code and add that and
your correction later.

Paul Elschot


Paul Elschot added a comment - 06/Jan/05 03:21 AM
Created an attachment (id=13901)
JUnit test case for SortedVIntList

The only change needed to SortedVIntList to pass the tests
is the "+ 1" in the BitSet constructor as Mark noted:

public SortedVIntList(BitSet bits) {
initBytes();

int lastInt = 0;
int nextInt = bits.nextSetBit(lastInt);
while (nextInt != -1) { add(nextInt, lastInt); lastInt = nextInt; nextInt = bits.nextSetBit(lastInt + 1); // correction here }

resizeBytes(lastBytePos);
}

so I'm not attaching the corrected version.

The test code suppresses the test BitSet test for
some special cases: too large integers that cause
too much memory used in the BitSet, and duplicates
in the list of integers that cannot be represented
in a BitSet.

I'll try and not post untested code again

Regards, and thanks again Mark,
Paul Elschot


Paul Elschot added a comment - 17/Jan/05 02:00 AM
Created an attachment (id=14020)
DocNrSkipper.java to replace intIterator.java

Implementors of this interface can be used for SkipFilter


Paul Elschot added a comment - 17/Jan/05 02:04 AM
Created an attachment (id=14021)
SortedVIntList.java providing a DocNrSkipper instead of an intIterator

Basic data structure of differences stored as VInts unchanged.


Paul Elschot added a comment - 17/Jan/05 02:06 AM
Created an attachment (id=14022)
TestSortedVIntList.java, adapted for DocNrSkipper

Paul Elschot added a comment - 09/Feb/05 04:43 AM
Created an attachment (id=14210)
Assigned copyright to ASF

Paul Elschot added a comment - 09/Feb/05 04:44 AM
Created an attachment (id=14211)
Assigned copyright to ASF

Paul Elschot added a comment - 09/Feb/05 04:46 AM
Created an attachment (id=14213)
TestSortedVIntList.java, assigned copyright to ASF

Mark Harwood added a comment - 21/Jun/05 01:51 AM
Paul,
What was the status with this code - its been a while since you posted it.
I've got some extensions which could be added here:
  • BitSet and IntArray based implementations of DocNrSkipper
  • OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take 2
    DocNrSkippers in their constructor and offer "OR" or "AND" based iteration
    across the sets' intersections. These are DocNrSkipper implementations too so
    can be arbitrarily nested within each other (like BooleanQuery does).
    Cheers
    Mark

Paul Elschot added a comment - 21/Jun/05 05:14 AM
(In reply to comment #12)
> Paul,
> What was the status with this code - its been a while since you posted it.

It's working nicely for me.

> I've got some extensions which could be added here:
> * BitSet and IntArray based implementations of DocNrSkipper

These would be useful because they have different tradeoffs in memory use
and runtime.

> * OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take
2
> DocNrSkippers in their constructor and offer "OR" or "AND" based iteration
> across the sets' intersections. These are DocNrSkipper implementations too so
> can be arbitrarily nested within each other (like BooleanQuery does).

In case you prefer DocNoSkipper or something else I don't mind.
Using both DocNo and DocNr in names will cause confusion, though.

The OrDocNoSkipper and AndDocNoSkipper would be good to implement
constant scoring queries. The top level OR iterator could then be
implemented by iterating on the DocNrSkippers separately into
a BitSet.
Did you also consider more than two arguments to these AND and OR
iterators?
For completeness a "NOT" based iteration
would also be needed, but that can be added when needed.

I'm interested in extending QueryParser to use these
mechanisms on 'boolean' fields like dates and primary keys.

To easily implement boolean operations with a constant scoring term,
a "close relative" of TermScorer might directly implement the
Doc{No/Nr}Skipper interface. Any opinions on a:

class TermConstantScorer
extends Scorer
implements DocNrSkipper {
...
}

?

It might also be good to put all new things related
to skipping filters and constant scoring queries
in (a) new subpackage(s) of org.apache.lucene.search.

Regards,
Paul Elschot


Mark Harwood added a comment - 21/Jun/05 06:19 PM
Created an attachment (id=15483)
IntArraySortedIntList

Mark Harwood added a comment - 21/Jun/05 06:20 PM
Created an attachment (id=15484)
BitSetSortedIntList

Mark Harwood added a comment - 21/Jun/05 06:20 PM
Created an attachment (id=15485)
OrDocNrSkipper

Mark Harwood added a comment - 21/Jun/05 06:21 PM
Created an attachment (id=15486)
AndDocNrSkipper

Mark Harwood added a comment - 21/Jun/05 06:21 PM
Created an attachment (id=15487)
TestDocNrSkippers

Mark Harwood added a comment - 21/Jun/05 06:41 PM
>>Did you also consider more than two arguments to these AND and OR
iterators?

The ability to nest one OrDocNrSkipper in another is effectively a way of
achieving the same result. Maybe a static helper method could be useful to
construct these nested objects, given an array of DocNrSkippers that need to be
Or-ed together.
Thinking about it though, there is an advantage to a "flat" arrangement rather
than a "nested" approach in the AND iterator. In a flat scheme the skipTo calls
will always be optimized to jumping to the largest docNr in the whole group
whereas a nested arrangement may make multiple skip calls within each nested
pair to achieve the same result. I'll look at redoing the AndDocNrSkipper with
this in mind.

I'm using these classes for things other than query scoring at the moment so I
don't have an immediate answer to your questions on their possible application
in scoring - I'll have to have a dig around the scoring code some more.


Mark Harwood added a comment - 21/Jun/05 07:52 PM
Created an attachment (id=15488)
AndDocNrSkipper with support for iterating multiple DocNrSkippers

Mark Harwood added a comment - 21/Jun/05 07:53 PM
Created an attachment (id=15489)
OrDocNrSkipper with support for iterating multiple DocNrSkippers

Mark Harwood added a comment - 21/Jun/05 07:53 PM
Created an attachment (id=15490)
TestDocNrSkippers

Jeff Turner made changes - 03/Sep/05 03:28 PM
Field Original Value New Value
issue.field.bugzillaimportkey 32921 12314478
Yonik Seeley added a comment - 23/Sep/05 03:47 AM
How about adding a next() or nextDocNr() to DocNrSkipper that doesn't take the current id as a parameter? It would allow more efficient implementations of DocNrSkipper when you want to simply iterate over all the docs.

It would also make some code using it a little cleaner.


Paul Elschot added a comment - 16/Oct/05 12:22 AM
About adding a nextDocNr() without current doc argument to DocNrSkipper:
I considered that but left it out initially for code simplicity in DocNrSkipper implementations.

It's much the same as with Scorer.next() and Scorer.skipTo(docNr), so it would fit
in the environment of Lucene to add nextDocNr() without argument to DocNrSkipper.
In case someone has a real performance advantage of such an addition, it would
be worthwhile to have.

Regards,
Paul Elschot


Yonik Seeley added a comment - 16/Oct/05 12:30 AM
I've been working a little on a faster version of BitSet. That's one place where a stateful iterator implementing nextDocNr() can be faster than nextDocNr(docNr) , so I would like to see the nextDocNr() added.

Mark Harwood added a comment - 22/Nov/05 07:07 AM
Removed unused code

Mark Harwood made changes - 22/Nov/05 07:07 AM
Attachment IntArraySortedIntList.java [ 12320844 ]
Eks Dev added a comment - 30/Dec/05 12:10 AM
I've been looking at this code and found some minor enhancements that could be done:

1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing.

2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that?

3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here?

Must say, excelent work!
A lot of use cases related to Filtering and non-scoring queries can be done efficiently with this


Paul Elschot added a comment - 02/Jan/06 05:33 AM
> 1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing.

The object returned by the getDocNrSkipper() method implements the interface by adding a bit of state
for the iteration over the document numbers. This allows more than one iterator on the (non modifiable)
SortedVIntList.

> 2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that?

I have no problem with the addition of throwing an IOException to the DocNrSkipper interface.
The idea is to filter query results from RAM from which one would not normally expect
an IOException , so one could also consider rethrowing the IOException wrapped in an Error.
OTOH, the ability to obtain a DocNrSkipper directly from an index is certainly appealing,
and then IOException is unavoidable.

> 3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here?

The presence of class BitSet in the bits() method of Filter
makes it impossible to provide another implementation of a Filter.
SkipFilter/DocNrSkipper are intended to be parallel to Filter/BitSet,
and the DocNrSkipper interface allows alternative implementations.
Both SkipFilter and Filter are interfaces for factories/caches of for filter data structures.

I'd like to somehow have these parallel paths merged, but I don't now how to
do that. Perhaps SkipFilter could allow backward compatibility by also
providing a bits() method, and use that method when it does not throw for
example an UnsupportedOperationException.
Another way would be to deprecate Filter in favour of SkipFilter, but that would
have a lot of backward compatibility issues, and perhaps also some
performance issues.
The FilteredQuery of LUCENE-330 allows for both paths to be used,
both paths are joined at line 211 in this FilteredQuery.

The IterFilter of LUCENE-330 was replaced by SkipFilter, I forgot
to indicate that when I downloaded the replacement. I have just deleted
IterFilter there.

> Must say, excelent work!

Thanks. I should add that most of the hard work had already been done in
org.apache.lucene.store.InputStream.readVInt() and
org.apache.lucene.store.OutputStream.writeVInt().

Regards,
Paul Elschot


Paul Elschot made changes - 03/Jan/06 07:14 AM
Attachment intIterator.java [ 12312473 ]
Paul Elschot added a comment - 15/May/06 11:17 PM
This patches Filter.java and IndexSearcher.java .

Filter.java is modified to implement SkipFilter, to provide
a first step in a backward compatible way to slowly make Filter
independent of BitSet.
IndexSearcher.java is modified to test for a DocNrSkipper from
a given Filter, and to use that. In that case also skipTo() is used
on the scorer of the query being filtered.
This patch requires org.apache.lucene.util.DocNrSkipper, which
available at this issue.
Also required is org.apache.lucene.search.SkipFilter, which
is available at LUCENE-330.

The patch also contains some commented test code for Filter.java.
This test code always provides a DocNrSkipper (from the BitSet).
With and without this test code, all tests pass here.

When extending Filter in this way, SkipFilter may not be necessary
at all. I left it in to allow a path forward to complete independence
from BitSet.
In case SkipFilter stays, it would be good to add (a) new method(s)
to IndexSearcher allowing a SkipFilter to filter a query.

The DocNrSkipper interface contains only one method:
nextDocNr(int docNr).
It may be good for performance to also add a nextDocNr()
method without any argument, much like skipTo(target) and next()
on Scorer. IOW, I do not consider DocNrSkipper stable at this moment.

I don't think this patch should be added to release 2.0.


Paul Elschot made changes - 15/May/06 11:17 PM
Attachment SkipFilter1.patch [ 12331912 ]
Paul Elschot added a comment - 16/May/06 02:28 AM
Starting from SkipFilter1.patch as above, a replacement of Filter by SkipFilter in the various API's
(Searcher, Searchable and implementors) is straightforward. The only thing further needed is a
checked cast to Filter in IndexSearcher.search(weight, filter, hitcollector) for the case when
the DocNrSkipper is null. (When that cast to Filter fails an IllegalArgumentException can be thrown).
After that, all tests pass again, with and without the test code in Filter to always use a DocNrSkipper.

That means that it is easier than expected to replace Filter by SkipFilter altogether.


Paul Elschot made changes - 27/Jun/06 05:09 AM
Link This issue is related to LUCENE-584 [ LUCENE-584 ]
Paul Elschot added a comment - 28/Jun/06 12:59 AM
This issue is outdated now. The related LUCENE-584 has a new implementation.

Otis Gospodnetic added a comment - 28/Jun/06 01:38 AM
Moved to LUCENE-584.
Those of you who voted for this (5 votes!) may want to vote for LUCENE-584 now.

Otis Gospodnetic made changes - 28/Jun/06 01:38 AM
Resolution Duplicate [ 3 ]
Assignee Lucene Developers [ java-dev@lucene.apache.org ]
Status Open [ 1 ] Resolved [ 5 ]