Lucene - Core
  1. Lucene - Core
  2. LUCENE-328

Some utilities for a compact sparse filter

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: other
      Platform: Other

      Description

      Two files are attached that might form the basis for an alternative
      filter implementation that is more memory efficient than one bit
      per doc when less than about 1/8 of the docs pass through the filter.

      The document numbers are stored in RAM as VInt's from the Lucene index
      format. These VInt's encode the difference between two successive
      document numbers, much like a PositionDelta in the Positions:
      http://jakarta.apache.org/lucene/docs/fileformats.html

      The getByteSize() method can be used to verify the compression
      once a SortedVIntList is constructed.
      The precise conditions under which this is more memory efficient than
      one bit per document are not easy to specify in advance.

      1. ASF.LICENSE.NOT.GRANTED--AndDocNrSkipper.java
        2 kB
        Mark Harwood
      2. ASF.LICENSE.NOT.GRANTED--AndDocNrSkipper.java
        2 kB
        Mark Harwood
      3. ASF.LICENSE.NOT.GRANTED--BitSetSortedIntList.java
        1 kB
        Mark Harwood
      4. ASF.LICENSE.NOT.GRANTED--DocNrSkipper.java
        1 kB
        Paul Elschot
      5. ASF.LICENSE.NOT.GRANTED--DocNrSkipper.java
        1 kB
        Paul Elschot
      6. ASF.LICENSE.NOT.GRANTED--IntArraySortedIntList.java
        3 kB
        Mark Harwood
      7. ASF.LICENSE.NOT.GRANTED--OrDocNrSkipper.java
        2 kB
        Mark Harwood
      8. ASF.LICENSE.NOT.GRANTED--OrDocNrSkipper.java
        2 kB
        Mark Harwood
      9. ASF.LICENSE.NOT.GRANTED--SortedVIntList.java
        4 kB
        Paul Elschot
      10. ASF.LICENSE.NOT.GRANTED--SortedVIntList.java
        4 kB
        Paul Elschot
      11. ASF.LICENSE.NOT.GRANTED--SortedVIntList.java
        4 kB
        Paul Elschot
      12. ASF.LICENSE.NOT.GRANTED--TestDocNrSkippers.java
        6 kB
        Mark Harwood
      13. ASF.LICENSE.NOT.GRANTED--TestDocNrSkippers.java
        6 kB
        Mark Harwood
      14. ASF.LICENSE.NOT.GRANTED--TestSortedVIntList.java
        4 kB
        Paul Elschot
      15. ASF.LICENSE.NOT.GRANTED--TestSortedVIntList.java
        4 kB
        Paul Elschot
      16. ASF.LICENSE.NOT.GRANTED--TestSortedVIntList.java
        4 kB
        Paul Elschot
      17. IntArraySortedIntList.java
        3 kB
        Mark Harwood
      18. SkipFilter1.patch
        4 kB
        Paul Elschot

        Issue Links

          Activity

          Hide
          Paul Elschot added a comment -

          Created an attachment (id=13878)
          SortedVIntList.java

          Show
          Paul Elschot added a comment - Created an attachment (id=13878) SortedVIntList.java
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=13879)
          intIterator.java

          Show
          Paul Elschot added a comment - Created an attachment (id=13879) intIterator.java
          Hide
          Mark Harwood added a comment -

          (From update of attachment 13878)
          public SortedVIntList(BitSet bits) loops indefinitely. A "+1" is needed on the
          paramater to the last "nextSetBit" call.

          Mark Harwood

          Show
          Mark Harwood added a comment - (From update of attachment 13878) public SortedVIntList(BitSet bits) loops indefinitely. A "+1" is needed on the paramater to the last "nextSetBit" call. Mark Harwood
          Hide
          Paul Elschot added a comment -

          Mark,

          Thanks. I did not test BitSet constructor.
          I'll finish the test code and add that and
          your correction later.

          Paul Elschot

          Show
          Paul Elschot added a comment - Mark, Thanks. I did not test BitSet constructor. I'll finish the test code and add that and your correction later. Paul Elschot
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=13901)
          JUnit test case for SortedVIntList

          The only change needed to SortedVIntList to pass the tests
          is the "+ 1" in the BitSet constructor as Mark noted:

          public SortedVIntList(BitSet bits) {
          initBytes();

          int lastInt = 0;
          int nextInt = bits.nextSetBit(lastInt);
          while (nextInt != -1)

          { add(nextInt, lastInt); lastInt = nextInt; nextInt = bits.nextSetBit(lastInt + 1); // correction here }

          resizeBytes(lastBytePos);
          }

          so I'm not attaching the corrected version.

          The test code suppresses the test BitSet test for
          some special cases: too large integers that cause
          too much memory used in the BitSet, and duplicates
          in the list of integers that cannot be represented
          in a BitSet.

          I'll try and not post untested code again

          Regards, and thanks again Mark,
          Paul Elschot

          Show
          Paul Elschot added a comment - Created an attachment (id=13901) JUnit test case for SortedVIntList The only change needed to SortedVIntList to pass the tests is the "+ 1" in the BitSet constructor as Mark noted: public SortedVIntList(BitSet bits) { initBytes(); int lastInt = 0; int nextInt = bits.nextSetBit(lastInt); while (nextInt != -1) { add(nextInt, lastInt); lastInt = nextInt; nextInt = bits.nextSetBit(lastInt + 1); // correction here } resizeBytes(lastBytePos); } so I'm not attaching the corrected version. The test code suppresses the test BitSet test for some special cases: too large integers that cause too much memory used in the BitSet, and duplicates in the list of integers that cannot be represented in a BitSet. I'll try and not post untested code again Regards, and thanks again Mark, Paul Elschot
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14020)
          DocNrSkipper.java to replace intIterator.java

          Implementors of this interface can be used for SkipFilter

          Show
          Paul Elschot added a comment - Created an attachment (id=14020) DocNrSkipper.java to replace intIterator.java Implementors of this interface can be used for SkipFilter
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14021)
          SortedVIntList.java providing a DocNrSkipper instead of an intIterator

          Basic data structure of differences stored as VInts unchanged.

          Show
          Paul Elschot added a comment - Created an attachment (id=14021) SortedVIntList.java providing a DocNrSkipper instead of an intIterator Basic data structure of differences stored as VInts unchanged.
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14022)
          TestSortedVIntList.java, adapted for DocNrSkipper

          Show
          Paul Elschot added a comment - Created an attachment (id=14022) TestSortedVIntList.java, adapted for DocNrSkipper
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14210)
          Assigned copyright to ASF

          Show
          Paul Elschot added a comment - Created an attachment (id=14210) Assigned copyright to ASF
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14211)
          Assigned copyright to ASF

          Show
          Paul Elschot added a comment - Created an attachment (id=14211) Assigned copyright to ASF
          Hide
          Paul Elschot added a comment -

          Created an attachment (id=14213)
          TestSortedVIntList.java, assigned copyright to ASF

          Show
          Paul Elschot added a comment - Created an attachment (id=14213) TestSortedVIntList.java, assigned copyright to ASF
          Hide
          Mark Harwood added a comment -

          Paul,
          What was the status with this code - its been a while since you posted it.
          I've got some extensions which could be added here:

          • BitSet and IntArray based implementations of DocNrSkipper
          • OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take 2
            DocNrSkippers in their constructor and offer "OR" or "AND" based iteration
            across the sets' intersections. These are DocNrSkipper implementations too so
            can be arbitrarily nested within each other (like BooleanQuery does).
            Cheers
            Mark
          Show
          Mark Harwood added a comment - Paul, What was the status with this code - its been a while since you posted it. I've got some extensions which could be added here: BitSet and IntArray based implementations of DocNrSkipper OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take 2 DocNrSkippers in their constructor and offer "OR" or "AND" based iteration across the sets' intersections. These are DocNrSkipper implementations too so can be arbitrarily nested within each other (like BooleanQuery does). Cheers Mark
          Hide
          Paul Elschot added a comment -

          (In reply to comment #12)
          > Paul,
          > What was the status with this code - its been a while since you posted it.

          It's working nicely for me.

          > I've got some extensions which could be added here:
          > * BitSet and IntArray based implementations of DocNrSkipper

          These would be useful because they have different tradeoffs in memory use
          and runtime.

          > * OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take
          2
          > DocNrSkippers in their constructor and offer "OR" or "AND" based iteration
          > across the sets' intersections. These are DocNrSkipper implementations too so
          > can be arbitrarily nested within each other (like BooleanQuery does).

          In case you prefer DocNoSkipper or something else I don't mind.
          Using both DocNo and DocNr in names will cause confusion, though.

          The OrDocNoSkipper and AndDocNoSkipper would be good to implement
          constant scoring queries. The top level OR iterator could then be
          implemented by iterating on the DocNrSkippers separately into
          a BitSet.
          Did you also consider more than two arguments to these AND and OR
          iterators?
          For completeness a "NOT" based iteration
          would also be needed, but that can be added when needed.

          I'm interested in extending QueryParser to use these
          mechanisms on 'boolean' fields like dates and primary keys.

          To easily implement boolean operations with a constant scoring term,
          a "close relative" of TermScorer might directly implement the
          Doc

          {No/Nr}

          Skipper interface. Any opinions on a:

          class TermConstantScorer
          extends Scorer
          implements DocNrSkipper

          { ... }

          ?

          It might also be good to put all new things related
          to skipping filters and constant scoring queries
          in (a) new subpackage(s) of org.apache.lucene.search.

          Regards,
          Paul Elschot

          Show
          Paul Elschot added a comment - (In reply to comment #12) > Paul, > What was the status with this code - its been a while since you posted it. It's working nicely for me. > I've got some extensions which could be added here: > * BitSet and IntArray based implementations of DocNrSkipper These would be useful because they have different tradeoffs in memory use and runtime. > * OrDocNoSkipper and AndDocNoSkipper wrapper implementations which both take 2 > DocNrSkippers in their constructor and offer "OR" or "AND" based iteration > across the sets' intersections. These are DocNrSkipper implementations too so > can be arbitrarily nested within each other (like BooleanQuery does). In case you prefer DocNoSkipper or something else I don't mind. Using both DocNo and DocNr in names will cause confusion, though. The OrDocNoSkipper and AndDocNoSkipper would be good to implement constant scoring queries. The top level OR iterator could then be implemented by iterating on the DocNrSkippers separately into a BitSet. Did you also consider more than two arguments to these AND and OR iterators? For completeness a "NOT" based iteration would also be needed, but that can be added when needed. I'm interested in extending QueryParser to use these mechanisms on 'boolean' fields like dates and primary keys. To easily implement boolean operations with a constant scoring term, a "close relative" of TermScorer might directly implement the Doc {No/Nr} Skipper interface. Any opinions on a: class TermConstantScorer extends Scorer implements DocNrSkipper { ... } ? It might also be good to put all new things related to skipping filters and constant scoring queries in (a) new subpackage(s) of org.apache.lucene.search. Regards, Paul Elschot
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15483)
          IntArraySortedIntList

          Show
          Mark Harwood added a comment - Created an attachment (id=15483) IntArraySortedIntList
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15484)
          BitSetSortedIntList

          Show
          Mark Harwood added a comment - Created an attachment (id=15484) BitSetSortedIntList
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15485)
          OrDocNrSkipper

          Show
          Mark Harwood added a comment - Created an attachment (id=15485) OrDocNrSkipper
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15486)
          AndDocNrSkipper

          Show
          Mark Harwood added a comment - Created an attachment (id=15486) AndDocNrSkipper
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15487)
          TestDocNrSkippers

          Show
          Mark Harwood added a comment - Created an attachment (id=15487) TestDocNrSkippers
          Hide
          Mark Harwood added a comment -

          >>Did you also consider more than two arguments to these AND and OR
          iterators?

          The ability to nest one OrDocNrSkipper in another is effectively a way of
          achieving the same result. Maybe a static helper method could be useful to
          construct these nested objects, given an array of DocNrSkippers that need to be
          Or-ed together.
          Thinking about it though, there is an advantage to a "flat" arrangement rather
          than a "nested" approach in the AND iterator. In a flat scheme the skipTo calls
          will always be optimized to jumping to the largest docNr in the whole group
          whereas a nested arrangement may make multiple skip calls within each nested
          pair to achieve the same result. I'll look at redoing the AndDocNrSkipper with
          this in mind.

          I'm using these classes for things other than query scoring at the moment so I
          don't have an immediate answer to your questions on their possible application
          in scoring - I'll have to have a dig around the scoring code some more.

          Show
          Mark Harwood added a comment - >>Did you also consider more than two arguments to these AND and OR iterators? The ability to nest one OrDocNrSkipper in another is effectively a way of achieving the same result. Maybe a static helper method could be useful to construct these nested objects, given an array of DocNrSkippers that need to be Or-ed together. Thinking about it though, there is an advantage to a "flat" arrangement rather than a "nested" approach in the AND iterator. In a flat scheme the skipTo calls will always be optimized to jumping to the largest docNr in the whole group whereas a nested arrangement may make multiple skip calls within each nested pair to achieve the same result. I'll look at redoing the AndDocNrSkipper with this in mind. I'm using these classes for things other than query scoring at the moment so I don't have an immediate answer to your questions on their possible application in scoring - I'll have to have a dig around the scoring code some more.
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15488)
          AndDocNrSkipper with support for iterating multiple DocNrSkippers

          Show
          Mark Harwood added a comment - Created an attachment (id=15488) AndDocNrSkipper with support for iterating multiple DocNrSkippers
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15489)
          OrDocNrSkipper with support for iterating multiple DocNrSkippers

          Show
          Mark Harwood added a comment - Created an attachment (id=15489) OrDocNrSkipper with support for iterating multiple DocNrSkippers
          Hide
          Mark Harwood added a comment -

          Created an attachment (id=15490)
          TestDocNrSkippers

          Show
          Mark Harwood added a comment - Created an attachment (id=15490) TestDocNrSkippers
          Hide
          Yonik Seeley added a comment -

          How about adding a next() or nextDocNr() to DocNrSkipper that doesn't take the current id as a parameter? It would allow more efficient implementations of DocNrSkipper when you want to simply iterate over all the docs.

          It would also make some code using it a little cleaner.

          Show
          Yonik Seeley added a comment - How about adding a next() or nextDocNr() to DocNrSkipper that doesn't take the current id as a parameter? It would allow more efficient implementations of DocNrSkipper when you want to simply iterate over all the docs. It would also make some code using it a little cleaner.
          Hide
          Paul Elschot added a comment -

          About adding a nextDocNr() without current doc argument to DocNrSkipper:
          I considered that but left it out initially for code simplicity in DocNrSkipper implementations.

          It's much the same as with Scorer.next() and Scorer.skipTo(docNr), so it would fit
          in the environment of Lucene to add nextDocNr() without argument to DocNrSkipper.
          In case someone has a real performance advantage of such an addition, it would
          be worthwhile to have.

          Regards,
          Paul Elschot

          Show
          Paul Elschot added a comment - About adding a nextDocNr() without current doc argument to DocNrSkipper: I considered that but left it out initially for code simplicity in DocNrSkipper implementations. It's much the same as with Scorer.next() and Scorer.skipTo(docNr), so it would fit in the environment of Lucene to add nextDocNr() without argument to DocNrSkipper. In case someone has a real performance advantage of such an addition, it would be worthwhile to have. Regards, Paul Elschot
          Hide
          Yonik Seeley added a comment -

          I've been working a little on a faster version of BitSet. That's one place where a stateful iterator implementing nextDocNr() can be faster than nextDocNr(docNr) , so I would like to see the nextDocNr() added.

          Show
          Yonik Seeley added a comment - I've been working a little on a faster version of BitSet. That's one place where a stateful iterator implementing nextDocNr() can be faster than nextDocNr(docNr) , so I would like to see the nextDocNr() added.
          Hide
          Mark Harwood added a comment -

          Removed unused code

          Show
          Mark Harwood added a comment - Removed unused code
          Hide
          Eks Dev added a comment -

          I've been looking at this code and found some minor enhancements that could be done:

          1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing.

          2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that?

          3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here?

          Must say, excelent work!
          A lot of use cases related to Filtering and non-scoring queries can be done efficiently with this

          Show
          Eks Dev added a comment - I've been looking at this code and found some minor enhancements that could be done: 1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing. 2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that? 3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here? Must say, excelent work! A lot of use cases related to Filtering and non-scoring queries can be done efficiently with this
          Hide
          Paul Elschot added a comment -

          > 1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing.

          The object returned by the getDocNrSkipper() method implements the interface by adding a bit of state
          for the iteration over the document numbers. This allows more than one iterator on the (non modifiable)
          SortedVIntList.

          > 2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that?

          I have no problem with the addition of throwing an IOException to the DocNrSkipper interface.
          The idea is to filter query results from RAM from which one would not normally expect
          an IOException , so one could also consider rethrowing the IOException wrapped in an Error.
          OTOH, the ability to obtain a DocNrSkipper directly from an index is certainly appealing,
          and then IOException is unavoidable.

          > 3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here?

          The presence of class BitSet in the bits() method of Filter
          makes it impossible to provide another implementation of a Filter.
          SkipFilter/DocNrSkipper are intended to be parallel to Filter/BitSet,
          and the DocNrSkipper interface allows alternative implementations.
          Both SkipFilter and Filter are interfaces for factories/caches of for filter data structures.

          I'd like to somehow have these parallel paths merged, but I don't now how to
          do that. Perhaps SkipFilter could allow backward compatibility by also
          providing a bits() method, and use that method when it does not throw for
          example an UnsupportedOperationException.
          Another way would be to deprecate Filter in favour of SkipFilter, but that would
          have a lot of backward compatibility issues, and perhaps also some
          performance issues.
          The FilteredQuery of LUCENE-330 allows for both paths to be used,
          both paths are joined at line 211 in this FilteredQuery.

          The IterFilter of LUCENE-330 was replaced by SkipFilter, I forgot
          to indicate that when I downloaded the replacement. I have just deleted
          IterFilter there.

          > Must say, excelent work!

          Thanks. I should add that most of the hard work had already been done in
          org.apache.lucene.store.InputStream.readVInt() and
          org.apache.lucene.store.OutputStream.writeVInt().

          Regards,
          Paul Elschot

          Show
          Paul Elschot added a comment - > 1. Any particular reason for SortedVIntList not to implement DocNrSkipper interface, the method getDocNrSkipper() is there, but declaration is missing. The object returned by the getDocNrSkipper() method implements the interface by adding a bit of state for the iteration over the document numbers. This allows more than one iterator on the (non modifiable) SortedVIntList. > 2. Should getDocNrSkipper() DocNrSkipper interface throw IOException? I have tried to add TermDocsSortedIntList to the family, but all methods in TermDocs are throwing IOException, and it is not nice to eat silently this exception too early in DocNrSkipper. Better ideas to deal with that? I have no problem with the addition of throwing an IOException to the DocNrSkipper interface. The idea is to filter query results from RAM from which one would not normally expect an IOException , so one could also consider rethrowing the IOException wrapped in an Error. OTOH, the ability to obtain a DocNrSkipper directly from an index is certainly appealing, and then IOException is unavoidable. > 3. Paul, why SkipFilter exists (here I refer to the JIRA-330 )? Wouldn't be better to use DocNrSkipper interface instead (SkipFilter does nothing but wrapping this interface). Also, the same question applies to IterFilter. Did I get something wrong here? The presence of class BitSet in the bits() method of Filter makes it impossible to provide another implementation of a Filter. SkipFilter/DocNrSkipper are intended to be parallel to Filter/BitSet, and the DocNrSkipper interface allows alternative implementations. Both SkipFilter and Filter are interfaces for factories/caches of for filter data structures. I'd like to somehow have these parallel paths merged, but I don't now how to do that. Perhaps SkipFilter could allow backward compatibility by also providing a bits() method, and use that method when it does not throw for example an UnsupportedOperationException. Another way would be to deprecate Filter in favour of SkipFilter, but that would have a lot of backward compatibility issues, and perhaps also some performance issues. The FilteredQuery of LUCENE-330 allows for both paths to be used, both paths are joined at line 211 in this FilteredQuery. The IterFilter of LUCENE-330 was replaced by SkipFilter, I forgot to indicate that when I downloaded the replacement. I have just deleted IterFilter there. > Must say, excelent work! Thanks. I should add that most of the hard work had already been done in org.apache.lucene.store.InputStream.readVInt() and org.apache.lucene.store.OutputStream.writeVInt(). Regards, Paul Elschot
          Hide
          Paul Elschot added a comment -

          This patches Filter.java and IndexSearcher.java .

          Filter.java is modified to implement SkipFilter, to provide
          a first step in a backward compatible way to slowly make Filter
          independent of BitSet.
          IndexSearcher.java is modified to test for a DocNrSkipper from
          a given Filter, and to use that. In that case also skipTo() is used
          on the scorer of the query being filtered.
          This patch requires org.apache.lucene.util.DocNrSkipper, which
          available at this issue.
          Also required is org.apache.lucene.search.SkipFilter, which
          is available at LUCENE-330.

          The patch also contains some commented test code for Filter.java.
          This test code always provides a DocNrSkipper (from the BitSet).
          With and without this test code, all tests pass here.

          When extending Filter in this way, SkipFilter may not be necessary
          at all. I left it in to allow a path forward to complete independence
          from BitSet.
          In case SkipFilter stays, it would be good to add (a) new method(s)
          to IndexSearcher allowing a SkipFilter to filter a query.

          The DocNrSkipper interface contains only one method:
          nextDocNr(int docNr).
          It may be good for performance to also add a nextDocNr()
          method without any argument, much like skipTo(target) and next()
          on Scorer. IOW, I do not consider DocNrSkipper stable at this moment.

          I don't think this patch should be added to release 2.0.

          Show
          Paul Elschot added a comment - This patches Filter.java and IndexSearcher.java . Filter.java is modified to implement SkipFilter, to provide a first step in a backward compatible way to slowly make Filter independent of BitSet. IndexSearcher.java is modified to test for a DocNrSkipper from a given Filter, and to use that. In that case also skipTo() is used on the scorer of the query being filtered. This patch requires org.apache.lucene.util.DocNrSkipper, which available at this issue. Also required is org.apache.lucene.search.SkipFilter, which is available at LUCENE-330 . The patch also contains some commented test code for Filter.java. This test code always provides a DocNrSkipper (from the BitSet). With and without this test code, all tests pass here. When extending Filter in this way, SkipFilter may not be necessary at all. I left it in to allow a path forward to complete independence from BitSet. In case SkipFilter stays, it would be good to add (a) new method(s) to IndexSearcher allowing a SkipFilter to filter a query. The DocNrSkipper interface contains only one method: nextDocNr(int docNr). It may be good for performance to also add a nextDocNr() method without any argument, much like skipTo(target) and next() on Scorer. IOW, I do not consider DocNrSkipper stable at this moment. I don't think this patch should be added to release 2.0.
          Hide
          Paul Elschot added a comment -

          Starting from SkipFilter1.patch as above, a replacement of Filter by SkipFilter in the various API's
          (Searcher, Searchable and implementors) is straightforward. The only thing further needed is a
          checked cast to Filter in IndexSearcher.search(weight, filter, hitcollector) for the case when
          the DocNrSkipper is null. (When that cast to Filter fails an IllegalArgumentException can be thrown).
          After that, all tests pass again, with and without the test code in Filter to always use a DocNrSkipper.

          That means that it is easier than expected to replace Filter by SkipFilter altogether.

          Show
          Paul Elschot added a comment - Starting from SkipFilter1.patch as above, a replacement of Filter by SkipFilter in the various API's (Searcher, Searchable and implementors) is straightforward. The only thing further needed is a checked cast to Filter in IndexSearcher.search(weight, filter, hitcollector) for the case when the DocNrSkipper is null. (When that cast to Filter fails an IllegalArgumentException can be thrown). After that, all tests pass again, with and without the test code in Filter to always use a DocNrSkipper. That means that it is easier than expected to replace Filter by SkipFilter altogether.
          Hide
          Paul Elschot added a comment -

          This issue is outdated now. The related LUCENE-584 has a new implementation.

          Show
          Paul Elschot added a comment - This issue is outdated now. The related LUCENE-584 has a new implementation.
          Hide
          Otis Gospodnetic added a comment -

          Moved to LUCENE-584.
          Those of you who voted for this (5 votes!) may want to vote for LUCENE-584 now.

          Show
          Otis Gospodnetic added a comment - Moved to LUCENE-584 . Those of you who voted for this (5 votes!) may want to vote for LUCENE-584 now.

            People

            • Assignee:
              Unassigned
              Reporter:
              Paul Elschot
            • Votes:
              5 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development