Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The purpose of QueryWrapperFilter is to simply filter to include the docIDs that match the query.

      Its implementation is wasteful now because it computes scores for those matching docs even though the score is unused. We could fix this by getting a Scorer and iterating through the docs without asking for the score:

      Index: src/java/org/apache/lucene/search/QueryWrapperFilter.java
      ===================================================================
      --- src/java/org/apache/lucene/search/QueryWrapperFilter.java	(revision 707060)
      +++ src/java/org/apache/lucene/search/QueryWrapperFilter.java	(working copy)
      @@ -62,11 +62,9 @@
         public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
           final OpenBitSet bits = new OpenBitSet(reader.maxDoc());
       
      -    new IndexSearcher(reader).search(query, new HitCollector() {
      -      public final void collect(int doc, float score) {
      -        bits.set(doc);  // set bit for hit
      -      }
      -    });
      +    final Scorer scorer = query.weight(new IndexSearcher(reader)).scorer(reader);
      +    while(scorer.next())
      +      bits.set(scorer.doc());
           return bits;
         }
      

      Maybe I'm missing something, but this seams like a simple win?

        Activity

        Hide
        Paul Elschot added a comment -

        Indeed a simple win, but it can be even simpler (untested):

          return new DocIdSet() {
            public DocIdSetIterator iterator() {
              return scorer;
            }
          };
        

        This DocIdSetIterator should be used only once, so it's best to cache it in a CachingWrapperFilter, and the javadocs could indicate that.

        See also LUCENE-1296, which allows the choice of a supporting data structure other than OpenBitSet.

        Show
        Paul Elschot added a comment - Indeed a simple win, but it can be even simpler (untested): return new DocIdSet() { public DocIdSetIterator iterator() { return scorer; } }; This DocIdSetIterator should be used only once, so it's best to cache it in a CachingWrapperFilter, and the javadocs could indicate that. See also LUCENE-1296 , which allows the choice of a supporting data structure other than OpenBitSet.
        Hide
        Michael McCandless added a comment -

        Excellent! That's about as simple as it can get

        But, I don't really like forcefully wrapping CachingWrapperFilter inside – that should be up to the caller to decide?

        Show
        Michael McCandless added a comment - Excellent! That's about as simple as it can get But, I don't really like forcefully wrapping CachingWrapperFilter inside – that should be up to the caller to decide?
        Hide
        Paul Elschot added a comment -

        But, I don't really like forcefully wrapping CachingWrapperFilter inside - that should be up to the caller to decide?

        Agree. Perhaps we should have a OneTimeDocIdSet for cases like this one, and leave the possibility to repeatedly generate a DocIdSetIterator to caching filters. Just thinking out loud.

        A OneTimeDocIdSet would throw an for example an IllegalStateException when its iterator() method is called more than once.

        Show
        Paul Elschot added a comment - But, I don't really like forcefully wrapping CachingWrapperFilter inside - that should be up to the caller to decide? Agree. Perhaps we should have a OneTimeDocIdSet for cases like this one, and leave the possibility to repeatedly generate a DocIdSetIterator to caching filters. Just thinking out loud. A OneTimeDocIdSet would throw an for example an IllegalStateException when its iterator() method is called more than once.
        Hide
        Michael McCandless added a comment -

        Perhaps we should have a OneTimeDocIdSet for cases like this one, and leave the possibility to repeatedly generate a DocIdSetIterator to caching filters

        I'm torn on this. It's nice in that it'd "forcefully" remind you that if you are re-using a filter you really should cache it. But, then, there are legitimate cases where you don't want to cache it (eg you know you will use it rarely, it's fast enough, and you don't want to spend the RAM).

        Also, for this instance it'd be a break in back compatibility since you can currently re-use a QueryWrapperFilter instance.

        So I guess I'm leaning back towards my original patch, which still allows re-use, but does not waste CPU computing scores which are just discarded.

        Show
        Michael McCandless added a comment - Perhaps we should have a OneTimeDocIdSet for cases like this one, and leave the possibility to repeatedly generate a DocIdSetIterator to caching filters I'm torn on this. It's nice in that it'd "forcefully" remind you that if you are re-using a filter you really should cache it. But, then, there are legitimate cases where you don't want to cache it (eg you know you will use it rarely, it's fast enough, and you don't want to spend the RAM). Also, for this instance it'd be a break in back compatibility since you can currently re-use a QueryWrapperFilter instance. So I guess I'm leaning back towards my original patch, which still allows re-use, but does not waste CPU computing scores which are just discarded.
        Hide
        Paul Elschot added a comment -

        The new Filter api allows to split the concerns of which data structure to use for collecting the docs in the DocIdSet and the cached data structure used to iterate over this set, and this is what shows up here.

        For backward compatibility QueryWrapperFilter could use an OpenBitSet that is good for collecting the docids, but the new Filter api leaves it not really necessary to use a data structure at all (see my initial suggestion).

        So the question is how we want to deal with the split between initial collecting and later repeated iterations. OpenBitSet is certainly good for collecting, so a good and backward compatible way would be to document the use of OpenBitSet in the javadocs of QueryWrapperFilter, and let CachingWrapperFilter decide later which data structure to cache.
        The alternative would be to let CachingWrapperFilter always do the initial collecting , but that would not be backward compatible.

        instanceof could be used to decide at CachingWrapperFilter to do this initial collecting when it's not sure that the given data structure allows repeated iteration, but it may be better to add a boolean method to DocIdSet that indicates whether the iterator can be used more than once or not. However, that is better left to LUCENE-1296 .

        In short, I'd like to have a javadoc remark added to the original patch on the use of OpenBitSet, and leave the rest to LUCENE-1296 .

        Show
        Paul Elschot added a comment - The new Filter api allows to split the concerns of which data structure to use for collecting the docs in the DocIdSet and the cached data structure used to iterate over this set, and this is what shows up here. For backward compatibility QueryWrapperFilter could use an OpenBitSet that is good for collecting the docids, but the new Filter api leaves it not really necessary to use a data structure at all (see my initial suggestion). So the question is how we want to deal with the split between initial collecting and later repeated iterations. OpenBitSet is certainly good for collecting, so a good and backward compatible way would be to document the use of OpenBitSet in the javadocs of QueryWrapperFilter, and let CachingWrapperFilter decide later which data structure to cache. The alternative would be to let CachingWrapperFilter always do the initial collecting , but that would not be backward compatible. instanceof could be used to decide at CachingWrapperFilter to do this initial collecting when it's not sure that the given data structure allows repeated iteration, but it may be better to add a boolean method to DocIdSet that indicates whether the iterator can be used more than once or not. However, that is better left to LUCENE-1296 . In short, I'd like to have a javadoc remark added to the original patch on the use of OpenBitSet, and leave the rest to LUCENE-1296 .
        Hide
        Michael McCandless added a comment -

        Actually, can't we simply instantiate a new scorer each time iterator() is called? Then we don't need an intermediate OpenBitSet and we can simply return the scorer (your original suggestion).

        The only problem is... we then need to add "throws IOException" to DocIdSet.iterator(). While that is technically a non-back-compatible change (places that call DocIdSet.iterator() may suddenly have to add "throws IOException" to their method signatures, up the chain), I think it's likely very rare in practice that a code change would be needed, since the next() method of the iterator throws IOException and presumably almost all code that gets an iterator then next()'s through it. There were no changes in Lucene's core or contrib sources necessary on adding this. I think it's an acceptable change.

        Then the patch looks like this:

        Index: src/java/org/apache/lucene/search/DocIdSet.java
        ===================================================================
        --- src/java/org/apache/lucene/search/DocIdSet.java	(revision 708628)
        +++ src/java/org/apache/lucene/search/DocIdSet.java	(working copy)
        @@ -17,11 +17,12 @@
          * limitations under the License.
          */
         
        +import java.io.IOException;
         
         /**
          * A DocIdSet contains a set of doc ids. Implementing classes must provide
          * a {@link DocIdSetIterator} to access the set. 
          */
         public abstract class DocIdSet {
        -	public abstract DocIdSetIterator iterator();
        +	public abstract DocIdSetIterator iterator() throws IOException;
         }
        Index: src/java/org/apache/lucene/search/QueryWrapperFilter.java
        ===================================================================
        --- src/java/org/apache/lucene/search/QueryWrapperFilter.java	(revision 708628)
        +++ src/java/org/apache/lucene/search/QueryWrapperFilter.java	(working copy)
        @@ -59,15 +59,13 @@
             return bits;
           }
           
        -  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        -    final OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        -
        -    new IndexSearcher(reader).search(query, new HitCollector() {
        -      public final void collect(int doc, float score) {
        -        bits.set(doc);  // set bit for hit
        +  public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
        +    final Weight weight = query.weight(new IndexSearcher(reader));
        +    return new DocIdSet() {
        +      public DocIdSetIterator iterator() throws IOException {
        +        return weight.scorer(reader);
               }
        -    });
        -    return bits;
        +    };
           }
         
           public String toString() {
        

        I do agree, longer term, that clarifying the semantics to allow some DocIDSets that do not allow more than one call to iterator(), and then requiring something like CachingWrapperFilter to "translate" between different DocIdSets (compact or not, re-iterable, etc) is worth thinking about. Though, besides this case, which seems easy to fix by just getting another scorer in iterator(), are there other places where not having to provide a repeatable iterator buys us some compelling freedom?

        Show
        Michael McCandless added a comment - Actually, can't we simply instantiate a new scorer each time iterator() is called? Then we don't need an intermediate OpenBitSet and we can simply return the scorer (your original suggestion). The only problem is... we then need to add "throws IOException" to DocIdSet.iterator(). While that is technically a non-back-compatible change (places that call DocIdSet.iterator() may suddenly have to add "throws IOException" to their method signatures, up the chain), I think it's likely very rare in practice that a code change would be needed, since the next() method of the iterator throws IOException and presumably almost all code that gets an iterator then next()'s through it. There were no changes in Lucene's core or contrib sources necessary on adding this. I think it's an acceptable change. Then the patch looks like this: Index: src/java/org/apache/lucene/search/DocIdSet.java =================================================================== --- src/java/org/apache/lucene/search/DocIdSet.java (revision 708628) +++ src/java/org/apache/lucene/search/DocIdSet.java (working copy) @@ -17,11 +17,12 @@ * limitations under the License. */ + import java.io.IOException; /** * A DocIdSet contains a set of doc ids. Implementing classes must provide * a {@link DocIdSetIterator} to access the set. */ public abstract class DocIdSet { - public abstract DocIdSetIterator iterator(); + public abstract DocIdSetIterator iterator() throws IOException; } Index: src/java/org/apache/lucene/search/QueryWrapperFilter.java =================================================================== --- src/java/org/apache/lucene/search/QueryWrapperFilter.java (revision 708628) +++ src/java/org/apache/lucene/search/QueryWrapperFilter.java (working copy) @@ -59,15 +59,13 @@ return bits; } - public DocIdSet getDocIdSet(IndexReader reader) throws IOException { - final OpenBitSet bits = new OpenBitSet(reader.maxDoc()); - - new IndexSearcher(reader).search(query, new HitCollector() { - public final void collect( int doc, float score) { - bits.set(doc); // set bit for hit + public DocIdSet getDocIdSet( final IndexReader reader) throws IOException { + final Weight weight = query.weight( new IndexSearcher(reader)); + return new DocIdSet() { + public DocIdSetIterator iterator() throws IOException { + return weight.scorer(reader); } - }); - return bits; + }; } public String toString() { I do agree, longer term, that clarifying the semantics to allow some DocIDSets that do not allow more than one call to iterator(), and then requiring something like CachingWrapperFilter to "translate" between different DocIdSets (compact or not, re-iterable, etc) is worth thinking about. Though, besides this case, which seems easy to fix by just getting another scorer in iterator(), are there other places where not having to provide a repeatable iterator buys us some compelling freedom?
        Hide
        Paul Elschot added a comment -

        ... are there other places where not having to provide a repeatable iterator buys us some compelling freedom?

        One might want to cache the Filter based on the version (last mod time) of the index, and not based on the actual index reader. In that case the original reader could not be available when the Filter is used.

        I don't know whether that is compelling given the current cost of (re)opening an index.

        Show
        Paul Elschot added a comment - ... are there other places where not having to provide a repeatable iterator buys us some compelling freedom? One might want to cache the Filter based on the version (last mod time) of the index, and not based on the actual index reader. In that case the original reader could not be available when the Filter is used. I don't know whether that is compelling given the current cost of (re)opening an index.
        Hide
        Michael McCandless added a comment -

        OK I plan to commit the above patch in a day or two.

        Show
        Michael McCandless added a comment - OK I plan to commit the above patch in a day or two.
        Hide
        Michael McCandless added a comment -

        Committed revision 709459.

        Thanks Paul!

        Show
        Michael McCandless added a comment - Committed revision 709459. Thanks Paul!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development