Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None

      Description

      package org.apache.lucene.search;
      
      public abstract class Filter implements java.io.Serializable 
      {
        public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
      }
      
      public interface AbstractBitSet 
      {
        public boolean get(int index);
      }
      
      

      It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.

      Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible.
      Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint.

      Though it is possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose.
      That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.
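      A minimal sketch of the proposal (the adapter and sparse class names below are illustrative, not proposed API): the default implementation delegates to =java.util.BitSet=, while a sparse implementation stores only the visible doc ids.

      ```java
      import java.util.Arrays;
      import java.util.BitSet;

      // The proposed abstraction: callers depend only on get(int).
      interface AbstractBitSet {
        boolean get(int index);
      }

      // Default implementation: delegate to java.util.BitSet, as proposed above.
      class JavaBitSetAdapter implements AbstractBitSet {
        private final BitSet bits;
        JavaBitSetAdapter(BitSet bits) { this.bits = bits; }
        public boolean get(int index) { return bits.get(index); }
      }

      // Sparse alternative: a sorted array of the few visible doc ids needs
      // one int per visible doc instead of one bit per doc in the whole index.
      class SortedIntsBitSet implements AbstractBitSet {
        private final int[] docIds; // must be sorted ascending
        SortedIntsBitSet(int[] sortedDocIds) { this.docIds = sortedDocIds; }
        public boolean get(int index) {
          return Arrays.binarySearch(docIds, index) >= 0;
        }
      }
      ```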

      1. Test20080111.patch
        6 kB
        Paul Elschot
      2. Some Matchers.zip
        9 kB
        Eks Dev
      3. Matcher-20071122-1ground.patch
        32 kB
        Paul Elschot
      4. Matcher-20070905-3core.patch
        18 kB
        Paul Elschot
      5. Matcher-20070905-2default.patch
        75 kB
        Paul Elschot
      6. lucene-584-take5-part2.patch
        76 kB
        Michael Busch
      7. lucene-584-take5-part1.patch
        47 kB
        Michael Busch
      8. lucene-584-take4-part2.patch
        77 kB
        Michael Busch
      9. lucene-584-take4-part1.patch
        39 kB
        Michael Busch
      10. lucene-584-take3-part2.patch
        77 kB
        Michael Busch
      11. lucene-584-take3-part1.patch
        37 kB
        Michael Busch
      12. lucene-584-take2.patch
        52 kB
        Michael Busch
      13. lucene-584.patch
        34 kB
        Michael Busch
      14. ContribQueries20080111.patch
        2 kB
        Paul Elschot
      15. CHANGES.txt.patch
        0.7 kB
        Paul Elschot
      16. bench-diff.txt
        7 kB
        Otis Gospodnetic
      17. bench-diff.txt
        8 kB
        Doron Cohen

        Issue Links

          Activity

          Mark Miller added a comment -

          It's better now I think - could probably be improved still (by someone who has sunk their head into the new Scorer stuff more than I have) - but I read over the code and did what I could. At the least, it's no longer wildly out of whack.

          Michael Busch added a comment -

          Mark, are you working on this? Wanna assign this to you?

          Mark Miller added a comment -

          I think I have tracked down this issue as the one that most changed the Scorer documentation in org.apache.lucene.search.package.html.

          I've started updating it, but it needs a tad more work I think - reopening this issue as a convenient marker for this work.

          Paul Elschot added a comment -

          Wouter, about this:
          java.lang.ClassCastException: java.util.BitSet cannot be cast to org.apache.lucene.search.DocIdSet

          LUCENE-1187 should have fixed this, so could you file a bug report?
          In case you need a workaround, also have a look at LUCENE-1296.

          Wouter Heijke added a comment -

          We got the same error here on a 15Gb index with Lucene 2.4.0:

          java.lang.ClassCastException: java.util.BitSet cannot be cast to org.apache.lucene.search.DocIdSet
          org.apache.lucene.search.CachingWrapperFilter.getDocIdSet(CachingWrapperFilter.java:76)
          org.apache.lucene.misc.ChainedFilter.getDocIdSet(ChainedFilter.java:200)
          org.apache.lucene.misc.ChainedFilter.getDocIdSet(ChainedFilter.java:145)
          org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:140)
          org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
          org.apache.lucene.search.Searcher.search(Searcher.java:136)

          Paul Elschot added a comment -

          From the traceback I suppose this happened at the end, using the ChainedFilter?
          Iirc ChainedFilter is from contrib/..., and it is mentioned at LUCENE-1187 as one of the things to be done.
          Could you contribute this code as a contrib/... test case there?
          Sorry, I don't remember exactly from which contrib module ChainedFilter is.

          Mark Miller added a comment - - edited

          I think there is still an issue here. The code below just broke for me.

          java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet
          at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55)
          at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:177)
          at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:152)
          at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49)

            public void testChainedCachedQueryFilter() throws IOException, ParseException {
              String path = "c:/TestIndex";
              Analyzer analyzer = new WhitespaceAnalyzer();
              IndexWriter writer = new IndexWriter(path, analyzer, true);
          
              Document doc = new Document();
              doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
              doc.add(new Field("content", "the big bad fox", Store.NO, Index.TOKENIZED));
              writer.addDocument(doc);
              doc = new Document();
              doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
              doc.add(new Field("content", "the big bad pig", Store.NO, Index.TOKENIZED));
              writer.addDocument(doc);
              doc = new Document();
              doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
              doc.add(new Field("content", "the horrific girl", Store.NO, Index.TOKENIZED));
              writer.addDocument(doc);
              doc = new Document();
              doc.add(new Field("category", "blue", Store.YES, Index.TOKENIZED));
              doc.add(new Field("content", "the dirty boy", Store.NO, Index.TOKENIZED));
              writer.addDocument(doc);
              doc = new Document();
              doc.add(new Field("category", "blue", Store.YES, Index.TOKENIZED));
              doc.add(new Field("content", "the careful bad fox", Store.NO, Index.TOKENIZED));
              writer.addDocument(doc);
          
              writer.close();
          
              Searcher searcher = null;
          
              searcher = new IndexSearcher(path);
          
              QueryParser qp = new QueryParser("field", new KeywordAnalyzer());
              Query query = qp.parse("content:fox");
              QueryWrapperFilter queryFilter = new QueryWrapperFilter(query);
              CachingWrapperFilter cwf = new CachingWrapperFilter(queryFilter);
          
              TopDocs hits = searcher.search(query, cwf, 1);
              System.out.println("hits:" + hits.totalHits);
          
              queryFilter = new QueryWrapperFilter(qp.parse("category:red"));
              CachingWrapperFilter fcwf = new CachingWrapperFilter(queryFilter);
              Filter[] chain = new Filter[2];
              chain[0] = cwf;
              chain[1] = fcwf;
              ChainedFilter cf = new ChainedFilter(chain, ChainedFilter.AND);
          
              hits = searcher.search(new MatchAllDocsQuery(), cf, 1);
          
              System.out.println("red:" + hits.totalHits);
          
              queryFilter = new QueryWrapperFilter(qp.parse("category:blue"));
              CachingWrapperFilter fbcwf = new CachingWrapperFilter(queryFilter);
              chain = new Filter[2];
              chain[0] = cwf;
              chain[1] = fbcwf;
              cf = new ChainedFilter(chain, ChainedFilter.AND);
          
              hits = searcher.search(new MatchAllDocsQuery(), cf, 1);
          
              System.out.println("blue:" + hits.totalHits);
          
            }
          
          
          Michael Busch added a comment -

          I have attached a patch to CHANGES.txt to explicitly state that filters outside the lucene code base will need to be adapted.

          I added your comment to CHANGES.txt, Paul. Thanks again!

          Paul Elschot added a comment -

          Thanks, my pleasure.

          I have attached a patch to CHANGES.txt to explicitly state that filters outside the lucene code base will need to be adapted.

          Michael Busch added a comment -

          Committed.

          I looked again at all the comments here - wow, this issue was open for a long time and a lot of work was done here.

          Paul, thanks for your patience and all your hard work!!

          Michael Busch added a comment -

          Thanks, Paul for testing and reviewing.

          I'll correct the javadocs.

          OK, I will commit this tomorrow if nobody objects!

          Paul Elschot added a comment -

          The take5 patch tests ok here.

          One very minor remark: the javadoc at RangeFilter.getDocIdSet still mentions BitSet.

          Michael Busch added a comment -

          OK here's a new version of the patch.

          It's based on the take4 patch with the following changes:

          • removed BitSetFilter
          • added getBitSet() method to DocIdBitSet
          • added Paul's Test20080111.patch
          • changed TestScorerPerf to use a DocIdBitSet
          • changed ChainedFilterTest, CachedFilterBuilder, TestRemoteSearchable
            to use QueryWrapperFilter instead of deprecated QueryFilter

          Comments:

          • As mentioned in my previous comment it's not possible to wrap a
            CachingWrapperFilter around a BitSetFilter and then retrieve the BitSet
            from the CachingWrapperFilter. That's the reason why I removed
            BitSetFilter and added the getBitSet() method to DocIdBitSet instead.
            So you can do something like this:
            DocIdSet docIdSet = filter.getDocIdSet();
            if (docIdSet instanceof DocIdBitSet) {
              BitSet bits = ((DocIdBitSet) docIdSet).getBitSet();
              ...
            }
            
          • I didn't include Paul's ContribQueries20080111.patch because it only
            changes some contrib filters to extend BitSetFilter instead of Filter.
          • I like Eks' suggestion of implementing the BooleanFilter in a way that
            it can use any DocIdSetIterator and optimize special cases, when DocIdSet
            is a DocIdBitSet, OpenBitSet, etc. We should do this with a different JIRA
            issue - this patch is already big enough.

          All core & contrib tests pass.

          I'd like to commit this in a couple of days if nobody objects.

          Michael Busch added a comment -

          You could save a method invocation in cases like this if skipTo() returned the next doc id rather than a boolean. Returning a -1 would be the equivalent of what used to be "false".

          To change the signature of skipTo() would be an API change, because with this patch Scorer extends DocIdSetIterator.

          -Michael

          Mark Harwood added a comment -

          Hi Paul,
          Just eyeballed the code but not had a chance to patch and run it yet.

          I was wondering about the return type for skipTo() after looking at these types of calls:
          if (docIdSetIterator.skipTo(i) && (docIdSetIterator.doc() == i))

          You could save a method invocation in cases like this if skipTo() returned the next doc id rather than a boolean. Returning a -1 would be the equivalent of what used to be "false".
          Not tried benchmarking it but does this seem like something worth considering?
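          A sketch of the two signatures being weighed (the iterator class and its array backing here are hypothetical, not the actual patch API), showing where the second method call disappears:

          ```java
          // Hypothetical skipping iterator over a sorted array of doc ids, with
          // both skipTo() signatures side by side.
          class SortedDocs {
            private final int[] docs; // sorted ascending
            private int pos = 0;
            private int current = -1;
            SortedDocs(int[] docs) { this.docs = docs; }

            // boolean form: a second call to doc() is needed to learn the position.
            boolean skipToBool(int target) {
              while (pos < docs.length && docs[pos] < target) pos++;
              if (pos == docs.length) return false;
              current = docs[pos];
              return true;
            }
            int doc() { return current; }

            // int form: returns the doc id >= target directly; -1 stands in for "false".
            int skipToInt(int target) {
              return skipToBool(target) ? current : -1;
            }
          }

          class SkipToDemo {
            // Two invocations with the boolean form...
            static boolean matchesBool(SortedDocs it, int i) {
              return it.skipToBool(i) && it.doc() == i;
            }
            // ...one with the int form.
            static boolean matchesInt(SortedDocs it, int i) {
              return it.skipToInt(i) == i;
            }
          }
          ```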

          Cheers
          Mark

          Paul Elschot added a comment -

          I indeed recall having a problem with remote filter caching. At the time I thought it was related to serialization but I could not resolve it that way. Never mind, it does not matter anymore.

          BooleanFilter and ChainedFilter have the same issue here. As they provide just about the same functionality, could they perhaps be merged?

          The solution using DocIdSet.and() and DocIdSet.or() looks good to me, but it will require some form of collector for the results, much like HitCollector.collect(doc, score) now and MatchCollector.collect(doc) in the Matcher...patch.
          The boolean operations could then be accumulated into a BitSet or into an OpenBitSet, using a special case for DocId(Open)BitSet.

          I'd like these boolean operations on DocIdSets to be general enough for use in Scorers, for example for the conjunctions in ConjunctionScorer, PhraseScorer and in the two NearSpans. But that is another issue.

          Eks Dev added a comment -

          Michael, would this work?
          1. providing a default implementation for the basic methods that uses the skipping iterator (always there), so it works by default for all implementations, something along these lines:

          /**
           * A DocIdSet contains a set of doc ids. Implementing classes must provide
           * a {@link DocIdSetIterator} to access the set.
           */
          public abstract class DocIdSet {
            public abstract DocIdSetIterator iterator();

            public DocIdSet and(DocIdSet other) {
              // default implementation using the *iterator*
            }

            public DocIdSet or(DocIdSet other) {
              // default implementation using the iterator
            }
          }

          2. And then we optimize particular cases, e.g

          public class DocIdBitSet extends DocIdSet {
            BitSet bits; // must be there in order for the iterator to work...

            public DocIdSetIterator iterator() {
              // this is easy...
            }

            public DocIdSet and(DocIdSet dis) {
              if (dis instanceof DocIdBitSet) {
                // not exactly like this, but the idea is there
                this.bits.and(((DocIdBitSet) dis).bits);
                return this;
              }
              return super.and(dis);
            }
          }

          So it works always, and it works fast if need be, one instanceof check does not hurt there. Did I miss something obvious?
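          For the default and() left as a comment above, a leapfrog intersection over two skipping iterators is one plausible shape (the iterator interface and class names here are illustrative, not the actual Lucene API):

          ```java
          import java.util.ArrayList;
          import java.util.List;

          // Minimal stand-in for a skipping doc-id iterator.
          interface DocIds {
            int next();             // next doc id in order, or -1 when exhausted
            int skipTo(int target); // first doc id >= target, or -1 when exhausted
          }

          class IntArrayDocIds implements DocIds {
            private final int[] docs; // sorted ascending
            private int pos = 0;
            IntArrayDocIds(int[] docs) { this.docs = docs; }
            public int next() { return pos < docs.length ? docs[pos++] : -1; }
            public int skipTo(int target) {
              while (pos < docs.length && docs[pos] < target) pos++;
              return next();
            }
          }

          class DocIdSetOps {
            // Leapfrog: repeatedly advance whichever side is behind; emit the
            // doc ids both sides agree on.
            static List<Integer> and(DocIds a, DocIds b) {
              List<Integer> out = new ArrayList<>();
              int da = a.next(), db = b.next();
              while (da != -1 && db != -1) {
                if (da == db) { out.add(da); da = a.next(); db = b.next(); }
                else if (da < db) da = a.skipTo(db);
                else db = b.skipTo(da);
              }
              return out;
            }
          }
          ```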

          Michael Busch added a comment -

          I think I understand now which problems you had when you wanted to
          change BooleanFilter and xml-query-parser to use the new Filter APIs.

          BooleanFilter is optimized to utilize BitSets for performing boolean
          operations fast. Now if we change BooleanFilter to use the new
          DocIdSetIterator, then it can't use the fast BitSet operations (e. g.
          union for or, intersect for and) anymore.
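          The fast operations referred to here are java.util.BitSet's built-in and()/or(); a small self-contained illustration (the wrapper class is just for the example):

          ```java
          import java.util.BitSet;

          // union for OR, intersection for AND, done word-at-a-time by BitSet itself.
          class BitSetBoolOps {
            static BitSet or(BitSet a, BitSet b) {
              BitSet r = (BitSet) a.clone();
              r.or(b);   // union
              return r;
            }
            static BitSet and(BitSet a, BitSet b) {
              BitSet r = (BitSet) a.clone();
              r.and(b);  // intersection
              return r;
            }
          }
          ```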

          Now we can introduce BitSetFilter as you suggested and what I did in
          the take4 patch. But here's the problem: Introducing subclasses of
          Filter doesn't play nicely with the caching mechanism in Lucene.
          For example: if we change BooleanFilter to only work with
          BitSetFilters, then it won't work with a CachingWrapperFilter anymore,
          because CachingWrapperFilter extends Filter. Then we would have to
          introduce new CachingWrapper***Filter, for the different kinds of
          Filter subclasses, which is a bad design as Mark pointed out in his
          comment: https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12547901#action_12547901

          One solution would be to add a getBitSet() method to DocIdBitSet.
          DocIdBitSet is a new class that is basically just a wrapper around a
          Java BitSet and provides a DocIdSetIterator to access the BitSet.

          Then BooleanFilter could do something like this:

          DocIdSet docIdSet = filter.getDocIdSet();
          if (docIdSet instanceof DocIdBitSet) {
            BitSet bits = ((DocIdBitSet) docIdSet).getBitSet();
            ... // existing code
          } else {
            throw new UnsupportedOperationException(
                "BooleanFilter only supports Filters that use DocIdBitSet.");
          }
          

          But then, changing the core filters to use OpenBitSets instead of
          Java BitSets is technically an API change, because BooleanFilter
          would not work anymore with the core filters.

          So if we took this approach we would have to wait until 3.0 to move
          the core from BitSet to OpenBitSet and also change BooleanFilter
          then to use OpenBitSets. BooleanFilter could then also work with
          either of the two BitSet implementations, but probably not with those
          two mixed.

          Any feedback about this is very welcome. I'll try to further think
          about how to marry the new Filter API, caching mechanism and Filter
          implementations like BooleanFilter nicely.

          Paul Elschot added a comment -

          Upload once more, this time with licence.

          Paul Elschot added a comment -

          I moved Filter "forward" by removing the deprecated bits() method
          and declaring the getDocIdSet() method abstract.

          To get ant test-core passing with this it was necessary to
          update some test code to implement Filter.getDocIdSet(),
          and a patch for that is attached: Test20080111.patch .

          The change to Filter is not included in the patch.
          Also, in a few places the take4 patch uses Filter.bits() and other
          deprecated methods. These methods had to be removed, but this
          removal is not included in the patch.

          Eks Dev added a comment -

          hmm, in order to have fast and/or operations we need to know the type of the underlying object in the Filter, and sometimes we must use iterators (e.g. the case where one Filter/DocIdSet is an int list and the other a hash bit set). I guess, knowing the type of the DocIdSet is the trick to pull.

          The default implementation of ChainedFilter (there is also BooleanFilter somewhere in contrib, I like it more) should be using iterators (like scorers), and at a few key points checking if (first instanceof SomeDocIdSetImpl && second instanceof SomeDocIdSetImpl) first.doFastOR/AND(second);

          something in that direction looks reasonable to me for ChainedFilter
          If it proves to be really better to have it around. I am still of an opinion that it would be better to integrate DocIdSet into BooleanQuery as a clause, somehow, that would be some kind of ConstantBoolean(MUST/SHOULD/NOT)Clause, much cleaner from design/usability point of view, even at some minor penalty in performance (anyhow, you can always combine filters before you enter scorers) but you are right that is another issue... let us stop polluting this issue
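The instanceof fast path with an iterator fallback could be sketched like this. All class and method names here are made up for illustration (this is not ChainedFilter's actual code); the point is that the generic path only needs next()/skipTo()/doc(), while a type check can unlock a faster representation-specific path:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-ins for illustration; not Lucene's actual classes.
interface DocIds {
    int doc();
    boolean next();
    boolean skipTo(int target); // position on first doc >= target
}

class ArrayDocIds implements DocIds {
    private final int[] docs; // sorted, non-decreasing
    private int pos = -1;
    ArrayDocIds(int... docs) { this.docs = docs; }
    public int doc() { return docs[pos]; }
    public boolean next() { return ++pos < docs.length; }
    public boolean skipTo(int target) {
        while (next()) if (doc() >= target) return true;
        return false;
    }
}

public class ChainedAnd {
    // Iterator-based AND ("leapfrog"): works for any two DocIds implementations.
    // A ChainedFilter could first check
    //   if (a instanceof SomeBitSetImpl && b instanceof SomeBitSetImpl)
    // and do a word-wise AND on the underlying bits instead of iterating.
    static List<Integer> and(DocIds a, DocIds b) {
        List<Integer> out = new ArrayList<Integer>();
        if (!a.next() || !b.skipTo(a.doc())) return out;
        while (true) {
            if (a.doc() == b.doc()) {
                out.add(a.doc());
                if (!a.next() || !b.skipTo(a.doc())) return out;
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) return out;
            } else if (!b.skipTo(a.doc())) {
                return out;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(and(new ArrayDocIds(1, 3, 5, 7), new ArrayDocIds(3, 4, 5, 8))); // [3, 5]
    }
}
```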

          Paul Elschot added a comment -

          it looks like ChainedFilter could become obsolete if Filter/DocSetIdIterator gets added as a Clause to the BooleanQuery?

          The function is indeed the same, but ChainedFilter works directly on BitSets and BooleanQuery works on input Scorers/DocIdSetIterators and outputs collected docids (and score values). Working directly on (Open)BitSets is normally faster, so ChainedFilter can have a good use.
          And boolean operations on DocIdSets are not (yet) directly available in Lucene. The various boolean scorers have the logic, but currently only for Scorers.

          That leaves the question on what to do with ChainedFilter here. Any ideas?
          The easiest way is to open another issue for it. This will have to be resolved before Filter.bits() is removed.
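To illustrate why working directly on (Open)BitSet-style structures is normally faster: an AND over two such sets is just a loop of word-wise & operations, with no per-document iteration. This is a sketch over raw long[] words under that assumption, not OpenBitSet's actual code:

```java
public class WordwiseAnd {
    // AND two bit sets stored as arrays of 64-bit words; bit i of the set
    // lives in word i / 64. One & handles 64 candidate doc ids at once.
    static long[] and(long[] a, long[] b) {
        int n = Math.min(a.length, b.length);
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = a[i] & b[i];
        }
        return out;
    }

    // Number of doc ids in the set: sum the set bits of each word.
    static long cardinality(long[] words) {
        long sum = 0;
        for (long w : words) sum += Long.bitCount(w);
        return sum;
    }

    public static void main(String[] args) {
        long[] a = { 0b1011L }; // docs 0, 1, 3
        long[] b = { 0b0110L }; // docs 1, 2
        System.out.println(cardinality(and(a, b))); // only doc 1 survives -> 1
    }
}
```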

          Eks Dev added a comment -

          it looks like ChainedFilter could become obsolete if Filter/DocSetIdIterator gets added as a Clause to the BooleanQuery? I am thinking along the lines: ChainedFilter evaluates boolean expression of docId-s, that is exactly what BooleanQuery does plus "a bit" more (scoring)...

          Paul Elschot added a comment -

          I tried to move contrib from Filter.bits() to BitSetFilter.bits().

          The ContribQueries20080111.patch does that with contrib/queries,
          and with that applied the xml-query-parser tests still pass.
          I don't expect changes will be needed to xml-query-parser because it
          does not use Filter.bits().
          At the moment I don't know why I had problems with it half a year ago.

          For contrib/miscellaneous the changes needed to ChainedFilter are more involved:

          To make the tests pass, I had to make RangeFilter and QueryFilter subclasses of BitSetFilter, and to remove the final keyword from BitSetFilter.getDocIdSet().
          The alternative would be to add BitSet versions of RangeFilter and QueryFilter
          to ChainedFilterTest.
          So it can be made to work, but ChainedFilter and/or ChainedFilterTest will need to be changed.

          Stepping back a bit, I think ChainedFilter might better move to OpenBitSetFilter.
          No patch for contrib/miscellaneous; it is too ugly here at the moment.

          The conclusion is that I see no real problems with the take4 patch to move contrib
          from Filter.bits to BitSetFilter.bits.

          Paul Elschot added a comment -

          I'm sorry about my PrefixGenerator remarks, I did not read your answer accurately.

          On the take4 patch of 11 Jan 2008:

          I have started in a fresh trunk checkout that passed all tests.
          Both parts of take4 apply cleanly, using patch -p0 < ... .
          ant jar, ant test-core and ant test-contrib all pass nicely.

          I remember having problems with moving contrib/xml-queryparser from Filter
          to BitSetFilter, see my comment of 30 July 2007.
          So I'd like to verify that this can be done, and I hope Mark Harwood can give
          some hints as to how to do this.

          For me, this was the main reason to make this move:
          from Filter with subclass BitSetFilter (as in the take4 patch, and in my first attempts)
          to MatchFilter with subclass Filter (as in Matcher... patches of Sep and Nov 2007).
          In these Matcher... patches no changes were necessary to contrib/xml-queryparser.

          Less important for now:

          The test classes extend TestCase, but iirc there is also a LuceneTestCase for this.

          On the take4 patch ant javadocs-core gives this:
          BitSetFilter.java:40: warning - Tag @link: reference not found: DocIdBitSetIterator

          Michael Busch added a comment -

          BitSetFilter would inherit from Filter, and have an abstract bits() method, not deprecated.

          OK I added a BitSetFilter. The rest of the patch is identical to take3.

          I thought that I copied all your javadoc changes that still applied (after the removal of Matcher, MatchFilter, etc.) over into this patch. But well, if you think I missed any important ones let me know or feel free to update the patch!

          Michael Busch added a comment -

          As for PrefixGenerator:
          in my (up to date) trunk directory, this command: find . -name 'PrefixGenerator'
          only gave this result: ./build/classes/java/org/apache/lucene/search/PrefixGenerator.class
          and that disappeared after ant clean.
          It seems that the source class was removed from the trunk.

          As I said, PrefixGenerator is defined in PrefixFilter.java.

          Paul Elschot added a comment -

          As for PrefixGenerator:
          in my (up to date) trunk directory, this command: find . -name 'PrefixGenerator'
          only gave this result: ./build/classes/java/org/apache/lucene/search/PrefixGenerator.class
          and that disappeared after ant clean.
          It seems that the source class was removed from the trunk.

          I think that it should be straightforward for users having filters that use
          BitSets to wrap the new DocIdBitSet around the BitSet, just as Filter currently
          does for backwards compatibility?

          BitSetFilter would inherit from Filter, and have an abstract bits() method, not deprecated.
          This would be useful for people who don't want to move to OpenBitSet yet.
          A rename (and maybe a package change) from Filter to BitSetFilter should be sufficient
          in their code to get rid of the deprecation warning for Filter.bits().

          OpenBitSetFilter similar, and that could be used in a few places in the patch iirc.

          The javadoc changes I meant came with Matcher and use 'match' consistently for documents
          that are collected during a query search.
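Wrapping an existing java.util.BitSet this way is cheap, because BitSet.nextSetBit() already yields doc ids in the order a DocIdBitSet-style iterator needs. A self-contained sketch (class and method names are illustrative, not Lucene's exact API):

```java
import java.util.BitSet;

// Illustrative sketch (not Lucene's exact DocIdBitSet): an iterator
// over a java.util.BitSet using nextSetBit() to find each set bit.
public class BitSetIter {
    private final BitSet bits;
    private int docId = -1;

    BitSetIter(BitSet bits) { this.bits = bits; }

    int doc() { return docId; }

    boolean next() {
        docId = bits.nextSetBit(docId + 1);
        return docId >= 0;
    }

    boolean skipTo(int target) { // position on first set bit >= target
        docId = bits.nextSetBit(target);
        return docId >= 0;
    }

    public static void main(String[] args) {
        BitSet b = new BitSet();
        b.set(2); b.set(5); b.set(9);
        BitSetIter it = new BitSetIter(b);
        while (it.next()) System.out.println(it.doc()); // 2, 5, 9
    }
}
```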

          Michael Busch added a comment -

          On the take3 patch of 10 Jan 2008:

          Thanks for the review!

          PrefixGenerator is used but not defined in the patch, so it will not compile.

          Not sure I understand what you mean. PrefixGenerator is (and was)
          defined in PrefixFilter.java. It compiles for me.

          There is neither a BitSetFilter nor an OpenBitSetFilter in the patch.
          These might be useful for existing code currently implementing Filter
          to overcome the deprecation of Filter.bits().
          With the current core moving to OpenBitSet, it might also use an
          explicit OpenBitSetFilter.

          I think that it should be straightforward for users having filters that use
          BitSets to wrap the new DocIdBitSet around the BitSet, just as Filter currently
          does for backwards compatibility?

          Some javadoc changes did not make it into the take3 patch, I'll check these later.

          Oh, which ones?

          FilteredQuery.explain(): When a document does not pass the Filter
          I think it would be better not to use setValue(0.0f) on the resulting
          Explanation. However, this may be necessary for backward compatibility.

          Yeah, it used to work this way, that's why I didn't change it for backwards-
          compatibility reasons.

          About adding a Filter as a clause to BooleanScorer, and adding
          DocSetIdIterator as a "Scorer" to ConjunctionScorer:
          This is the reason for the CHECKME in IndexSearcher for using
          ConjunctionScorer when a filter is given.
          A ConjunctionScorer that accepts a DocIdSetIterator could also be used in
          FilteredQuery.

          Well, let's address this with a different issue after this one is committed.
          I might have some concerns here, but I have to think about it further.

          Paul Elschot added a comment -

          On the take3 patch of 10 Jan 2008:

          SortedVIntList extends DocIdSet: nice, thanks.

          PrefixGenerator is used but not defined in the patch, so it will not compile.

          Nevertheless, with all tests passing, I think this is a good way to
          make Filter independent of BitSet.

          Minor concerns:

          There is neither a BitSetFilter nor an OpenBitSetFilter in the patch.
          These might be useful for existing code currently implementing Filter
          to overcome the deprecation of Filter.bits().
          With the current core moving to OpenBitSet, it might also use an
          explicit OpenBitSetFilter.

          Some javadoc changes did not make it into the take3 patch, I'll check these later.

          FilteredQuery.explain(): When a document does not pass the Filter
          I think it would be better not to use setValue(0.0f) on the resulting
          Explanation. However, this may be necessary for backward compatibility.

          For the future:

          About adding a Filter as a clause to BooleanScorer, and adding
          DocSetIdIterator as a "Scorer" to ConjunctionScorer:
          This is the reason for the CHECKME in IndexSearcher for using
          ConjunctionScorer when a filter is given.
          A ConjunctionScorer that accepts a DocIdSetIterator could also be used in
          FilteredQuery.

          Michael Busch added a comment -

          OK, I created a new patch based on the recent feedback.

          This patch introduces two new abstract classes:

          /**
           * A DocIdSet contains a set of doc ids. Implementing classes must provide
           * a {@link DocIdSetIterator} to access the set. 
           */
          public abstract class DocIdSet {
            public abstract DocIdSetIterator iterator();
          }
          
          /**
           * This abstract class defines methods to iterate over a set of
           * non-decreasing doc ids.
           */
          public abstract class DocIdSetIterator {
            public abstract int doc();
            
            public abstract boolean next() throws IOException;
              
            public abstract boolean skipTo(int target) throws IOException;
          }
          

          Additional changes:

          • Scorer extends DocIdSetIterator now.
          • Filter.bits(IndexReader) is not abstract anymore. It is deprecated and
            returns null. Added method Filter.getDocIdSet(IndexReader) which returns
            new DocIdBitSet(bits(reader)) for backwards compatibility.
          • Added getDocIdSet(IndexReader) implementations to all core readers and
            deprecated bits(IndexReader). The Filters (e. g. RangeFilter) are now
            using OpenBitSet instead of BitSet.
          • Made according changes to the SpanFilter classes.
          • Fixed a bug in OpenBitSetIterator. It has an integer curDocId now, which
            doc() returns; before, doc() sometimes returned wrong values. I also added
            TestOpenBitSet (taken from the current Solr trunk).

          I haven't changed the contrib Filter implementations yet, they still use
          the old (deprecated) API. We should address this with a different issue.

          All core & contrib tests pass. This patch should also be fully backwards
          compatible.
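As a usage sketch of the two new abstract classes, here is a minimal DocIdSet over a sorted int array. IntArrayDocIdSet is a hypothetical example, not part of the patch, and the IOException declarations are omitted for brevity:

```java
// The patch's two abstract classes (IOException declarations omitted for
// brevity), plus a hypothetical minimal implementation over a sorted int array.
abstract class DocIdSet {
    public abstract DocIdSetIterator iterator();
}

abstract class DocIdSetIterator {
    public abstract int doc();
    public abstract boolean next();
    public abstract boolean skipTo(int target);
}

class IntArrayDocIdSet extends DocIdSet {
    private final int[] docs; // must be sorted, non-decreasing
    IntArrayDocIdSet(int... docs) { this.docs = docs; }

    public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
            private int pos = -1;
            public int doc() { return docs[pos]; }
            public boolean next() { return ++pos < docs.length; }
            public boolean skipTo(int target) {
                while (next()) if (doc() >= target) return true;
                return false;
            }
        };
    }
}

public class DocIdSetDemo {
    public static void main(String[] args) {
        DocIdSetIterator it = new IntArrayDocIdSet(4, 17, 42).iterator();
        if (it.skipTo(10)) System.out.println(it.doc()); // first doc >= 10: 17
    }
}
```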

          Paul Elschot added a comment -

          For the moment a DocId is an int, but that might change to long sooner than we think. So DocIdSet... would be a better name than IntegerSet..., and it's better to use an abstract superclass than an interface:

          abstract class DocIdSetIterator {
            boolean next();
            boolean skipTo(int next);
            int doc();
          }
          
          // and the rest is in the patch, except the superclass for Matcher:
          
          abstract class Matcher extends DocIdSetIterator {
            Explanation explain(int doc);
          }
          
          abstract class Scorer extends Matcher {
            float score();
            ...
          }
          

          Would this DocIdSetIterator be close enough to the IntegerSetIterator?

          Mark Harwood added a comment -

          For the data structures (bitset/openbitset/sorted VintList/) I would suggest one of these: IntSet, IntegerSet or IntegerSequence as names for the common interface.
          I did a quick Google for IntegerSet and you are in the number one spot, Paul http://www.google.com/search?hl=en&q=integerset+bitset

          // A cachable, immutable, sorted, threadsafe collection of ints.
          interface IntegerSet {
            IntegerSetIterator getIterator();
            int size(); // negative numbers could be used to represent estimates?
          }

          // A single-use thread-unsafe iterator.
          interface IntegerSetIterator {
            boolean next();
            boolean skipTo(int next);
            int currentValue();
          }

          If detailed explanations of hits are required these should really sit with the source not the result- i.e. with the Filters. They contain all the match criteria used to populate IntegerSets and can be thought of more generically as IntegerSetBuilder.

          // Contains criteria to create a set of matching documents. MUST implement
          // hashcode and equals based on this criteria to enable use as cache keys
          // for IntegerSets.
          interface IntegerSetBuilder extends Serializable {
            IntegerSet build(IndexReader reader);
            Explanation explain(int docNr);
          }

          A single CachingIntegerSetBuilder class would be able to take ANY IntegerSetBuilder as a source, cache ANY type of IntegerSet they produced and defer back to the original IntegerSetBuilder for a full and thorough explanation of a match even when the match occurred on a cached IntegerSet, if required.

          class CachingIntegerSetBuilder implements IntegerSetBuilder {
            private WeakHashMap perIndexReaderCache;

            public CachingIntegerSetBuilder(IntegerSetBuilder delegate) {....}

            .....
          }

          The reason for introducing IntegerSetBuilder as a more generic name than "Filter" is IntegerSets have uses outside of filtering e.g. to do category counts or clustering. In these use cases they don't actually perform anything to do with filtering. It may actually be better named DocIdSetBuilder given that it is tied to Lucene's IndexReader and therefore limited to producing sets of document ids.

          Paul Elschot added a comment -

          In case there is a better name than Matcher for a Scorer without a score() method (and maybe without an explain() method), I'm all ears. Names are important, and at this point they can still be changed very easily.

          For Matcher I'd rather have a method to estimate the number of matching docs than a size() method. This estimate would be useful in implementing conjunctions, as the Matchers with the lowest estimates could be used first. However, this is another issue.

          Mark Harwood added a comment -

          I'm getting lost as to which patches we're considering here. I was looking at lucene-584-take2 patch.

          MatcherProvider in the earlier patch does look like something that will help with caching.

          >>Would those be a good starting point?

          Overall I feel uncomfortable with a lot of the classnames. I think the use of "Matcher" says more about what you want to do with the class in this particular case rather than what it does generally. I have other uses in mind for these classes that are outside of filtering search results. For me, these classes can be thought of much more simply as utility classes in the same mould as the java Collections API. Fundamentally, they are efficient implementations of sets/lists of integers with support for iterators. The whole thing would be a lot cleaner if classes were named around this scheme.
          "MatcherProvider" for example is essentially a DocIdSet which creates forms of DocIdSetIterators (Matchers) and could also usefully have a size() method.

          Paul Elschot added a comment -

          Mark, in the latest Matcher....-2default.patch there is the org.apache.lucene.MatcherProvider interface with this javadoc:

          /** To be used in a cache to implement caching for a MatchFilter. */

          This interface has only one method:

          public Matcher getMatcher();

          There is also a cache for filters in the Matcher....3core.patch in the class CachingWrapperFilter .

          Would those be a good starting point?
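          Mark's caching concern maps onto this interface naturally: a cache can hold one MatcherProvider per reader, where the provider owns the expensive precomputed structure and hands out cheap single-use Matchers. A minimal sketch, with simplified signatures (no IndexReader, no IOException) and an Object key standing in for the reader; all names beyond Matcher and MatcherProvider are illustrative:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Simplified Matcher per the patch series (IOException omitted)
abstract class Matcher {
    public abstract boolean next();
    public abstract int doc();
}

/** To be used in a cache to implement caching for a MatchFilter. */
interface MatcherProvider {
    Matcher getMatcher(); // a fresh single-use Matcher on every call
}

// CachingWrapperFilter-style cache: the costly structure (a BitSet
// here, but it could just as well be a SortedVIntList) is built once
// per reader key; each search only pays for a lightweight Matcher.
class CachingMatcherSource {
    private final Map<Object, MatcherProvider> cache = new WeakHashMap<>();

    synchronized MatcherProvider providerFor(Object readerKey, BitSet computed) {
        return cache.computeIfAbsent(readerKey, k -> () -> new Matcher() {
            private int cur = -1;
            public boolean next() { cur = computed.nextSetBit(cur + 1); return cur >= 0; }
            public int doc() { return cur; }
        });
    }
}
```

          The WeakHashMap key mirrors what CachingWrapperFilter does with IndexReader keys, so cached entries can be collected once the reader is closed and no longer referenced.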

          Mark Harwood added a comment -

          To go back to post #1 on this topic:

          "Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint."

          Given the motivation to move to more memory-efficient structures, why is the only attempt at caching dedicated exclusively to caching the very structures we were trying to move away from?

          "I deprecated also CachingWrapperFilter and RemoteCachingWrapperFilter and added corresponding CachingBitSetFilter and RemoteCachingBitSetFilter"

          Does this suggest we are to have type-specific CachingXxxxxFilters and RemoteCachingXxxxxFilters created for every new filter type? Why not provide a single caching mechanism that works for all those other, new, more memory-efficient structures? I believe the reason this hasn't been done is due to the issue I highlighted earlier - the cachable artefacts (what I chose to call "DocIdSet" here: action_12518642 ) are not modelled in a way which promotes re-use. That's why we would end up needing specialised caching implementations for each type.

          If we are to move forward from the existing Lucene implementation it's important to note the change:

          • Filters currently produce, at great cost, BitSets. BitSets provide both a cachable data structure and a thread-safe, reusable means of iterating across the contents.
          • By replacing BitSets with Matchers this proposal has removed an important aspect of the existing design - the visibility (and therefore cachability) of these expensive-to-recreate data structures. Matchers are single-use, non-threadsafe objects and hide the data structure over which they iterate. With this change if I want to implement a caching mechanism in my application I need to know the Filter type and what sort of data structure it returns and get it from it directly:
            if(myFilter instanceof BitSetFilter) wrap specific data structure using CachingBitSetFilter
            else
            if(myFilter instanceof OpenBitSetFilter) wrap specific data structure using CachingXxxxxFilter
            else...

          ...looks like an Anti-pattern to me. Worse, this ties the choice of data structure to the type of Filter that produces it. Why can't my RangeFilter be free to create a SortedVIntList or a BitSet depending on the sparseness of matches for a particular set of criteria?

          I'm not saying "lets just stick with Bitsets", just consider caching more in the design. Post action_12518642 lays out how this could be modelled with the introduction of DocIdSet and DocIdSetIterator as separate responsibilities (whereas Matcher currently combines them both).
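          The DocIdSet/DocIdSetIterator split referred to above can be sketched as follows. This is illustrative only (simplified signatures, no IOException); the point is that the set is the cacheable, thread-safe artefact, each iterator is a disposable cursor, and any filter is free to pick whichever backing structure suits the sparseness of its matches:

```java
import java.util.BitSet;

// The cacheable, reusable artefact: can hand out any number of iterators.
interface DocIdSet {
    DocIdSetIterator iterator();
}

// The single-use, non-thread-safe cursor (the role "Matcher" plays today).
interface DocIdSetIterator {
    boolean next();  // advance; false when exhausted
    int doc();       // current doc id, valid after a successful next()
}

// One possible backing: a dense BitSet. A sparse filter could return
// a sorted-int-list-backed DocIdSet instead, and callers never care.
class BitSetDocIdSet implements DocIdSet {
    private final BitSet bits;
    BitSetDocIdSet(BitSet bits) { this.bits = bits; }
    public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
            private int cur = -1;
            public boolean next() { cur = bits.nextSetBit(cur + 1); return cur >= 0; }
            public int doc() { return cur; }
        };
    }
}
```

          A single caching filter then only needs to hold a DocIdSet, regardless of which concrete filter produced it, which removes the need for the instanceof ladder.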

          Cheers
          Mark

          Paul Elschot added a comment -

          A few remarks on the lucene-584-take2 patch:

          In the @deprecated javadoc at Filter.bits() a reference to BitSetFilter could be added.

          While Filter.bits() is still deprecated, one could also use the BitSet in IndexSearcher
          in case this turns out to be performance sensitive; see also my remark of 28 November.

          A few complete (test) classes are deprecated, it might be good to add the target release
          for removal there.

          For the rest this patch looks good to me. Did you also run ant test-contrib ?

          Michael Busch added a comment -

          OK, here's a new version of the patch.
          Changes:

          • Removed MatchFilter entirely.
          • Filter.bits(IndexReader) is now deprecated and not abstract anymore.
            Filter.bits(IndexReader) returns null, and Filter.matcher(IndexReader)
            returns the new BitSetMatcher. This makes it possible to create new Filters that
            only implement the new getMatcher(IndexReader) method. Existing filters
            on the other hand will still work together with the BitSetMatcher.
          • The new class BitSetFilter extends Filter and defines the abstract
            method bits(IndexReader), just as Filter did before this patch.
          • I deprecated also CachingWrapperFilter and RemoteCachingWrapperFilter
            and added corresponding CachingBitSetFilter and RemoteCachingBitSetFilter
            which do essentially the same but only accept BitSetFilters.

          All testcases pass, and I believe this should be fully backwards compatible.
          Also, all core classes that call Filter.bits() are deprecated themselves.

          In Lucene 3.0 we should make the following changes:

          • Remove Filter.bits() and define Filter.getMatcher() abstract.
          • Remove CachingWrapperFilter and RemoteCachingWrapperFilter
            (and corresponding testcases)
          • Add new matchers, like the OpenBitSetMatcher and SortedVIntMatcher
            and add the DefaultMatcher class. (these classes are all located in
            Paul's patch files)
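          In code, the backward-compatible bridge described above looks roughly like this. Heavily simplified: the real methods take an IndexReader and throw IOException, and EvenDocsFilter is a made-up example filter; Filter, BitSetFilter and BitSetMatcher are the names from the comment:

```java
import java.util.BitSet;

// Simplified Matcher per the patch series
abstract class Matcher {
    public abstract boolean next();
    public abstract int doc();
}

abstract class Filter {
    /** @deprecated implement matcher() instead; no longer abstract */
    public BitSet bits() { return null; }

    /** Bridges legacy bits()-only filters through a BitSetMatcher. */
    public Matcher matcher() { return new BitSetMatcher(bits()); }
}

// Restores the old abstract bits() contract for BitSet-based filters
abstract class BitSetFilter extends Filter {
    public abstract BitSet bits();
}

class BitSetMatcher extends Matcher {
    private final BitSet bits;
    private int cur = -1;
    BitSetMatcher(BitSet bits) { this.bits = bits; }
    public boolean next() { cur = bits.nextSetBit(cur + 1); return cur >= 0; }
    public int doc() { return cur; }
}

// A legacy-style filter: only implements bits(), yet matcher() works
class EvenDocsFilter extends BitSetFilter {
    public BitSet bits() {
        BitSet b = new BitSet();
        b.set(0); b.set(2); b.set(4);
        return b;
    }
}
```

          New-style filters override matcher() directly and never allocate a BitSet; old-style filters keep compiling and still work through the bridge.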
          Paul Elschot added a comment -

          I had not thought about deprecating yet, but it should work nicely.
          I suppose you want to add a class BitSetFilter (subclass of Filter) as the preferred alternative
          to the deprecated method?
          Initially Filter and BitSetFilter would be very similar, except that Filter.bits() would be deprecated.
          Later, after removal of Filter.bits(), Filter.getMatcher() would be declared abstract.

          I tried to do something pretty close to this for contrib/xml-query-parser, but I could not make that work,
          which is why I changed to adding a new superclass MatchFilter.
          Nevertheless, I think the deprecation above should work, but at the moment I can't foresee the consequences.

          Michael Busch added a comment -

          What about adding the getMatcher() method to Filter and
          deprecating bits(IndexReader)? Then when we release
          3.0 we can remove bits() and the only method in Filter
          will be getMatcher().

          Then this patch should be backwards compatible
          and we'd do the API change with the next major release.
          Any objections?

          Paul Elschot added a comment - - edited

          For example, OpenBitSetFilter would look like this:

          class OpenBitSetFilter  /* ... */ {
            OpenBitSet bits(reader) { ... }
            Matcher getMatcher(reader) { ... }
          }
          

          Since the only thing needed by an IndexSearcher would be the Matcher,
          MatchFilter is what Filter and OpenBitSetFilter have in common:
          the getMatcher() method.

          Michael Busch added a comment -

          Why do we actually need the new MatchFilter class at all?
          Filter is an abstract class, not an interface. So I think we could
          simply add the new method getMatcher() like you already did
          in your patch:

            /**
             * @return A Matcher constructed from the provided BitSet.
             * @see    DefaultMatcher#defaultMatcher(BitSet)
             */
            public Matcher getMatcher(IndexReader reader) throws IOException {
              return new BitSetMatcher(bits(reader));
            }
          

          This shouldn't break existing Filter implementations?
          Maybe I'm missing an apparent reason why we need the MatchFilter
          class?

          Paul Elschot added a comment -

          I tried implementing a Searchable, and indeed ran into compilation errors.
          So, backward compatibility is indeed not complete.

          Also, Searchable is an interface, so it should not be changed.
          In case there are other interfaces affected by the patch these should not be changed either.

          There are two ways out of this:

          Do a name change on MatcherFilter/Filter -> Filter/BitSetFilter.
          Changing the current Filter to BitSetFilter gives other problems with contrib packages.
          I tried this some time ago, see above, but I could not make it work.

          I'd prefer to add an interface (or abstract class?) like Searchable that uses MatchFilter for those implementers that want to take advantage of MatchFilter.
          I don't expect problems from leaving the Searchable interface available unchanged.
          Other interfaces that use Filter can be treated the same way, in case there are any.

          Michael Busch added a comment -

          The patch is backwards compatible,

          I think that custom Searcher or Searchable implementations won't compile anymore?
          Because the signature of some abstract methods changed, e. g. in Searchable:

          @@ -86,13 +86,14 @@
              * <p>Called by {@link Hits}.
              *
              * <p>Applications should usually call {@link Searcher#search(Query)} or
          -   * {@link Searcher#search(Query,Filter)} instead.
          +   * {@link Searcher#search(Query,MatchFilter)} instead.
              * @throws BooleanQuery.TooManyClauses
              */
          -  TopDocs search(Weight weight, Filter filter, int n) throws IOException;
          +  TopDocs search(Weight weight, MatchFilter filter, int n) throws IOException;
          
          Michael Busch added a comment -

          Yes you're right, I ran the tests w/ code coverage analysis enabled, and the
          BitSetMatcher is fully covered. Good!

          Paul Elschot added a comment -

          With the full patch applied, the following test cases use a BitSetMatcher:

          TestQueryParser
          TestComplexExplanations
          TestComplexExplanationsOfNonMatches
          TestConstantScoreRangeQuery
          TestDateFilter
          TestFilteredQuery
          TestMultiSearcherRanking
          TestPrefixFilter
          TestRangeFilter
          TestRemoteCachingWrapperFilter
          TestRemoteSearchable
          TestScorerPerf
          TestSimpleExplanations
          TestSimpleExplanationsOfNonMatches
          TestSort

          so I don't think it is necessary to provide separate test cases.

          Michael Busch added a comment -

          OK, here's a patch that compiles cleanly on current trunk and all tests
          pass. It includes:

          • all changes from Matcher-20071122-1ground.patch
          • util/BitSetMatcher.java from Matcher-20070905-2default.patch
          • Hits.java changes from Matcher-20070905-3core.patch
          • Filter#getMatcher() returns the BitSetMatcher

          Would you be up for providing testcases?

          As I said I haven't fully reviewed the patch, but I'm planning to do
          that soon. I can vouch that all tests pass after applying the patch.

          Paul Elschot added a comment -

          The patch is backwards compatible, except for current subclasses of Filter that already have a getMatcher method. The fact that no changes are needed to contrib confirms the compatibility.

          I have made no performance tests on BitSetMatcher for two reasons.
          The first reason is that OpenBitSet is actually faster than BitSet (have a look at the graph in the SomeMatchers.zip file attachment by Eks Dev), so it seems to be better to go in that direction.
          The second is that it is easy to do the skipping in IndexSearcher on a BitSet directly by using nextSetBit on the BitSet instead of skipTo on the BitSetMatcher. For this it would only be necessary to check whether the given MatchFilter is a Filter.
          Anyway, I prefer to see where the real performance bottlenecks are before optimizing for performance.
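          The direct-skipping shortcut mentioned here is small: on the underlying BitSet, skipTo is simply nextSetBit. A sketch of the equivalence (DirectSkip is an illustrative name, not part of any patch):

```java
import java.util.BitSet;

class DirectSkip {
    // Equivalent of Matcher.skipTo(target) on the backing BitSet:
    // the smallest set bit >= target, or -1 when exhausted. Calling
    // this directly avoids the per-call overhead of a BitSetMatcher.
    static int skipTo(BitSet bits, int target) {
        return bits.nextSetBit(target);
    }
}
```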

          DefaultMatcher should be in the ...2default... patch.
          The change in Hits to use MatchFilter should be in the ...3core.. patch.

          So far, I never tried to use these patches on their own, I have only split them for a better overview. Splitting the combined patches to iterate would need a different split, as you found out. It might even be necessary to split within a single class, but I'll gladly do that.

          Michael Busch added a comment -

          1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently.
          2. introduce MatchFilter as superclass of Filter and add a minimal DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher

          Paul, I like the iterative plan you suggested. I started reviewing the
          Matcher-20071122-1ground.patch. I have some questions:

          • Is the API fully backwards compatible?
          • Did you make performance tests to check whether BitSetMatcher is
            slower than using a bitset directly?
          • With just the mentioned patch applied I get compile errors,
            because the DefaultMatcher is missing. Could you provide a patch that
            also includes the BitSetMatcher and Filter#getMatcher() returns it?
            Also I believe the patch should modify Hits.java to use MatchFilter
            instead of Filter? And a unit test that tests the BitSetMatcher
            would be nice!
          Paul Elschot added a comment -

          Resolved a local conflict in the javadocs of HitCollector. This time with licence granted to ASF.

          Paul Elschot added a comment -

          Resolved a local conflict in the javadocs of HitCollector.

          Paul Elschot added a comment - - edited

          Attached once more, this time with licence granted to ASF.

          Paul Elschot added a comment -

          This Matcher-20071008-1ground.patch replaces the previous version because in between there was a conflict with the javadocs of Scorer for document ordering.

          In today's version, Scorer is unchanged, except for the superclass Matcher, and Matcher reuses the javadocs of Scorer as much as possible.

          Paul Elschot added a comment -

          As the current patch set is large, I've been pondering how to do this in a series of smaller patches that can each be applied by itself. This is possible in the following way:

          1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently.
          2. introduce MatchFilter as superclass of Filter and add a minimal DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher
          3. change the current Searcher/Searchable API to use MatchFilter instead of Filter.

          Step 1 can reasonably be done before a new release.
          After step 2 this issue might be closed, and all the rest could be treated as new issues.

          After that three (almost) independent paths can be followed:
          4. add more data structures to be used for filter caches.
          5. adapt CachingWrapperFilter to provide a Matcher from a cached datastructure, for example SortedVIntList or BitSet or OpenBitSet.
          6. further use of Matcher, mostly in BooleanScorer2.

          My question is: shall I go ahead and provide a patch for step 1?

          At the moment I'm refining BooleanScorer2 to use Matcher. This is for the case of multiple prohibited clauses, and also to allow the use of required and prohibited Matchers, so that filtering clauses can be added to BooleanQuery.

          Paul Elschot added a comment -

          The posted patch proposes to use this class to determine which documents should be filtered:

          public abstract class Matcher {
            public abstract boolean next() throws IOException;
            public abstract boolean skipTo(int target) throws IOException;
            public abstract int doc();
            // plus a few more methods
          }

          This class is then used as a superclass of org.apache.lucene.search.Scorer.
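          For context, here is a hedged sketch of how a searcher-style loop would consume such a Matcher as a filter: advance the scorer (itself a Matcher after this patch), then skip the filter to each scored doc and keep it only when both agree. IOException is omitted and every name except Matcher is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified Matcher per the proposal (IOException omitted)
abstract class Matcher {
    public abstract boolean next();
    public abstract boolean skipTo(int target);
    public abstract int doc();
}

// Toy Matcher over a sorted array of doc ids, for illustration only
class IntArrayMatcher extends Matcher {
    private final int[] docs;
    private int pos = -1;
    IntArrayMatcher(int... docs) { this.docs = docs; }
    public boolean next() { return ++pos < docs.length; }
    public boolean skipTo(int target) {
        if (pos < 0) pos = 0;
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length;
    }
    public int doc() { return docs[pos]; }
}

class FilteredLoop {
    // The classic filtered-search loop: keep scored docs the filter allows
    static List<Integer> collect(Matcher scorer, Matcher filter) {
        List<Integer> hits = new ArrayList<>();
        while (scorer.next()) {
            int d = scorer.doc();
            if (filter.skipTo(d) && filter.doc() == d) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

          Because skipTo only moves forward, the filter is traversed at most once per search, regardless of how sparse or dense it is.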

          robert engels added a comment -

          I don't think

          public interface AbstractBitSet

          is according to standards.

          It should just be

          public interface BitSet

          possibly

          public interface IBitSet

          if coming from the Windows world.

          Since it is in a different package, there is no collision with the standard BitSet class.

          Paul Elschot added a comment -

          This time (20070905), as indicated in the previous post, a set of patches that add MatchFilter as the new superclass of Filter. Backward compatibility is quite good, no changes at all are necessary in contrib.

          In the 1ground patch, the current core API is moved from Filter to MatchFilter. Since Filter is a subclass of MatchFilter, I do not expect backward compatibility issues with this, but it is a quite extensive API change.

          In the 2default patch, some support for MatchFilter caching was added in the classes DefaultMatcher and MatcherProvider. OpenBitSet and some support for it were added from Solr, but OpenBitSet is not used (yet). SortedVIntList is also added, and it is used for caching in CachingWrapperFilter as below.

          In the 3core patch, this caching support is used in CachingWrapperFilter. See also java-dev of yesterday and today for a fixed thread safety problem there.
          The remainder of the core code is also adapted to the use of Matcher in the 3core patch. ConstantScoreQuery is a nice example.

          I also added the Apache License to all new files.

          All tests pass with the patches applied, core and contrib.
          Quite a bit of javadoc is included, and the javadocs build with only one (unrelated) warning.

          These 3 patches modify 35 source code files, so please tread carefully. They were generated against revision 573048.
          I did some local svn mv, svn add, and svn rm, and I hope I got
          that right in the end. In case the patches do not apply cleanly, please holler.

          I will remove my previous patch set in a week or so.

          Paul Elschot added a comment -

          Another way to decouple from BitSet would be to introduce a new superclass of Filter that only has an abstract getMatcher() method, and to add an implementation of that method in the current Filter class.
          That would boil down to the current patch with two classes renamed:
          Filter -> new class with abstract getMatcher() method.
          BitSetFilter -> Filter.

          This would avoid all backward compatibility issues, except for the unlikely case in which a getMatcher() method is already implemented in an existing subclass of Filter.
          Also, to take advantage of the independence of BitSet in other implementations, only this new class would need to be used.
          The only disadvantage I can see is that Filter is not renamed to BitSetFilter, which it actually is. But that can be fixed by making the javadoc of Filter explicit about the use of BitSet.

          For the lucene core and some of the contrib, this would mean that it would move to this new superclass of Filter. Again, I don't expect backward compatibility issues there.

          Does anyone see any problems with this approach?
          If not, what name should this new superclass of Filter have? I'm thinking of MatchFilter; any other suggestions?

          Paul Elschot added a comment -

          This set of patches does indeed break backward compatibility with the current Filter class.
          That was done to show the ideal end situation, and to make sure that the patched code is indeed there.

          To get backward compatibility I'd prefer to temporarily copy the functionality from BitSetFilter into the Filter class, while still leaving BitSetFilter as it is:

          public abstract class Filter {
            /** @deprecated use class BitSetFilter instead */
            public abstract BitSet bits(IndexReader reader);

            /** This method will become abstract once the bits() method is removed from Filter. */
            public Matcher getMatcher(IndexReader reader) {
              return DefaultFilter.defaultFilter(bits(reader));
            }
          }
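          To make the delegation above concrete, here is a minimal, self-contained sketch of what a default BitSet-backed Matcher adapter could look like. The `SimpleMatcher` type and `BitSetMatcherSketch` name are illustrative only (the proposal's Matcher has more methods, such as skipTo() and explain()); `DefaultFilter.defaultFilter` from Paul's snippet is part of the patch, not shown here.

```java
import java.util.BitSet;

// Simplified stand-in for the proposed Matcher class (hypothetical subset).
abstract class SimpleMatcher {
    abstract boolean next();
    abstract int doc();
}

// A Matcher-like iterator driven by a java.util.BitSet, roughly what a
// default bits()-to-Matcher adapter could do.
final class BitSetMatcherSketch extends SimpleMatcher {
    private final BitSet bits;
    private int current = -1;

    BitSetMatcherSketch(BitSet bits) {
        this.bits = bits;
    }

    @Override
    boolean next() {
        // Advance to the next set bit after the current position.
        current = bits.nextSetBit(current + 1);
        return current >= 0;
    }

    @Override
    int doc() {
        return current;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        bits.set(9);
        SimpleMatcher m = new BitSetMatcherSketch(bits);
        StringBuilder sb = new StringBuilder();
        while (m.next()) {
            sb.append(m.doc()).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints "2 5 9"
    }
}
```

          A real Filter subclass would only need to override getMatcher() to return something smarter than this; existing subclasses that only implement bits() keep working through the delegating default.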

          The main difference between the current set of patches and the removed patches is indicated
          in my first comment of 25 July 2007 above.
          I still have the older versions of the patch lying around here, so if you need a particular one, just indicate the date and I'll repost or send it.

          Hoss Man added a comment -

          I, unfortunately, haven't had the time to read through everything in the latest patches, but catching up on my jira mail one of Paul's comments jumped out at me, so i wanted to make sure it's completely clear: this latest set of patches completely breaks backwards compatibility for any clients who have Filter subclasses, or methods that take a Filter as a param, since the Filter class now has an abstract getMatcher method and no longer has an abstract bits() method – presumably the expectation being that all client code should have a search/replace done from Filter=>BitSetFilter

          which begs the question: why not eliminate BitSetFilter and move its getMatcher impl to the Filter class? (if the concern is just that there be a "higher level" class in which both methods are abstract, why not insert a parent with some new name above the Filter class?)

          For the record: it really bothers me that the old attachments got deleted ... the inability to refresh my memory by looking at the older patches and compare them with the current patches is extremely frustrating

          Mark Harwood added a comment -

          OK, I appreciate caching may not be a top priority in this proposal, but I have live systems in production using XMLQueryParser which use the existing core facilities for caching. As it stands this proposal breaks that functionality (see "FIXME" in contrib's CachedFilterBuilder and my concerns over the use of a non-threadsafe Matcher in the core class CachingWrapperFilter).

          I am obviously concerned by this and keen to help shape a solution which preserves the existing capabilities while adding your new functionality. I'm not sure I share your view that support for caching can be treated as a separate issue to be dealt with at a later date. There are a large number of changes proposed in this patch, and if the design does not at least consider future caching issues now, I suspect much will have to be reworked later. The change I can envisage most clearly is expressed in my concern that the DocIdSet and DocIdSetIterator services I outlined are being combined in Matcher as it stands now, and these functions will have to be separated.

          Cheers
          Mark

          Paul Elschot added a comment -

          Mark,

          I think we are on the same line; it's just that I don't want to go that far now.
          Have another look at the title of this issue. It may be in your title bar, but otherwise
          it's quite a bit of scrolling, so I'll repeat it here: "Decouple Filter from BitSet".
          That is the main thing that this patch tries to do.

          And that also makes it a starting point for caching of different data structures for Filters.
          Caching of Filters is very much needed, but I'd rather see that as another issue.

          The DefaultMatcher class tries to do some compression by using a SortedVIntList when that is smaller than a BitSet, and that is about as far as I'd like to go now.
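          The space saving behind preferring a SortedVIntList can be illustrated with a small sketch. This is not the patch's actual SortedVIntList code, just the underlying idea: store the gaps between sorted doc ids as variable-length ints (7 data bits per byte, high bit as continuation), which for sparse filters is far smaller than a BitSet that needs one bit per document in the index.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of gap-based VInt encoding of a sorted doc-id list.
final class VIntGapSketch {
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int gap = doc - prev; // gaps are small when ids cluster, but any gap works
            prev = doc;
            while ((gap & ~0x7F) != 0) {        // more than 7 bits remaining
                out.write((gap & 0x7F) | 0x80); // low 7 bits, continuation bit set
                gap >>>= 7;
            }
            out.write(gap);                     // final byte, high bit clear
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 3 documents scattered across a ~1M-doc index:
        int[] docs = { 10, 70000, 990000 };
        byte[] packed = encode(docs);
        // A java.util.BitSet covering doc 990000 needs roughly 990000/8 ≈ 124 kB;
        // the encoded gaps fit in a handful of bytes.
        System.out.println(packed.length + " bytes"); // prints "7 bytes"
    }
}
```

          The break-even point between the two representations is what a DefaultMatcher-style heuristic would compare: roughly, a bit set costs maxDoc/8 bytes regardless of density, while the gap encoding costs a few bytes per set bit.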

          Proost,
          Paul Elschot

          Mark Harwood added a comment -

          Hi Paul,

          Not sure we've reached a common understanding here yet.

          You said "That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two. "
          Ok, thanks for the clarification. I still feel uncomfortable because the method getMatcher() is not abstracted to a common interface. This was the thinking behind my "getIterator" method on DocIdSet.

          I too made a mistake in my earlier comments. DocIdSetIterator does NOT have "probably one implementation". There would be an implementation for each different type of DocIdSet (BitSet/OpenBitSet/VIntList).

          You said "some Filters do not need a cache. For example: TermFilter". I'm not sure why that has been singled out as not worthy of caching. I have certain terms (e.g. gender:male) where the TermDocs is very large (50% of all docs in the index!), so multiple calls to TermDocs for term "gender:male" (if that is what you are suggesting) are highly undesirable. These are typically handled in the XMLQueryParser using syntax like this:
          <CachedFilter>
          <TermsFilter fieldName="gender">male</TermsFilter>
          </CachedFilter>

          You said: "CachingWrapperFilter could then become a cache for BitSetFilter. "
          This means that the only caching strategy is one based on bitsets. Does this not lose perhaps the main benefit of your whole proposal, namely the ability to have alternative space-efficient storage of sets of document ids, e.g. SortedVIntList?

          If this is undesirable (my guess is "yes"), then the proposal in my previous comment is a solution which allows for caching of any/all types of the new sets (OpenBitSet, BitSet, SortedVIntList, etc.). Regardless of my choice of class names or decisions over interfaces vs abstract classes, do you not at least agree on the need for 3 types of functionality:

          1) A factory for instantiating sets of document ids matching a particular set of criteria (which can be costly to call). While the factory is not expected to implement a caching strategy, it is expected to implement hashcode/equals simply to aid any caching services, which would need this help to identify previously instantiated sets that share the same criteria as any new requests (this service I identified as my "DocIdSetFactory"; TermsFilter/RangeFilter would be example implementations).
          2) An object representing an instantiated set of document ids which can be cached and can create iterators for use in separate threads (identified as my DocIdSet - example implementations being called something like BitSetDocSet, SortedVIntList)
          3) An iterator for a set of document ids (my DocIdSetIterator - example impls being called something like BitSetDocSetIterator SortedVIntListIterator)

          Each type of functionality can have different implementations so the functionality must be defined using an interface or abstract class.
          If we can agree this much as a set of responsibilities then we can begin to map these services onto something more concrete.

          Cheers
          Mark

          Paul Elschot added a comment -

          Mark,

          I said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)"
          That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two.

          A Matcher is intended to be used in a single thread, so I don't expect thread safety problems.

          The problem for the XML parser is that with this patch, the implementing data structure of a Filter becomes
          inaccessible from the Filter class, so it cannot be cached from there.
          That means that some cached data structure will have to be chosen, and one way to do
          that is by using class BitSetFilter from the patch. This has a bits() method just like the current Filter class.
          CachingWrapperFilter could then become a cache for BitSetFilter.

          There is indeed no caching of filters in this patch.
          The reason for that is that some Filters do not need a cache. For example:
          class TermFilter extends Filter {
            private Term term;

            TermFilter(Term t) {
              this.term = t;
            }

            Matcher getMatcher(IndexReader reader) throws IOException {
              return new TermMatcher(reader.termDocs(this.term));
            }
          }
          TermMatcher does not exist (yet), but it could be easily introduced by leaving all the
          scoring out of the current TermScorer.

          As for DocIdSet, as long as this provides a Matcher as an iterator, it can be used to
          implement a (caching) filter.

          I don't think this patch complicates the implementation of caching strategies.
          For example one could define:
          class CachableFilter extends Filter {
            // ... some methods to access the underlying data structure to be cached ...
          }

          or write a similar adapter for some subclass of Filter and then write a FilterCache that caches these.

          I did consider defining Matcher as an interface, but I preferred not to do that because
          of the default explain() method in the Matcher class of the patch.

          Mark Harwood added a comment -

          Some further thought on the roles/responsibilities of the various components:

          Given a blank sheet of paper (a luxury we may not have) the minimum requirements I would have could be met with the following:
          (note that use of the words "Matcher" and "Filter" etc have been removed because sets of doc IDs have applications outside of filtering/querying e.g. category counts)

          interface DocIdSetFactory {
            DocIdSet getDocIdSet(IndexReader reader);
          }

          This is more or less equivalent to the purpose of the existing "Filter" - different implementations define their own selection criteria and produce a set of matching doc ids, e.g. the equivalent of RangeFilter. Each implementation must implement "hashcode" and "equals" methods based on its criteria so the factory can be cached and reused (in the same way Query objects are expected to). The existing CachedFilterBuilder in the XMLQueryParser provides one example of a strategy for caching Filters using this facility.

          interface DocIdSet {
            DocIdSetIterator getIterator();
          }

          This interface defines an immutable, threadsafe (and therefore cachable) collection of doc IDs. Different implementations provide space-efficient alternatives for sparse or heavily populated sets e.g. BitSet, OpenBitSet, SortedVIntList. As an example caching strategy - the existing CachingWrapperFilter would cache these objects in a WeakHashMap keyed on IndexReader.

          interface DocIdSetIterator {
            boolean next();
            int getDoc();
            // ...etc
          }

          A thread-unsafe, single-use object (probably with only one implementation) that is used to iterate across any DocIdSet. Not cachable, and used by Scorers.
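          To make the proposed split concrete, here is a hedged, self-contained sketch of the three roles over a java.util.BitSet. The interface names follow Mark's proposal (with a "Sketch" suffix to mark them as illustrative, not the patch's code); the key property shown is that one cachable set can hand out many independent, thread-unsafe iterators.

```java
import java.util.BitSet;

// Illustrative versions of the proposed DocIdSet / DocIdSetIterator split.
interface DocIdSetSketch {
    DocIdSetIteratorSketch getIterator();
}

interface DocIdSetIteratorSketch {
    boolean next();
    int getDoc();
}

// An immutable-by-convention, cachable set of doc ids backed by a BitSet.
final class BitSetDocSet implements DocIdSetSketch {
    private final BitSet bits; // never mutated after construction

    BitSetDocSet(BitSet bits) {
        this.bits = bits;
    }

    @Override
    public DocIdSetIteratorSketch getIterator() {
        // Each call returns a fresh iterator with its own cursor, so several
        // threads can each iterate the same cached set concurrently.
        return new DocIdSetIteratorSketch() {
            private int current = -1;

            public boolean next() {
                current = bits.nextSetBit(current + 1);
                return current >= 0;
            }

            public int getDoc() {
                return current;
            }
        };
    }
}
```

          The equivalent DocIdSetFactory role (the criteria-holding object implementing hashcode/equals) would sit in front of this and build a BitSetDocSet, or a SortedVIntList-backed set, per IndexReader.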

          In the existing proposal it feels like DocIdSet and DocIdSetIterator are rolled into one in the form of the Matcher which complicates/prevents caching strategies.

          Cheers
          Mark

          Mark Harwood added a comment -

          Hi Paul,
          Many thanks for your responses.
          Sorry for the delay in communications - just got back from 2 weeks holiday and slowly picking my way through this patch.

          You said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)"

          Did you mean BitSetFilter.getMatcher()? BitSetMatcher has no getMatcher method.

          If so, doesn't my original thread safety issue still stand? CachingWrapperFilter is caching Matchers (not Filters, which are factories for Matchers).

          The existing approach of adding a <CachedFilter> tag around my XML-based query templates offers a major speed up in my applications and I don't see this supported in this patch currently which gives me some concern. This existing caching technique is based on the use of CachingWrapperFilter.

          The proposed framework seems to be missing a means of caching reusable, threadsafe Matchers in a type-independent fashion. One solution (which I think you may be suggesting with the "getMatcher" comment) is to cache Filter objects and use Filter.getMatcher(reader) as a factory method for thread-specific, single-use Matchers but this would suggest that any caching then becomes an implied responsibility/overhead of each Filter implementation. Not too great. CachingWrapperFilter is an example of a better design where the caching policy has been implemented in a single class and it can be used to decorate any Filter implementation (RangeFilter etc) with the required caching behaviour. Unfortunately with this proposed patch there is no way that any such single caching policy can work with any Filter because Matcher is not reusable/cachable. Time to remove any thread-specific state from Matcher?
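          The decorator arrangement described above can be sketched in a few lines. This is a hypothetical illustration, not CachingWrapperFilter itself: `SetSource` stands in for Filter, a plain `Object` stands in for IndexReader, and the point is that one generic decorator adds caching to any implementation while per-use iteration state stays out of the cached object.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Stand-in for a Filter-like factory whose result is expensive to compute.
interface SetSource {
    int[] docIds(Object reader);
}

// Generic caching decorator: wraps any SetSource, caches per reader.
// WeakHashMap lets entries be collected once a reader is closed/unreferenced.
final class CachingSetSource implements SetSource {
    private final SetSource wrapped;
    private final Map<Object, int[]> cache = new WeakHashMap<>();

    CachingSetSource(SetSource wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public synchronized int[] docIds(Object reader) {
        return cache.computeIfAbsent(reader, wrapped::docIds);
    }
}

// Test double that counts how often the expensive computation actually runs.
final class CountingSource implements SetSource {
    int calls = 0;

    @Override
    public int[] docIds(Object reader) {
        calls++;
        return new int[] { 1, 2, 3 }; // pretend this was expensive
    }
}

final class CachingDemo {
    public static void main(String[] args) {
        CountingSource raw = new CountingSource();
        SetSource cached = new CachingSetSource(raw);
        Object reader = new Object(); // stands in for an IndexReader
        cached.docIds(reader);
        cached.docIds(reader); // second call is served from the cache
        System.out.println(raw.calls); // prints 1
    }
}
```

          Caching a thread-specific Matcher in this position would break exactly as Mark describes; the decorator only works because the cached value (the set) is immutable and each caller derives its own iteration state from it.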

          Cheers
          Mark

          Paul Elschot added a comment -

          Some more remarks on the 20070730 patches.

          To recap, this introduces Matcher as a superclass of Scorer to take the role that BitSet currently has in Filter.

          The total number of java files changed/added by these patches is 47, so some extra care will be needed.
          The following issues are still pending:

          What approach should be taken for the API change to Filter (see above, 2 comments up)?

          I'd like to get all test cases to pass again. TestRemoteCachingWrapperFilter still does not pass, and
          I don't know why.

          For xml-query-parser in contrib I'd like to know in which direction to proceed (see 1 comment up).
          Does it make sense to try and get the TestQueryTemplateManager test to pass again?

          The ..default.. patch has taken OpenBitSet and friends from solr to have a default implementation.
          However, I have not checked whether there is unused code in there, so some trimming may still
          be appropriate.

          Once these issues have been resolved far enough, I would recommend to introduce this shortly after a release so there is some time to let things settle.

          Paul Elschot added a comment -

          Uploading the patches again, this time with the ASF license.

          Paul Elschot added a comment -

          Some 20070730 patches to contrib using BitSetFilter.
          The contrib-misc and contrib-queries patches are reasonably good,
          their tests pass and replacing Filter by BitSetFilter is right for them.

          However, I'm not happy with the contrib-xml patch to the xml-query parser.
          I had to cripple some of the code and to disable the TestQueryTemplateManager test.
          I don't know how to get around this, basically because I don't know whether
          it is a good idea at all to move the xml-query-parser to BitSetFilter. It might be
          better to move it to Filter.getMatcher() instead, but I have no idea how to do this.

          Paul Elschot added a comment -

          A different take in the patches of 20070730.

          In this version class Filter has only one method:
          public abstract Matcher getMatcher(IndexReader).

          Class BitSetFilter is added as a subclass of Filter, and it has the familiar
          public abstract BitSet bits(IndexReader),
          as well as a default implementation of the getMatcher() method.

          In the ..core.. patch, and in the ..contrib.. patches (to follow), most uses of Filter were simply replaced by BitSetFilter. This turned out to be an easy way of dealing
          with this API change in Filter.

          This change to Filter and its replacement by BitSetFilter could well be taking
          things too far for now, and I'd like to know whether other approaches
          are preferred.

          The ..default.. patch contains a default implementation of a Matcher from a BitSet, and it has OpenBitSet and friends from solr, as well as SortedVIntList as posted earlier.
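          [Editorial note] The hierarchy described in this comment can be sketched as follows. This is a minimal JDK-only illustration of the shape, not the patch code: Matcher is simplified to next()/doc(), and IndexReader is stood in for by Object.

```java
import java.util.BitSet;

// Sketch of the proposed hierarchy: Filter has only getMatcher(),
// and BitSetFilter adds the familiar bits() plus a default getMatcher().
abstract class Matcher {
    abstract boolean next();   // advance to the next matching doc
    abstract int doc();        // current doc id
}

abstract class Filter {
    abstract Matcher getMatcher(Object reader);
}

abstract class BitSetFilter extends Filter {
    abstract BitSet bits(Object reader);

    // Default getMatcher(): a searching iterator over the BitSet.
    Matcher getMatcher(Object reader) {
        final BitSet bitSet = bits(reader);
        return new Matcher() {
            private int current = -1;
            boolean next() {
                current = bitSet.nextSetBit(current + 1);
                return current >= 0;
            }
            int doc() { return current; }
        };
    }
}
```

          An existing bits()-based filter then migrates by changing its superclass to BitSetFilter and inheriting the default getMatcher().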

          Paul Elschot added a comment -

          Mark,

          An easy way to keep things like BooleanFilter working could be to
          introduce a subclass of Filter, say BitsFilter that adds a bits(IndexReader) method.
          This class should also implement getMatcher(), the default implementation could
          be used for that initially.
          Then BooleanFilter could simply be a subclass of BitsFilter, possibly without further
          modifications, although I would prefer to rename it to BooleanBitsFilter.

          That would only involve some deprecation warnings in BitsFilter for the period
          that Filter.bits() is deprecated.

          I would not even mind cooking this up as patch to contrib.

          Thoughts?

          Paul Elschot added a comment -

          Mark,

          The exhausted flag is only in the iterator/Matcher, not in the underlying set data structure. One can use as many iterators as necessary, for example one per thread, and then there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)

          You wrote: I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations.

          That can still be done, but one needs to get to the BitSets for example by caching them outside the Filters and constructing the resulting BitSetMatcher for the combined Filter on the fly.

          An alternative would be to have a BooleanQuery.add(Matcher, Occur), where the occurrence can only be required or prohibited. Then there is no need to construct any resulting filter because the boolean logic will be executed during the search. This might even be more efficient than combining the full BitSets ahead of the search.

          And with many large BitSets cached, memory savings from more compact implementations can also be helpful.
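          [Editorial note] The "cache the BitSets outside the Filters and combine on the fly" suggestion maps directly onto java.util.BitSet's destructive boolean operations; a sketch (the method name is hypothetical):

```java
import java.util.BitSet;

// Combining cached per-privilege sets on the fly: the cached inputs are
// never modified; the combination is built into a scratch copy that can
// then back a matcher for the combined filter.
class CombineSketch {
    // documents in 'required' AND NOT in 'prohibited'
    static BitSet requiredAndNot(BitSet required, BitSet prohibited) {
        BitSet result = (BitSet) required.clone(); // leave the cache intact
        result.andNot(prohibited);
        return result;
    }
}
```

          The same pattern works for and()/or() when the security sets must be intersected or unioned before the search.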

          Mark Harwood added a comment -

          Thanks for the reply, Paul.

          I saw BitSetMatcher etc and appreciate the motivation behind the design for alternative implementations. What concerns me with the Matcher API in general is that Matchers have non-threadsafe state (i.e. the current position required to support next()) and as such aren't safely cachable in the same way as BitSets. I see the searcher code uses the safer skipTo() rather than next() but there's still the "if(exhausted)" thread safety problem to worry about, which is why I raised points 1 and 4.

          Additionally, combining BitSets using Boolean logic is one method call whereas combining heterogeneous Matchers using Boolean logic requires iteration across them and therefore potentially many method calls (point 3). I haven't benchmarked this but I imagine it to be significantly slower?
          I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations.

          Cheers
          Mark

          Paul Elschot added a comment -

          I forgot to mention that boolean logic on Matchers is already present in BooleanScorer2.
          This is because each Scorer is a Matcher.

          Paul Elschot added a comment -

          Have a look at BitSetMatcher in the -default patch. It is constructed from a BitSet, and it has a method getMatcher() that returns a Matcher that acts as a searching iterator over the BitSet.

          So that is 1) to 4), at least potentially. A clone() method is currently not implemented iirc, but each call to getMatcher() will return a new iterator over the underlying BitSet. And when guaranteed non-modifiability is needed, a constructor can take a copy of the given document set, in whatever form.

          The point of Matcher is that it allows other implementations than BitSet, like OpenBitSet and SortedVIntList. Both have the properties that you are looking for. SortedVIntList can
          save a lot of memory when compared to (Open)BitSet, and OpenBitSet is somewhat faster than BitSet.

          I'd like to have a skip list version of SortedVIntList, too. This would be slightly larger than SortedVIntList, but more efficient on skipTo().

          But the first thing that is necessary is to have Filter independent from BitSet.

          The real pain with that is going to be the code that currently implements Filters
          outside the lucene code base, and a default implementation of a Matcher
          should be of help there, just as it is in the -core patch now.

          The default implementation will probably need to be improved from its current
          state, but that can be done later. For example, one could also use OpenBitSet
          in all cases, and even collect the filtered documents directly in that.

          Cheers,
          Paul Elschot
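          [Editorial note] The "each getMatcher() call returns a new iterator over the shared set" point can be made concrete with a JDK-only sketch (the class names here are hypothetical, not the patch's BitSetMatcher): the set is shared read-only, every iterator carries its own position, and skipTo() is just a forward-only nextSetBit search.

```java
import java.util.BitSet;

// Sketch: per-thread state lives only in the iterator, so multiple
// iterators over one shared set never interfere with each other.
class SharedSetSketch {
    private final BitSet bits;   // shared, read-only after construction

    SharedSetSketch(BitSet bits) { this.bits = (BitSet) bits.clone(); }

    class Iter {
        private int current = -1;            // per-iterator state only
        boolean next() { return skipTo(current + 1); }
        boolean skipTo(int target) {         // first set bit >= target
            current = bits.nextSetBit(Math.max(target, current + 1));
            return current >= 0;
        }
        int doc() { return current; }
    }

    Iter getMatcher() { return new Iter(); } // fresh iterator each call
}
```

          Two threads can each call getMatcher() and iterate independently, which is the proposed answer to caching Matchers versus caching the underlying sets.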

          Mark Harwood added a comment -

          Hi Paul,
          Not sure if I'm missing something but I think this patch may not work for scenarios other than the simple option of a single filter being used on a search.

          A Matcher does not have the same utility as a BitSet because using a BitSet you can:

          1) iterate across it using multiple threads.
          2) Clone it.
          3) Merge it quickly with other bitsets using Boolean logic .
          4) Use it more than once.

          I think these differences become important in the following scenarios :

          In CachingWrapperFilter I don't think you can cache Matchers instead of bitsets - because Matchers don't have features 1 and 4

          BooleanFilter and ChainedFilter in contrib don't work with Matchers because there is no support for 3)

          Is there something obvious I've missed?

          Cheers
          Mark

          Paul Elschot added a comment -

          The whole thing in three patches:

          The Matcher-ground patch is the groundwork, which should be very similar to the earlier groundwork patch.

          The Matcher-default patch provides a default implementation, the same as the one I posted earlier today. Among others, it includes OpenBitSet from solr in org.apache.lucene.util.

          The Matcher-core patch uses the default implementation inside the rest of the lucene core and test code. It replaces Filter.bits() methods with Filter.getMatcher() methods in the subclasses of Filter.

          All core tests pass with these patches applied, except the one below.
          I could not determine why this test fails, and the only
          reason I can think of now is that Matcher is not serializable.
          Could someone give me a clue on this?

          [junit] Testsuite: org.apache.lucene.search.TestRemoteCachingWrapperFilter
          [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.32 sec
          [junit]
          [junit] Testcase: testTermRemoteFilter(org.apache.lucene.search.TestRemoteCachingWrapperFilter): FAIL
          [junit] expected:<1> but was:<0>
          [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
          [junit] at org.apache.lucene.search.TestRemoteCachingWrapperFilter.search(TestRemoteCachingWrapperFilter.java:84)
          [junit] at org.apache.lucene.search.TestRemoteCachingWrapperFilter.testTermRemoteFilter(TestRemoteCachingWrapperFilter.java:109)
          [junit]
          [junit]
          [junit] Test org.apache.lucene.search.TestRemoteCachingWrapperFilter FAILED

          Finally, contrib may not even compile with the patches applied.
          I used a version of Filter with an abstract getMatcher() method for the Matcher-core patch,
          and I also used that to cut the explicit BitSet things from my contrib working copy.
          However, I don't want to provide a patch for contrib yet, it's too far from ready here,
          and I'd like some comments on how to go about that first.

          Paul Elschot added a comment -

          There is some code in contrib where a Filter is assumed to have BitSet available:

          contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java
          contrib/miscellaneous/src/java/org/apache/lucene/misc/ChainedFilter.java

          When Filter is going to move from BitSet to Matcher, these will have to be reimplemented.
          They basically use Filters to provide BitSets, but it seems to me that they also
          could use lists of BitSets, for example.

          Paul Elschot added a comment -

          This DefaultMatcher2007072.patch adds a default Matcher to be used in Filter instead of the BitSet. It contains static methods that create a default Matcher from a BitSet and from an OpenBitSet. The patch also adds OpenBitSet to org.apache.lucene.util; it was taken from a recent solr revision.

          In this way the deprecation of Filter.bits(IndexReader) can be done by replacing implementations of that method by Filter.getMatcher(IndexReader) and adding the above default Matcher in the return statement:
          return DefaultMatcher.defaultMatcher(bits);

          The idea is to have this hook available so that a sensible default Matcher is easily available, that can also be adapted to use better Matcher implementations when these become available.
          The current implementation uses a SortedVIntList when it is smaller than an (Open)BitSet.

          I have begun introducing the default matcher in my working copy of the core, but as expected, that turns out to be quite a bit of work.
          Before I continue with that, I'd like to have comments on this default matcher approach.
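          [Editorial note] The representation choice described here ("a SortedVIntList when it is smaller than an (Open)BitSet") can be illustrated with a JDK-only sketch. SortedVIntList is stood in for by a plain sorted int[], and the 32-bits-per-doc cost model is an assumption for illustration:

```java
import java.util.BitSet;

// Sketch of the default-matcher representation choice: a sparse result is
// stored as a sorted array of doc ids, a dense one keeps the BitSet.
class DefaultMatcherSketch {
    final Object docs;   // either int[] (sparse) or BitSet (dense)

    DefaultMatcherSketch(BitSet bits, int maxDoc) {
        int cardinality = bits.cardinality();
        // rough cost model: int[] costs 32 bits per doc, BitSet 1 bit per slot
        if (cardinality * 32 < maxDoc) {
            int[] ids = new int[cardinality];
            int j = 0;
            for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
                ids[j++] = i;   // collect the sparse doc ids in order
            }
            docs = ids;
        } else {
            docs = bits;        // dense: keep the bit set itself
        }
    }

    boolean isSparse() { return docs instanceof int[]; }
}
```

          The point of the hook is that the caller only asks for a Matcher; which representation backs it can change later without touching Filter implementations.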

          Paul Elschot added a comment -

          With 2.2 out, and LUCENE-730 out of the way, wouldn't this be a good moment for some progress with this issue?
          The patch still applies cleanly, and I'd like to start working on a skipping extension of SortedVIntList, much like the latest index format for document lists.

          Hoss Man added a comment -

          I'm still behind on following this issue, but Otis: if you are interested in moving forward with this, you might consider trying the changes I proposed in my "15/Mar/07 11:06 AM" Comment...

          https://issues.apache.org/jira/browse/LUCENE-584#action_12481263

          ...I think it would keep IndexSearcher a little cleaner, and make it easier for people to migrate existing Filter's gradually (without requiring extra work for people writing new "Matcher" style Filters from scratch)

          Otis Gospodnetic added a comment -

          Right. I was under the wrong impression that the Matcher also happens to avoid scoring. However, now that we've all looked at this patch (still applies cleanly and unit tests all pass), and nobody had any criticisms, I think we should commit it, say this Friday.

          As I'm in the performance squeezing mode, I'll go look at LUCENE-730, another one of Paul's great patches, and see if I can measure performance improvement there.

          Hoss Man added a comment -

          I'm a little behind on following this issue, but if i can attempt to sum up the recent discussion about performance...

          "Migrating towards a "Matcher" API may allow some types of Queries to be faster in situations where clients can use a MatchCollector instead of a HitCollector, but this won't be a silver bullet performance win for all Query classes – just those where some of the score calculation is (or can be) isolated to the score method (as opposed to skipTo or next)"

          I think it's important to remember the motivation of this issue wasn't to improve the speed performance of non-scoring searches, it was to decouple the concept of "Filtering" results away from needing to populate a (potentially large) BitSet when the logic necessary for Filtering can easily be expressed in terms of a doc iterator (aka: a Matcher) – opening up the possibility of memory performance improvements.

          A second benefit that has arisen as the issue evolved, has been the API generalization of the "Matcher" concept to be a super class of Scorer for simpler APIs moving forward.

          Paul Elschot added a comment -

          By fastest cache I meant the L1 cache of the processor. The size is normally in tens of kilobytes.
          An array lookup hitting that cache takes about as much time as a floating point addition.

          During a query search the use of, among others, the term frequencies, the proximity data, and the document weights normally causes L1 cache misses.

          I would expect that by not doing the score value computations, only the cache misses for document weights can be saved.

          Otis Gospodnetic added a comment -

          Ah, too bad.
          Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel boxes, Intel boxes with Linux on them actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and various floating point calculations, we'd see better performance, esp. on Niagara boxes.

          Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with.

          Paul Elschot added a comment -

          That could be improved in a DisjunctionMatcher.
          With a bit of bookkeeping DisjunctionSumScorer could also delay calling score() on the subscorers
          but the bookkeeping would affect performance for the normal case.

          For the usual queries the score() call will never have much of a performance impact.
          The reason for this is that TermScorer.score() is really very efficient, iirc it caches
          weighted tf() values for low term frequencies.
          All the rest is mostly additions, and occasionally a multiplication for a coordination factor.

          To determine which documents match the query, the index needs to be accessed,
          and that takes more time than score value computations because the complete index
          almost never fits in the fastest cache.
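
          The tf() caching idea mentioned above can be sketched as follows. This is a minimal illustration, not Lucene's actual TermScorer code; the class and field names (TfScoreCache, CACHE_SIZE, weightValue) are made up for the example, and the tf() shown is the default-Similarity-style square root:

          ```java
          // Sketch (assumption: illustrative names, not Lucene internals).
          // Precompute weight * tf(freq) for small frequencies so the common
          // case is an array lookup instead of a float computation.
          public class TfScoreCache {
              private static final int CACHE_SIZE = 32; // cache freqs < 32
              private final float[] scoreCache = new float[CACHE_SIZE];
              private final float weightValue;

              public TfScoreCache(float weightValue) {
                  this.weightValue = weightValue;
                  for (int f = 0; f < CACHE_SIZE; f++) {
                      scoreCache[f] = tf(f) * weightValue; // computed once
                  }
              }

              // Default-Similarity-style tf(): square root of the frequency.
              static float tf(int freq) {
                  return (float) Math.sqrt(freq);
              }

              public float score(int freq) {
                  // Cheap array lookup for low frequencies; computed otherwise.
                  return freq < CACHE_SIZE ? scoreCache[freq]
                                           : tf(freq) * weightValue;
              }
          }
          ```

          With such a cache in place, the per-document score cost for frequent low-tf terms reduces to one bounds check and one array read, which is why score() itself rarely dominates.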

          Marvin Humphrey added a comment -

          DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the matching scorers in the ScorerDocQueue during next(), in order to accumulate an aggregate score. The MatchCollector can't save you from that.

          Otis Gospodnetic added a comment -

          Ahhhh. I'll look at the patch again tomorrow and follow what you said. All this time I was under the impression that one of the points or at least side-effects of the Matcher was that scoring was skipped, which would be perfect where matches are ordered by anything other than relevance.

          Doron Cohen added a comment -

          > No Scorer, no BooleanScorer(2), no ConjunctionScorer...

          Thanks, I was reading "score" instead of "score()"...

          But there is a scorer in the process, it is used for next()-ing to matched docs. So most of the work - preparing to be able to compute the scores - was done already. The scorer doc queue is created and populated. Not calling score() is saving the (final) looping on the scorers for aggregating their scores, multiplying by coord factor, etc. I assume this is why only a small speed up is seen.

          Otis Gospodnetic added a comment -

          Doron: just to address your question from Apr/7 - I expect/hope to see an improvement in performance because of this difference:

          hc.collect(doc(), score());
          mc.collect(doc());

          the delta being the cost of the score() call that does the scoring. If I understand things correctly, that means that what grant described at the bottom of http://lucene.apache.org/java/docs/scoring.html will all be skipped. No Scorer, no BooleanScorer(2), no ConjunctionScorer...

          Doron Cohen added a comment -

          > > When you rerun, you may want to use my alg - to compare the two approaches in one run.
          > This is more dangerous though.

          Agree. I was trying to get rid of this by splitting each round into 3: - gc(), warm(), work() - where work() and warm() are the same, just that warm()'s stats are disregarded. Still, switching the order of "by match" and "by bits" yields different results.

          Sometimes we would like not to disregard GC - in particular if one approach is creating more (or more complex) garbage than another approach.

          Perhaps we should look at two measures: best & avg/sum (2nd ignoring first run, for hotspot).

          Mike Klaas added a comment -

          Instead of discarding the first run, the approach I usually take is to run 3-4 times and pick the minimum. You can then run several of these "sets" and average over the minimum of each. GC is still an issue, though. It is hard to get around when it is a mark&sweep collector (reference counting is much friendlier in this regard)
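
          The min-of-runs approach described here can be sketched as a small timing helper. This is an illustration only; the names (MinOfRuns, minNanos, avgOfMins) are made up for the example and are not part of any benchmark framework:

          ```java
          // Sketch of the min-of-runs timing approach (illustrative names).
          public class MinOfRuns {
              // Run the task several times and keep the minimum elapsed time:
              // the minimum is least polluted by GC pauses and hotspot warm-up.
              public static long minNanos(Runnable task, int runs) {
                  long best = Long.MAX_VALUE;
                  for (int i = 0; i < runs; i++) {
                      long t0 = System.nanoTime();
                      task.run();
                      best = Math.min(best, System.nanoTime() - t0);
                  }
                  return best;
              }

              // Average the minima of several independent "sets", as suggested.
              public static double avgOfMins(Runnable task, int sets,
                                             int runsPerSet) {
                  long total = 0;
                  for (int s = 0; s < sets; s++) {
                      total += minNanos(task, runsPerSet);
                  }
                  return (double) total / sets;
              }
          }
          ```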

          Yonik Seeley added a comment -

          > When you rerun, you may want to use my alg - to compare the two approaches in one run.

          This is more dangerous though. GC from one method's garbage can penalize the 2nd methods performance.
          Also, hotspot effects are hard to account for (if method1 and method2 use common methods, method2 will often execute faster than method1 because more optimization has been done on those common methods).

          The hotspot effect can be minimized by running the test multiple times in the same JVM instance and discarding the first runs, but it's not so easy for GC.

          Doron Cohen added a comment -

          ...right, your diff-txt had the Match tasks - I missed that - checked it, it is exactly what I did, so we're ok here.

          When you rerun, you may want to use my alg - to compare the two approaches in one run. You can run this by something like:
          ant run-task -Dtask.mem=256M -Dtask.alg=conf\matcher-vs-bitset.alg

          Also, to get cleaner results, add the line:
          ResetSystemSoft
          just in the beginning of the "search round" - this resets the (query) inputs and also calls GC.

          I tried like this twice, and got inconsistent results:

          When the bitset searches preceded the match searches:
          [java] Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] SrchBitsSamRdr_5000 - 10 5000 706.4 70.78 7,511,219 16,573,645
          [java] SrchMtchSamRdr_5000 - - - - 10 - - - 5000 - - 689.6 - - 72.50 - 8,223,005 - 11,926,323
          [java] SrchBitsNewRdr_500 - 10 500 152.5 32.80 14,360,618 16,962,356
          [java] SrchMtchNewRdr_500 - - - - - 10 - - - 500 - - 171.3 - - 29.19 - 15,150,797 - 17,395,712

          When the match searches preceded the bitset searches:
          [java] Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] SrchMtchSamRdr_5000 - 10 5000 763.5 65.49 9,563,243 17,128,244
          [java] SrchBitsSamRdr_5000 - - - - 10 - - - 5000 - - 729.3 - - 68.56 - 10,003,775 - 13,001,114
          [java] SrchMtchNewRdr_500 - 10 500 175.7 28.46 12,068,559 17,524,326
          [java] SrchBitsNewRdr_500 - - - - - 10 - - - 500 - - 183.7 - - 27.22 - 15,098,480 - 17,974,476

          My conclusion from this is that the speed-up, if exists, is minor, at least for the setup of this test.

          There are only 15 unique queries in this test - also printed in the log - are these the queries you would expect to save in?

          I didn't follow this issue very closely so I don't know where the saving is expected here. Both SearchTask and MatchTask now do nothing in collect, so no difference at the actual collect() call.

          Also, Scorer.score(HitCollector) and Matcher.match(MatchCollector) are very similar:
          public void score(HitCollector hc) throws IOException {
            while (next()) {
              hc.collect(doc(), score());
            }
          }

          public void match(MatchCollector mc) throws IOException {
            while (next()) {
              mc.collect(doc());
            }
          }
          Especially for the case that the collect() method is doing nothing, as in this test.

          I think there is a potential gain for large boolean OR queries, because score() would have to call next() on all TermScorers and collect/sum their scores, while match() could use skipTo(last+1) because any match encountered is a match and there is no need to sum the individual scores for the same doc by other scorers. However as far as I can tell, current match() implementation does not take advantage of this, but I may be overlooking something?
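
          The skipTo(last+1) saving described above can be sketched over a toy matcher. This is an assumption-laden illustration, not Lucene's Matcher API: ArrayMatcher, matchAll, and the array-backed setup are invented for the example; the point is only that a pure matcher may jump every sub-matcher past the last match instead of next()-ing through each doc to sum scores:

          ```java
          import java.util.ArrayList;
          import java.util.List;

          public class DisjunctionSketch {

              /** Minimal matcher over a sorted array of doc ids (illustrative only). */
              public static class ArrayMatcher {
                  private final int[] docs;
                  private int pos = -1;

                  public ArrayMatcher(int[] docs) { this.docs = docs; }

                  public int doc() { return docs[pos]; }

                  // Advance to the first doc >= target (stays put if already there).
                  public boolean skipTo(int target) {
                      if (pos < 0) pos = 0;
                      while (pos < docs.length && docs[pos] < target) pos++;
                      return pos < docs.length;
                  }
              }

              /**
               * Union of matched docs. Because no scores are summed, every
               * sub-matcher can jump with skipTo(last + 1) rather than
               * next()-ing through each doc - the saving described above.
               */
              public static int[] matchAll(ArrayMatcher[] subs) {
                  List<Integer> out = new ArrayList<>();
                  int last = -1;
                  while (true) {
                      int min = Integer.MAX_VALUE;
                      for (ArrayMatcher m : subs) {
                          if (m.skipTo(last + 1)) {
                              min = Math.min(min, m.doc());
                          }
                      }
                      if (min == Integer.MAX_VALUE) break; // all exhausted
                      out.add(min);
                      last = min;
                  }
                  return out.stream().mapToInt(Integer::intValue).toArray();
              }
          }
          ```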

          Otis Gospodnetic added a comment -

          Doron, thanks for jumping on this!

          1. I thought I'd see better performance with the Matcher because it skips scoring. While Paul's patch does make changes to the Filtering code, I'm more focused on HitCollector vs. MatchCollector performance here. Am I missing something here? If scoring is skipped, we should see at least some speed improvement, and your results show that.

          2. You said you did see MatchCollector was faster than HitCollector. Hmmm, weird, not in my 4 runs:

          Matcher:
          [java] SearchSameRdr_50000 - - - - - - - - 4 - - 50000 - - 1,064.7 - - 187.84 - 11,060,036 - 14,806,016
          HitCollector:
          [java] SearchSameRdr_50000 - - - - - - - - 4 - - 50000 - - 1,070.3 - - 186.86 - 10,500,146 - 13,821,952

          I'll try it again on a different computer. My previous runs were on a Mac with OSX.

          3. My bench-diff.txt did include Match tasks:

          $ grep Match bench-diff.txt | grep class
          public class SearchMatchTask extends MatchTask {
          public abstract class MatchTask extends ReadTask {

          ... but I didn't svn add them, so I produced the "diff" by simply cat-ing the new tasks to bench-diff.txt . So if you used my bench-diff.txt as a patch, it wouldn't have worked. Not a big deal, just clarifying.

          Doron Cohen added a comment -

          One line was cut out - here are the four lines again

          Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          SrchMtchSamRdr_5000 - 10 5000 642.2 77.85 12,331,866 16,408,576
          SrchBitsSamRdr_5000 - - - - 10 - - - 5000 - - 586.9 - - 85.20 - 9,515,875 - 12,009,472
          SrchMtchNewRdr_500 - 10 500 134.7 37.11 13,376,113 17,171,660
          SrchBitsNewRdr_500 - - - - - 10 - - - 500 - - 154.0 - - 32.47 - 15,351,395 - 17,522,688

          Doron Cohen added a comment -

          The benchmark does not search with filters. Is any speedup still expected? (why?)

          I applied the patch on current trunk and ran the benchmark - it shows that when all queries use the same reader, Match is faster, while when each query opens its own reader, bitset is faster. Is this an expected result?

          Operation           round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
          SrchMtchSamRdr_5000     -       10         5000        642.2       77.85    12,331,866     16,408,576
          SrchBitsSamRdr_5000 -   - -  -  10 -  -  - 5000 -  -   586.9 -  -  85.20 -   9,515,875 -   12,009,472
          SrchMtchNewRdr_500      -       10          500        134.7       37.11    13,376,113     17,171,660
          

          This test is using all Reuters documents and the searches rounds are repeated 10 times. The Match tasks were not included so I wrote them. The updated bench-diff.txt attached contains these task classes and the algorithm. (When you use this, note that once the index is created you can comment the first part - the "Populate" part - and then only rerun the querying part.)

          Otis Gospodnetic added a comment -

          Perhaps I did something wrong with the benchmark, but I didn't get any speed-up when using searcher.match(Query, MatchCollector) vs. searcher.search(Query, HitCollector).

          Here are the benchmark numbers (50000 queries with each), HitCollector first, MatchCollector second:

          HITCOLLECTOR:

          [java] ------------> Report Sum By (any) Name (11 about 41 out of 41)
          [java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] Rounds_4 0 10 10 1 808020 787.5 1,026.04 7,217,624 17,780,736
          [java] Populate - - - - - - - - - - - - 4 - - - 2003 - - 129.9 - - 61.67 - 9,938,986 - 13,821,952
          [java] CreateIndex - - - 4 1 4.4 0.91 3,937,522 10,916,864
          [java] MAddDocs_2000 - - - - - - - - - - 4 - - - 2000 - - 138.1 - - 57.92 - 9,368,584 - 13,821,952
          [java] Optimize - - - 4 1 1.4 2.83 9,938,218 13,821,952
          [java] CloseIndex - - - - - - - - - - - 4 - - - - 1 - - 2,000.0 - - 0.00 - 9,938,986 - 13,821,952
          [java] OpenReader - - - 4 1 24.0 0.17 9,957,592 13,821,952
          [java] SearchSameRdr_50000 - - - - - - - - 4 - - 50000 - - 1,070.3 - - 186.86 - 10,500,146 - 13,821,952
          [java] CloseReader - - - 4 1 4,000.0 0.00 9,059,756 13,821,952
          [java] WarmNewRdr_50 - - - - - - - - - - 4 - - 100000 - 16,237.7 - - 24.63 - 9,060,268 - 13,821,952
          [java] SrchNewRdr_50000 - - - 4 50000 265.9 752.02 10,800,006 13,821,952

          [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41)
          [java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] MAddDocs_2000 0 10 10 1 2000 94.6 21.15 7,844,112 10,407,936
          [java] MAddDocs_2000 - 1 100 10 - - 1 - - - 2000 - - 136.7 - - 14.63 - 8,968,144 - 11,309,056
          [java] MAddDocs_2000 2 10 100 1 2000 173.2 11.55 10,528,264 15,740,928
          [java] MAddDocs_2000 - 3 100 100 - - 1 - - - 2000 - - 188.7 - - 10.60 - 10,133,816 - 17,829,888

          MATCHCOLLECTOR:

          [java] ------------> Report Sum By (any) Name (11 about 41 out of 41)
          [java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] Rounds_4 0 10 10 1 808020 781.0 1,034.62 10,566,608 15,859,712
          [java] Populate - - - - - - - - - - - - 4 - - - 2003 - - 130.9 - - 61.23 - 10,963,452 - 14,806,016
          [java] CreateIndex - - - 4 1 33.9 0.12 3,616,570 11,020,288
          [java] MAddDocs_2000 - - - - - - - - - - 4 - - - 2000 - - 137.3 - - 58.29 - 10,445,568 - 14,806,016
          [java] Optimize - - - 4 1 1.4 2.82 10,979,398 14,806,016
          [java] CloseIndex - - - - - - - - - - - 4 - - - - 1 - - 2,000.0 - - 0.00 - 10,963,452 - 14,806,016
          [java] OpenReader - - - 4 1 22.0 0.18 10,982,058 14,806,016
          [java] SearchSameRdr_50000 - - - - - - - - 4 - - 50000 - - 1,064.7 - - 187.84 - 11,060,036 - 14,806,016
          [java] CloseReader - - - 4 1 4,000.0 0.00 10,353,206 14,806,016
          [java] WarmNewRdr_50 - - - - - - - - - - 4 - - 100000 - 16,419.0 - - 24.36 - 10,431,062 - 14,806,016
          [java] SrchNewRdr_50000 - - - 4 50000 263.0 760.34 11,912,358 14,806,016

          [java] ------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 41)
          [java] Operation round mrg buf runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] MAddDocs_2000 0 10 10 1 2000 92.2 21.69 7,844,112 10,407,936
          [java] MAddDocs_2000 - 1 100 10 - - 1 - - - 2000 - - 136.6 - - 14.64 - 7,720,352 - 10,407,936
          [java] MAddDocs_2000 2 10 100 1 2000 167.8 11.92 11,325,952 17,571,840
          [java] MAddDocs_2000 - 3 100 100 - - 1 - - - 2000 - - 199.3 - - 10.03 - 14,891,856 - 20,836,352

          This is what I did for the benchmark. I used Doron's handy conf/benchmark.
          I added a new .alg based on micro-standard.alg, here's the diff:

          $ diff conf/micro-standard.alg conf/matcher-micro-standard.alg
          60c60
          < { "SearchSameRdr" Search > : 50000
          ---
          > { "SearchSameRdr" SearchMatch > : 50000
          65c65
          < { "SrchNewRdr" Search > : 50000
          ---
          > { "SrchNewRdr" SearchMatch > : 50000

          Then I added 2 new Tasks for benchmarking the Matcher (searcher.search(Query, MatchCollector)) and modified the ReadTask to call searcher.search(Query, HitCollector) instead of the method to get Hits.

          I commented out all search results traversal and doc retrieval, as I didn't care to measure that.

          Yonik Seeley added a comment -

          > In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my
          > working copy. But I don't know yet how I would resolve that conflict

          Ah, that's a bit different. You mean there are cases that are non-trivial to fix where next() is called after false is returned?

          Paul Elschot added a comment -

          > We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed.

          I fully agree. The "exhausted" flag is not much more than a matter of taste.

          In case the speed advantage of removing this flag is preferred, I don't mind resolving the eventual conflict in my working copy.
          But I don't know yet how I would resolve that conflict.

          Yonik Seeley added a comment -

          > BitsMatcher could also work without the "exhausted" flag, but then an infinite loop
          > might occur when trying to continue after the first time next() or skipTo() returned false.
          > Continuing after false was returned in these cases is a bug, however an infinite loop
          > can be difficult to debug. I'd rather be on the safe side of that with the exhausted flag and wait for an actual
          > profile to show the performance problem.

          We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed.
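
          The defensive "exhausted" flag under discussion can be sketched as follows. This is an illustrative, simplified version, not the BitsMatcher from Paul's patch; the class shape and method names here are assumptions for the example:

          ```java
          import java.util.BitSet;

          // Sketch (assumption: simplified API, not the patch's actual code).
          public class BitsMatcher {
              private final BitSet bits;
              private int doc = -1;
              private boolean exhausted = false; // guards looping after false

              public BitsMatcher(BitSet bits) { this.bits = bits; }

              public int doc() { return doc; }

              public boolean next() {
                  if (exhausted) {
                      // Calling next() again is a caller bug, but returning
                      // false beats a hard-to-debug infinite loop.
                      return false;
                  }
                  doc = bits.nextSetBit(doc + 1);
                  if (doc < 0) {
                      exhausted = true;
                      return false;
                  }
                  return true;
              }
          }
          ```

          The trade-off debated above is exactly the `if (exhausted)` branch: removing it saves one check per next() call in inner-loop code, at the price of undefined behavior if a buggy scorer keeps iterating after false.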

          Paul Elschot added a comment -

          Otis:

          > However, I need Sort and TopFieldDocs, and I don't see a match method with those.
          > Is there a reason why such a match method is not in the patch?

          A bit silly perhaps, but what sort criterion would you like to have used when no score() value is available?

          I don't know the sorting code, but it might be possible to use a field value for sorting.
          In that case the sorting code for a Matcher would need to check whether the sort criterion does
          not imply the use of a score value.
          I personally have no use for sorting by field values, so that is why I never thought of combining this with a Matcher.

          Otis Gospodnetic added a comment -

          Paul:
          Applied the patch, it applied cleanly, ran ant test -> BUILD SUCCESSFUL

          I'm primarily interested in using this in order to get matches, but avoid scoring. From what I can tell, I'd just need to switch to using the new match(Query, MatchCollector) method in IndexSearcher. However, I need Sort and TopFieldDocs, and I don't see a match method with those. Is there a reason why such a match method is not in the patch?

          Paul Elschot added a comment -

          Hoss,

          >Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses
          > it to decide whether or not to iterate over the (non null) Matcher, or over the BitSet
          > from Filter.bits. is there any reason that logic can't be put in getMatcher, so that if
          > subclasses of Filter don't override the getMatcher method it will call bits and then
          > return a Matcher that iterates over the set Bits?

          Two reasons:

          • uncertainty over performance of a Matcher instead of a BitSet,
          • this way backward compatibility is very easily guaranteed.

          There is also LUCENE-730, which may interfere with the removal of BitSet,
          since it allows documents to be scored out of order. However, LUCENE-730
          should only be used at the top level of a query search and without a Filter.
          I cannot think of an actual case in which there might be interference, but
          I may not have looked into that deeply enough.

          > we could even change Filter.bits so it's no longer abstract ... it could have
          > an implementation that would call getMatcher, and iterate over all of the matched
          > docs setting bits on a BitSet that is then returned ... the class would still be
          > abstract, and the class javadocs would make it clear that subclasses must override
          > at least one of the methods...

          I must say that creating a BitSet from a Matcher never occurred to me.
          Anyway, when Filter.bits() is deprecated I have no preference about how
          it is actually removed.

          Hoss Man added a comment -

          It's been a while since i looked at this issue, but it's come up in discussion recently so i took another glance...

          Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses it to decide whether or not to iterate over the (non null) Matcher, or over the BitSet from Filter.bits. is there any reason that logic can't be put in getMatcher, so that if subclasses of Filter don't override the getMatcher method it will call bits and then return a Matcher that iterates over the set Bits?

          (this is the roll-out approach i advocated a while back when discussing this on email, except that at the time Matcher was referred to as SearchFilter: http://www.nabble.com/RE%3A-Filter-p2605271.html )

          Thinking about it now, we could even change Filter.bits so it's no longer abstract ... it could have an implementation that would call getMatcher, and iterate over all of the matched docs setting bits on a BitSet that is then returned ... the class would still be abstract, and the class javadocs would make it clear that subclasses must override at least one of the methods ... legacy Filters will work fine because they'll already have a bits method, and people writing new Filters will see that bits is deprecated, so they'll just write a getMatcher method and be done.
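          The default-implementation idea above can be sketched as follows. This is a hypothetical, simplified stand-in for the proposed API (the Matcher and Filter types here are minimal mock-ups, not the actual classes from the patch): the deprecated-style bits() drains the Matcher into a BitSet, so legacy callers keep working while new subclasses only override getMatcher.

```java
import java.util.BitSet;

// Simplified stand-in for the proposed Matcher contract (hypothetical).
interface Matcher {
    boolean next();   // advance to the next matching doc; false when exhausted
    int doc();        // current doc id, valid after next() returned true
}

abstract class Filter {
    /** New-style method: subclasses would override this one. */
    public abstract Matcher getMatcher(int maxDoc);

    /** Default for the deprecated method: build the BitSet by draining the Matcher. */
    public BitSet bits(int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        Matcher m = getMatcher(maxDoc);
        while (m.next()) {
            result.set(m.doc());
        }
        return result;
    }
}

// Tiny demo subclass: matches a fixed list of doc ids.
class ArrayFilter extends Filter {
    private final int[] docs;
    ArrayFilter(int... docs) { this.docs = docs; }
    public Matcher getMatcher(int maxDoc) {
        return new Matcher() {
            private int i = -1;
            public boolean next() { return ++i < docs.length; }
            public int doc() { return docs[i]; }
        };
    }
}
```

With this shape, a legacy caller of bits() sees no behavioral change, while the abstract class makes getMatcher the single method new subclasses implement.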

          This appears to be the same approach taken with Analyzer.tokenStream back in 1.4.3...

          http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Analyzer.html

          Paul Elschot added a comment -

          As 2.1 is out, here is a new patch to try and revive this.
          This replaces the previous Matcheryyyymmdd.patch of 2006.

          Paul Elschot added a comment -

          I have just resolved some minor local conflicts on the updated copyrights of four java files.
          Please holler when a fresh patch is needed.

          Paul Elschot added a comment -

          Corrected javadoc refs to use IndexInput and IndexOutput.

          Paul Elschot added a comment -

          I wrote:

          > One could add an abstract Scorer.explain() to catch these, or
          > provide a default implementation for Scorer.explain().

          by mistake. The good news is that the patch leaves
          the existing abstract Scorer.explain() method unaffected.

          Paul Elschot added a comment -

          In the inheritance from Matcher to Scorer there is an asymmetry
          in this patch.

          Matcher provides a default implementation for Matcher.explain()
          but Scorer does not, and this might lead to unexpected surprises
          for future Scorers when the current Matcher.explain() is used.
          One could add an abstract Scorer.explain() to catch these, or
          provide a default implementation for Scorer.explain().

          With matcher implementations quite a few other implementation
          decisions need to be taken.
          Also any place in the current code where a Scorer is used, but none
          of the Scorer.score() methods, is a candidate for a change from
          Scorer to Matcher.
          This will be mostly the current filtering implementations,
          but ConstantScoringQuery is another nice example.

          Regards,
          Paul Elschot

          Paul Elschot added a comment -

          > Do you, or someone else see now things to be done before commiting this?

          Yes. In the steps listed here:
          http://wiki.apache.org/jakarta-lucene/HowToContribute
          the next step is to be patient.
          Whether being patient is something that can be done
          is an open question...

          Paul Elschot.

          Eks Dev added a comment -

          Paul,
          What is next now? We did enough experiments on our app and are now sure that this patch causes no incompatibilities.
          We also tried to replace our filters with OpenBitSet and VInt matchers and results there are more than good, our app showed crazy 30% speed-up!!! Hard to identify where from exactly, but I suspect VInt matcher in case of not too dense BitVectors increased our Filter Cache utilization significantly.

          I would propose to commit this patch before we go further with something that would actually utilize Matcher. Just to avoid creating monster patch on patch ...

          This is ground work, and now using Matcher will be pure poetry. I see a lot of places that could see a better life by using Matchers: ConstantScoringQuery, PrefixFilter, ChainedFilter (becomes obsolete now)... actually replacing all uses of BitSet with OpenBitSet (or something a bit smarter like SortedIntList, VInt...).
          Then the question here: do we create a dependency on Solr from Lucene, or do we "migrate" OpenBitSet to Lucene (as this dependency already exists), or do we copy-paste and have two OpenBitSets, Yonik? As far as I am concerned, it makes no real difference.

          Do you, or someone else see now things to be done before commiting this?

          Eks Dev added a comment -

          Here are some Matcher implementations,

          • OpenBitsMatcher - the same as the code Paul wrote for BitsMatcher, with OpenBitSet used instead

          • DenseOpenBitsMatcher - using Solr's BitSetIterator (for skipTo() to work, one method in BitSetIterator should become public)

          Also attached is one simple test (just basic functionality) that also contains one dummy relative performance test.

          The perf. test simply iterates over different Matcher implementations and measures elapsed time (not including Matcher creation; a pure forward scan to the end) for different set bit densities.

          imho, this code is not sufficiently tested nor commented; it needs an hour or two.

          As expected, Yonik made this BitSetIterator really fast. What was a surprise for me was OpenBitSet nextSetBit() comparing badly to BitSet (or did I make some dummy mistake somewhere?)
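          The relative timing loop described above can be sketched roughly like this. This is illustrative only and not the attached test code; it fills a java.util.BitSet at a given density and times a pure forward scan with nextSetBit(), the same access pattern a BitSet-backed Matcher would use.

```java
import java.util.BitSet;
import java.util.Random;

// Rough sketch of a relative scan benchmark (not the attached test code).
class MatcherScanBench {

    // Build a BitSet of the given size with roughly the requested bit density.
    static BitSet randomBits(int size, double density, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(size);
        for (int i = 0; i < size; i++) {
            if (rnd.nextDouble() < density) bits.set(i);
        }
        return bits;
    }

    // Time one pure forward scan to the end, excluding construction cost.
    static long timeNextSetBitScan(BitSet bits) {
        long start = System.nanoTime();
        int count = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            count++;   // stand-in for collecting the hit
        }
        return System.nanoTime() - start;
    }
}
```

Running this at several densities (say 0.001 up to 0.5) would give the kind of relative numbers the comment refers to; absolute timings of course depend on JIT warm-up and hardware.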

          Eks Dev added a comment -

          Paul,
          What is exact semantics of skipTo(int) in Matcher?

          • is it OK to skip back and forth before I reach end?
            e.g.: skipTo(0); skipTo(333); skipTo(0);
          • once I reach end, skipTo(int) does nothing (BitsMatcher, exhausted). It is impossible to reposition Matcher after that

          Is this intended behavior, "skip forward until you reach end, and then, you are at the end " ?
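          The forward-only contract being asked about can be made concrete with a small sketch. The class below is a hypothetical matcher over a sorted int array, not code from the patch: skipTo() never moves backwards (a target before the current doc leaves the position unchanged), and once the end is reached the matcher stays exhausted.

```java
// Hypothetical sketch of the forward-only skipTo() contract under discussion.
class SortedIntMatcher {
    private final int[] docs;        // sorted doc ids
    private int i = -1;              // -1 = not yet positioned
    private boolean exhausted = false;

    SortedIntMatcher(int... sortedDocs) { this.docs = sortedDocs; }

    boolean next() {
        if (exhausted) return false;
        if (++i >= docs.length) { exhausted = true; return false; }
        return true;
    }

    /** Advance to the first doc >= target; never repositions backwards. */
    boolean skipTo(int target) {
        if (exhausted) return false;          // at the end: stays at the end
        while (i < 0 || docs[i] < target) {
            if (!next()) return false;
        }
        return true;
    }

    int doc() { return docs[i]; }
}
```

Under this reading, skipTo(0) after skipTo(333) is a no-op rather than a rewind, which is what makes single-pass conjunction scoring cheap.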

          Paul Elschot added a comment -

          > No performance changes as well.

          It's good to hear that. As mentioned earlier, this is groundwork only.
          Once an actual Matcher is used I expect some performance differences to show up.

          Which comment of Yonik related to HitCollector do you mean?

          > Early this week we will try to implement our first Matchers and see how they behave

          BitsMatcher and SortedVIntList could start that.
          Also I'd like to see one on Solr's OpenBitSet...

          Eks Dev added a comment -

          Hi Paul,
          for me, this patch did not cause any incompatibility issues. All our tests passed without noticing any difference to the previous trunk version. No performance changes either (we use HitCollector only, so Yonik's comment does not apply here).
          Tests are application level, and make the index hot (6hrs of searches with a test batch of requests with known responses), 50Mio non-artificial docs, real requests...

          Early this week we will try to implement our first Matchers and see how they behave

          Eks Dev added a comment -

          using the latest Matcher20060830.patch

          ant said "BUILD SUCCESSFUL"

          I will see how it works on some real life cases using our 50Mio collection and report back what our standard app level tests have to say (we have a standardized collection/requests/hits, so bad things should pop up quickly). Need a day or two for this.

          thanks for this work.
          e.

          Paul Elschot added a comment -

          Yonik, as to your questions:

          > It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right?

          Right. Only the previous Filter-20060628.patch contains some commented FIXME code to actually introduce a BitsMatcher in each case where a BitSet is used.

          >When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs
          > to be removed, or a check done in IndexSearcher.search() to disable skipping on the scorer if it's in use.

          Iirc the patch still supports the 1.4 BooleanScorer when a BitSet is returned by Filter. I'd have to have a look at the patched IndexSearcher to be sure though.
          A BitSet is randomly addressable, so it can work to filter the 1.4 BooleanScorer which can score documents out of order. This case can be deprecated completely by also deprecating the possibility to use the 1.4 boolean scorer, but that is not in the patch. The patch only deprecates the Filter.bits() method.

          > I wonder what the performance impact is... for a dense search with a dense bitset
          > filter, it looks like quite a bit of overhead is added (two calls in order to get the next
          > doc, use of nextSetBit() instead of get(), checking "exhausted" each time and
          > checking for -1 to set exhausted). I suppose one can always drop back to using
          > a HitCollector for special cases though.

          BitsMatcher could also work without the "exhausted" flag, but then an infinite loop
          might occur when trying to continue after the first time next() or skipTo() returned false.
          Continuing after false was returned in these cases is a bug, however an infinite loop
          can be difficult to debug. I'd rather be on the safe side of that with the exhausted flag and wait for an actual profile to show the performance problem.
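          The safety trade-off described above can be shown in a small sketch. This is a hypothetical BitSet-backed matcher in the spirit of the BitsMatcher being discussed (the real class in the patch may differ in detail): the "exhausted" flag costs one branch per call, but guarantees that a buggy caller which keeps calling next() after false gets false again instead of an infinite loop.

```java
import java.util.BitSet;

// Sketch of a BitSet-backed matcher with the "exhausted" safety flag.
class BitsMatcher {
    private final BitSet bits;
    private int doc = -1;
    private boolean exhausted = false;

    BitsMatcher(BitSet bits) { this.bits = bits; }

    boolean next() {
        if (exhausted) return false;                  // guard against misuse: no restart
        doc = bits.nextSetBit(doc + 1);
        if (doc < 0) { exhausted = true; return false; }
        return true;
    }

    boolean skipTo(int target) {
        if (exhausted) return false;
        doc = bits.nextSetBit(Math.max(doc + 1, target));
        if (doc < 0) { exhausted = true; return false; }
        return true;
    }

    int doc() { return doc; }
}
```

Without the flag, a call to next() after exhaustion would restart the scan at nextSetBit(0), which is exactly the silent misbehavior the flag prevents.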

          Regards,
          Paul Elschot

          Paul Elschot added a comment -

          Matcher20060830b.patch corrects 2 mistakes in Matcher20060830.patch:
          Searchable.java was present twice, and TestSortedVIntList was not present.
          Thanks eks for pointing out the mistake.

          This patch was generated from the trunk directory.

          Regards,
          Paul Elschot

          Yonik Seeley added a comment -

          Thanks Paul,
          I like the Matcher/Scorer relation.

          It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right?

          When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs to be removed, or a check done in IndexSearcher.search() to disable skipping on the scorer if it's in use.

          I wonder what the performance impact is... for a dense search with a dense bitset filter, it looks like quite a bit of overhead is added (two calls in order to get the next doc, use of nextSetBit() instead of get(), checking "exhausted" each time and checking for -1 to set exhausted). I suppose one can always drop back to using a HitCollector for special cases though.

          Paul Elschot added a comment -

          As requested on java-dev, Matcher20060830.patch is the whole thing as a single patch, relative to src/java/org/apache/lucene in the trunk, revision 438598 of 30 August 2006.

          This does not contain the FIXME code of the earlier posted Filter-20060628.patch.
          That FIXME code can be used to test that IndexSearcher works correctly with a BitsMatcher filter instead of with the current BitSet filter.

          Regards,
          Paul Elschot

          Paul Elschot added a comment -

          Perhaps the BitsMatcher class is better implemented as a static method in a new class MatcherUtils.createMatcher(BitSet).
          Similar methods could be added for OpenBitSet, SortedVIntList, int[] and whatever data structure comes around for implementing Filter.getMatcher(IndexReader).
          When Matcher is a superclass of Scorer, TermScorer already implements a Matcher for TermDocs.
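          The factory idea above could look roughly like this. Everything here is a hypothetical sketch (the Matcher interface is a stand-in, and MatcherUtils does not exist in the patch); it just shows the shape of a static createMatcher(BitSet) with room for sibling overloads.

```java
import java.util.BitSet;

// Stand-in for the proposed Matcher contract (hypothetical).
interface Matcher {
    boolean next();
    int doc();
}

// Hypothetical utility class in the spirit of the proposed MatcherUtils.
final class MatcherUtils {
    private MatcherUtils() {}

    /** Wraps a java.util.BitSet in a forward-only Matcher. */
    static Matcher createMatcher(final BitSet bits) {
        return new Matcher() {
            private int doc = -1;
            private boolean exhausted = false;
            public boolean next() {
                if (exhausted) return false;
                doc = bits.nextSetBit(doc + 1);
                if (doc < 0) exhausted = true;
                return !exhausted;
            }
            public int doc() { return doc; }
        };
    }
    // Similar overloads could accept OpenBitSet, SortedVIntList, int[], ...
}
```

Keeping these wrappers in one utility class would avoid a proliferation of small adapter classes, one per backing data structure.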

          Paul Elschot added a comment -

          Added some javadocs to BitsMatcher.
          Added Matcher constructor to SortedVIntList, and extended the test for this.

          Paul Elschot added a comment -

          Patches against trunk revision 417683, current.
          Compared to previous patches/files, there are only javadoc updates,
          and the javadocs of Searchable are also patched.

          Eks Dev added a comment -

          Any thoughts on adding OpenBitSet from Solr here?

          Paul Elschot added a comment -

          I've started to improve the javadocs of almost all code posted here, so it's probably not worthwhile to commit this as it is now.
          I don't expect changes to the java code in the short term.

          Regards,
          Paul Elschot

          Paul Elschot added a comment -

          I hope I got all the attachments right, please holler in case something does not patch or compile cleanly.

          Some questions/remarks:

          • When IndexSearcher gets a BitSet from a Filter, it will not use skipTo() on the Scorer
            of the Query being filtered.
            This still allows using the 1.4 BooleanScorer until Filter.getBits() is removed.
          • Is it OK not to add match() method(s) to Searcher/Searchable?
          • BitSetIterator of SOLR-15 could implement a Matcher, and perhaps be added to org.apache.lucene.util?
          • Matcher as superclass of Scorer opens the possibility to add a BooleanQuery.add(Filter) method.
            This also needs the addition of required Matchers to ConjunctionScorer and the addition of prohibited Matchers at ReqExclScorer/DisjunctionScorer.
            Doing this filtering in ConjunctionScorer/ReqExclScorer will probably reduce the number of method calls for filtering.
            Once such an addition is done to BooleanQuery, the filtering methods in IndexSearcher could be deprecated in favour of BooleanQuery.add(Filter).

          Regards,
          Paul Elschot

          Paul Elschot added a comment -

          LUCENE-328 is superseded by this issue.

          Paul Elschot added a comment -

          TestSortedVIntList.java, superseding the one in LUCENE-328, testing the Matcher provided by a SortedVIntList.

          Paul Elschot added a comment -

          SortedVIntList.java for org.apache.lucene.util superseding the one in LUCENE-328. Has a getMatcher() method.
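          [Editor's sketch] For background, the compactness of a SortedVIntList comes from storing the deltas between sorted doc ids as variable-length ints: 7 payload bits per byte, with the high bit set on continuation bytes. The class below is illustrative of that encoding idea only; it is not the attached SortedVIntList.

```java
import java.io.ByteArrayOutputStream;

// Illustrative VInt-style delta encoding of a sorted doc id list (assumption,
// not the attachment): small gaps cost one byte each, so sparse filters stay small.
public class VIntSketch {
  public static byte[] encode(int[] sortedDocs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int doc : sortedDocs) {
      int delta = doc - prev;               // gaps instead of absolute ids
      prev = doc;
      while ((delta & ~0x7F) != 0) {        // more than 7 bits remain
        out.write((delta & 0x7F) | 0x80);   // low 7 bits, continuation flag set
        delta >>>= 7;
      }
      out.write(delta);                     // final byte, high bit clear
    }
    return out.toByteArray();
  }
}
```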

          Paul Elschot added a comment -

          A Matcher constructed from a BitSet for org.apache.lucene.util.
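          [Editor's sketch] A BitSet-backed Matcher could look like the following; only the name BitsMatcher comes from the attached file, the body is an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Sketch of a Matcher backed by java.util.BitSet (assumed body, real name). */
public class BitsMatcher {
  private final BitSet bits;
  private int doc = -1;

  public BitsMatcher(BitSet bits) { this.bits = bits; }

  /** Advance to the next set bit; false when no more documents match. */
  public boolean next() {
    doc = bits.nextSetBit(doc + 1);
    return doc >= 0;
  }

  public int doc() { return doc; }

  /** Collect all matching doc ids, mirroring a match(MatchCollector) call. */
  public List<Integer> matchingDocs() {
    List<Integer> out = new ArrayList<>();
    while (next()) out.add(doc);
    return out;
  }
}
```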

          Paul Elschot added a comment -

          Patch to IndexSearcher.java to prefer getMatcher() over getBits() on Filter.
          Also add method IndexSearcher.match(Query, MatchCollector).
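          [Editor's sketch] The "prefer getMatcher() over getBits()" dispatch might look like this; the shapes of Filter and Matcher here are simplified assumptions for illustration, not the patched Lucene classes.

```java
import java.util.BitSet;

// Simplified random-access view of a Matcher, for this sketch only.
interface Matcher {
  boolean matches(int doc);
}

abstract class Filter {
  /** Old API; may be the only method an existing subclass implements. */
  public abstract BitSet getBits();
  /** New API; null stands in for "not overridden" during the transition. */
  public Matcher getMatcher() { return null; }
}

public class SearcherDispatchSketch {
  /** What a patched searcher would do: use getMatcher() when available,
   *  fall back to wrapping the deprecated getBits() otherwise. */
  public static Matcher matcherFor(Filter f) {
    Matcher m = f.getMatcher();
    if (m != null) return m;
    BitSet bits = f.getBits();   // fallback for old-style filters
    return bits::get;
  }
}
```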

          Paul Elschot added a comment -

          Patch to Filter to add getMatcher() and to deprecate getBits() in favour of getMatcher().
          Includes commented test code to test IndexSearcher using BitsMatcher.
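          [Editor's sketch] The Filter change described here can be pictured as follows: getBits() deprecated, getMatcher() added with a default that wraps the old BitSet so existing subclasses keep working, while a sparse filter can override getMatcher() and never allocate a BitSet. All class bodies below are simplified assumptions.

```java
import java.util.BitSet;
import java.util.Set;

// Simplified random-access view of a Matcher, for this sketch only.
interface Matcher {
  boolean matches(int doc);
}

abstract class Filter {
  /** @deprecated kept for old subclasses; superseded by getMatcher(). */
  @Deprecated
  public abstract BitSet getBits();

  /** Default implementation delegates to the deprecated getBits(). */
  public Matcher getMatcher() {
    BitSet bits = getBits();
    return bits::get;
  }
}

/** A sparse filter overriding getMatcher() avoids the BitSet entirely. */
class SparseFilter extends Filter {
  private final Set<Integer> docs;
  SparseFilter(Set<Integer> docs) { this.docs = docs; }

  @Override public BitSet getBits() {        // only for old callers
    BitSet bits = new BitSet();
    for (int d : docs) bits.set(d);
    return bits;
  }

  @Override public Matcher getMatcher() {    // no BitSet allocated
    return docs::contains;
  }
}
```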

          Paul Elschot added a comment -

          Patch to Scorer.java to make it a subclass of Matcher.

          Paul Elschot added a comment -

          Matcher.java, including a match(MatchCollector) method, for org.apache.lucene.search.

          Paul Elschot added a comment -

          MatchCollector.java with collect(int) method for org.apache.lucene.search.
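          [Editor's sketch] Together with the Matcher described in the previous comment, the push-style pair could look like this. The names Matcher, MatchCollector, match() and collect(int) are from the attachments; the bodies are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: real names from this issue, assumed bodies.
interface MatchCollector {
  void collect(int doc);
}

abstract class Matcher {
  public abstract boolean next();
  public abstract int doc();

  /** Default match(): drive the iterator and push each doc to the collector. */
  public void match(MatchCollector collector) {
    while (next()) collector.collect(doc());
  }
}

public class MatchSketch {
  /** Demo: wrap a doc id array in an anonymous Matcher and drain it. */
  public static List<Integer> drain(int... docs) {
    Matcher m = new Matcher() {
      private int i = -1;
      public boolean next() { return ++i < docs.length; }
      public int doc() { return docs[i]; }
    };
    List<Integer> out = new ArrayList<>();
    m.match(out::add);   // boxing adapts collect(int) to List.add
    return out;
  }
}
```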

          Paul Elschot added a comment -

          Javadocs of Searcher.java changed to use 'matching' instead of 'non-zero score',
          and to describe the Filter effect more accurately.
          This is independent of the Matcher/Scorer change.

          Paul Elschot added a comment -

          Javadocs of HitCollector.java changed to use 'matching' instead of 'non-zero score'.
          This is actually independent of the Matcher/Scorer change.

          Paul Elschot added a comment -

          As the title of this issue is as accurate as it gets, I'm attaching a series of patches and additions here that make Scorer a subclass of Matcher, while Matcher takes the current role of the BitSet in Filter.
          All patches against trunk revision 417299.

          Peter Schäfer added a comment -

          thanks, this looks interesting.

          Regards,
          Peter

          Eks Dev added a comment -

          Peter,

          there are some advanced things you are probably interested in.

          see:
          "some utilities for a compact sparse filter" LUCENE-328

          Also interesting:
          SOLR-15 OpenBitSet - ASF JIRA

          The complete Solr solution for Filters is one cool thing! The bridge to Lucene is a bit awkward due to the BitSet in Filter, but that is due to be resolved...


            People

            • Assignee: Mark Miller
            • Reporter: Peter Schäfer
            • Votes: 11
            • Watchers: 4
