  Lucene - Core / LUCENE-10311

Should DocIdSetBuilder have different implementations for point and terms?

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      DocIdSetBuilder exposes two APIs, one used by terms queries and one used by point values queries. In each case the builder is used in a totally different way.

      For terms the API looks like:

       

      /**
       * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
       * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
       * RoaringDocIdSet.Builder}.
       */
      void add(DocIdSetIterator iter) throws IOException;
      
      /** Build a {@link DocIdSet} from the accumulated doc IDs. */
      DocIdSet build() 
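
      As a rough sketch, the terms side is typically driven like this (a hypothetical helper for illustration; the reader/TermsEnum plumbing is assumed and simplified, only the DocIdSetBuilder calls follow the API above):

      // Minimal usage sketch of the terms-side API. "reader" and "termsEnum"
      // are assumed to come from the surrounding query code.
      static DocIdSet collectMatchingDocs(LeafReader reader, TermsEnum termsEnum) throws IOException {
        DocIdSetBuilder builder = new DocIdSetBuilder(reader.maxDoc());
        PostingsEnum postings = null;
        while (termsEnum.next() != null) {
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          builder.add(postings); // PostingsEnum is a DocIdSetIterator
        }
        return builder.build();
      }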
      
      

       

      For Point Values it looks like:

       

      /**
       * Utility class to efficiently add many docs in one go.
       *
       * @see DocIdSetBuilder#grow
       */
      public abstract static class BulkAdder {
        public abstract void add(int doc);
      
        public void add(DocIdSetIterator iterator) throws IOException {
          int docID;
          while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            add(docID);
          }
        }
      }
      
      
      /**
       * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
       * numDocs} documents.
       */
      public BulkAdder grow(int numDocs)
      
      /** Build a {@link DocIdSet} from the accumulated doc IDs. */
      DocIdSet build()
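
      For comparison, a rough sketch of how the point-values side is usually wired up through a PointValues.IntersectVisitor (the matches(...) predicate and the helper around the visitor are illustrative assumptions, not the exact Lucene code):

      // Sketch: a point query reserves space via grow() and feeds docs through
      // the BulkAdder. "matches" stands in for the query's per-value check.
      static PointValues.IntersectVisitor visitor(DocIdSetBuilder result) {
        return new PointValues.IntersectVisitor() {
          DocIdSetBuilder.BulkAdder adder;

          @Override
          public void grow(int count) {
            // count is the number of points about to be visited,
            // not the number of distinct docs that will actually be added
            adder = result.grow(count);
          }

          @Override
          public void visit(int docID) {
            adder.add(docID); // every point in this cell matches
          }

          @Override
          public void visit(int docID, byte[] packedValue) {
            if (matches(packedValue)) { // hypothetical per-value check
              adder.add(docID);
            }
          }

          @Override
          public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
            return PointValues.Relation.CELL_CROSSES_QUERY;
          }
        };
      }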
      
      

       

       

      This is becoming trappy for new developments in the PointValues API.

      1) When we call #grow() from the PointValues API, we are not telling the builder how many docs we are going to add (as we don't really know that) but how many points we are about to visit. This number can be bigger than Integer.MAX_VALUE. Until now we have gotten around this issue by making sure we don't call this API when we need to add more than Integer.MAX_VALUE points; in that case we navigate down the tree until the number of points is small enough to fit in an int.

      This has worked well until now because we call grow from inside the BKD reader, and the BKD writer/reader makes sure that the number of points in a leaf fits in an int. In LUCENE-, we are moving to a cursor-like API which does not enforce that the number of points on a leaf fits in an int. This causes friction and inconsistency in the API.
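
      To make the workaround concrete, here is a hypothetical sketch of the descend-until-it-fits pattern, written against the cursor-like PointValues.PointTree API for illustration (it is not the actual BKD reader code):

      // grow() only accepts an int, so we keep descending the tree until the
      // point count under the current node fits in an int before reserving space.
      static void addAll(PointValues.PointTree pointTree, PointValues.IntersectVisitor visitor)
          throws IOException {
        long pointCount = pointTree.size();
        if (pointCount > Integer.MAX_VALUE && pointTree.moveToChild()) {
          // too many points to express as an int: recurse into the children
          do {
            addAll(pointTree, visitor);
          } while (pointTree.moveToSibling());
          pointTree.moveToParent();
        } else {
          // today the BKD reader/writer guarantees this count fits in an int;
          // a cursor API without that guarantee is exactly where this gets trappy
          visitor.grow((int) pointCount);
          pointTree.visitDocIDs(visitor);
        }
      }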

       

      2) This is a secondary issue that I found while thinking about this one. In LUCENE- we added the possibility to add a `DocIdSetIterator` from the PointValues API. Therefore there are two ways to add that kind of object to a DocIdSetBuilder, and they can end up producing different results:

       

      {
        // Terms API
        docIdSetBuilder.add(docIdSetIterator); 
      }
      {
        // Point values API
        docIdSetBuilder.grow(doc).add(docIdSetIterator);
      }

       

      I wonder if we need to rethink this API: should we have different implementations for terms and point values?

       

       

       

       

          People

            Assignee: Unassigned
            Reporter: Ignacio Vera (ivera)
            Votes: 0
            Watchers: 2

            Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 9h 50m