  Lucene - Core / LUCENE-10311

Should DocIdSetBuilder have different implementations for point and terms?

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      DocIdSetBuilder exposes two APIs, one used by terms queries and one used by point values queries. In each case the builder is used in a totally different way.

      For terms the API looks like:

       

      /**
       * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
       * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
       * RoaringDocIdSet.Builder}.
       */
      void add(DocIdSetIterator iter) throws IOException;
      
      /** Build a {@link DocIdSet} from the accumulated doc IDs. */
      DocIdSet build() 
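
      As a rough sketch, the terms side is typically driven like this (a hypothetical helper for illustration; the reader/TermsEnum plumbing is assumed and simplified, only the DocIdSetBuilder calls follow the API above):

      // Minimal usage sketch of the terms-side API. "reader" and "termsEnum"
      // are assumed to come from the surrounding query code.
      static DocIdSet collectMatchingDocs(LeafReader reader, TermsEnum termsEnum) throws IOException {
        DocIdSetBuilder builder = new DocIdSetBuilder(reader.maxDoc());
        PostingsEnum postings = null;
        while (termsEnum.next() != null) {
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          builder.add(postings); // PostingsEnum is a DocIdSetIterator
        }
        return builder.build();
      }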
      
      

       

      For Point Values it looks like:

       

      /**
       * Utility class to efficiently add many docs in one go.
       *
       * @see DocIdSetBuilder#grow
       */
      public abstract static class BulkAdder {
        public abstract void add(int doc);
      
        public void add(DocIdSetIterator iterator) throws IOException {
          int docID;
          while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            add(docID);
          }
        }
      }
      
      
      /**
       * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
       * numDocs} documents.
       */
      public BulkAdder grow(int numDocs)
      
      /** Build a {@link DocIdSet} from the accumulated doc IDs. */
      DocIdSet build()
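
      For comparison, a rough sketch of how the point-values side is usually wired up through a PointValues.IntersectVisitor (the matches(...) predicate and the helper around the visitor are illustrative assumptions, not the exact Lucene code):

      // Sketch: a point query reserves space via grow() and feeds docs through
      // the BulkAdder. "matches" stands in for the query's per-value check.
      static PointValues.IntersectVisitor visitor(DocIdSetBuilder result) {
        return new PointValues.IntersectVisitor() {
          DocIdSetBuilder.BulkAdder adder;

          @Override
          public void grow(int count) {
            // count is the number of points about to be visited,
            // not the number of distinct docs that will actually be added
            adder = result.grow(count);
          }

          @Override
          public void visit(int docID) {
            adder.add(docID); // every point in this cell matches
          }

          @Override
          public void visit(int docID, byte[] packedValue) {
            if (matches(packedValue)) { // hypothetical per-value check
              adder.add(docID);
            }
          }

          @Override
          public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
            return PointValues.Relation.CELL_CROSSES_QUERY;
          }
        };
      }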
      
      

       

       

      This is becoming trappy for new developments in the PointValues API.

      1) When we call #grow() from the PointValues API, we are not telling the builder how many docs we are going to add (as we don't really know that) but how many points we are about to visit. This number can be bigger than Integer.MAX_VALUE. Until now we have gotten around this issue by making sure we don't call this API when we need to add more than Integer.MAX_VALUE points; in that case we navigate down the tree until the number of points is small enough to fit in an int.

      This has worked well until now because we call grow from inside the BKD reader, and the BKD writer/reader makes sure that the number of points in a leaf fits in an int. In LUCENE-, we are moving to a cursor-like API which does not enforce that the number of points on a leaf fits in an int. This causes friction and inconsistency in the API.
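
      To make the workaround concrete, here is a hypothetical sketch of the descend-until-it-fits pattern, written against the cursor-like PointValues.PointTree API for illustration (it is not the actual BKD reader code):

      // grow() only accepts an int, so we keep descending the tree until the
      // point count under the current node fits in an int before reserving space.
      static void addAll(PointValues.PointTree pointTree, PointValues.IntersectVisitor visitor)
          throws IOException {
        long pointCount = pointTree.size();
        if (pointCount > Integer.MAX_VALUE && pointTree.moveToChild()) {
          // too many points to express as an int: recurse into the children
          do {
            addAll(pointTree, visitor);
          } while (pointTree.moveToSibling());
          pointTree.moveToParent();
        } else {
          // today the BKD reader/writer guarantees this count fits in an int;
          // a cursor API without that guarantee is exactly where this gets trappy
          visitor.grow((int) pointCount);
          pointTree.visitDocIDs(visitor);
        }
      }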

       

      2) This is a secondary issue that I found while thinking about this one. In LUCENE- we added the possibility to add a `DocIdSetIterator` from the PointValues API. Therefore there are two ways to add that kind of object to a DocIdSetBuilder, and they can end up producing different results:

       

      {
        // Terms API
        docIdSetBuilder.add(docIdSetIterator); 
      }
      {
        // Point values API
        docIdSetBuilder.grow(doc).add(docIdSetIterator);
      }

       

      I wonder if we need to rethink this API: should we have different implementations for terms and point values?

       

       

       

       

          People

            Assignee: Unassigned
            Reporter: Ignacio Vera (ivera)
            Votes: 0
            Watchers: 2

            Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 9h 50m