Description
In LUCENE-8688, a new storage strategy was introduced for leaves that contain duplicate points. This works well with indexed dimensions, because both the process of partitioning the space and the final sorting of leaves group together points with equal indexed dimensions.
This is not always the case if the points contain data dimensions. If two points have the same indexed dimensions but different data dimensions, their distribution across the leaves may not be optimal.
A good example is a user indexing a bounding box using LatLonShape. The tessellation of a bounding box yields two triangles with the same indexed dimensions but different data dimensions. If two documents index the same bounding box, the leaf contains the triangles from the first document followed by the triangles from the second document. This is because the current sorting/selection algorithms use one indexed dimension and tie-break on the docID.
The optimal distribution in the case above is to group the equal triangles together. Therefore, the proposal is to update the selection/sorting algorithms to use the data dimensions, when they exist, as tie-breakers before using the docID.
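
To illustrate the proposed ordering, here is a minimal, self-contained sketch. It is not the actual BKD code: the class `DataDimTieBreakSketch`, the `Point` holder, and the parameters `numIndexDims` and `numTotalDims` are hypothetical names chosen for this example. It assumes each point is a packed byte array with the indexed dimensions first and the data-only dimensions after them, and it compares bytes as unsigned values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the tie-breaking order; not Lucene's implementation.
public class DataDimTieBreakSketch {

  // A point is a packed value (indexed dims first, then data-only dims) plus a docID.
  static final class Point {
    final String label; // only used to print the resulting order
    final byte[] packed;
    final int docID;

    Point(String label, byte[] packed, int docID) {
      this.label = label;
      this.packed = packed;
      this.docID = docID;
    }
  }

  // Unsigned comparison of a single dimension of the packed value.
  static int compareDim(Point a, Point b, int dim, int bytesPerDim) {
    int from = dim * bytesPerDim;
    int to = from + bytesPerDim;
    return Arrays.compareUnsigned(a.packed, from, to, b.packed, from, to);
  }

  // Ordering described as the current one: one indexed dimension, then docID.
  static Comparator<Point> docIdOnly(int dim, int bytesPerDim) {
    return (a, b) -> {
      int cmp = compareDim(a, b, dim, bytesPerDim);
      return cmp != 0 ? cmp : Integer.compare(a.docID, b.docID);
    };
  }

  // Proposed ordering: one indexed dimension, then the data-only bytes, then docID.
  static Comparator<Point> dataDimsThenDocId(
      int dim, int bytesPerDim, int numIndexDims, int numTotalDims) {
    int dataFrom = numIndexDims * bytesPerDim;
    int dataTo = numTotalDims * bytesPerDim;
    return (a, b) -> {
      int cmp = compareDim(a, b, dim, bytesPerDim);
      if (cmp == 0 && dataTo > dataFrom) {
        cmp = Arrays.compareUnsigned(a.packed, dataFrom, dataTo, b.packed, dataFrom, dataTo);
      }
      return cmp != 0 ? cmp : Integer.compare(a.docID, b.docID);
    };
  }

  // Two documents indexing the same box: triangles A and B share the indexed
  // byte (5) and differ only in the data byte (1 vs 2).
  static List<Point> sample() {
    List<Point> points = new ArrayList<>();
    for (int doc = 0; doc < 2; doc++) {
      points.add(new Point("A" + doc, new byte[] {5, 1}, doc));
      points.add(new Point("B" + doc, new byte[] {5, 2}, doc));
    }
    return points;
  }

  // Sorts a copy of the sample points and returns the labels in order.
  static String order(Comparator<Point> cmp) {
    List<Point> sorted = sample();
    sorted.sort(cmp); // List.sort is stable
    StringBuilder sb = new StringBuilder();
    for (Point p : sorted) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(p.label);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(order(docIdOnly(0, 1)));               // prints A0 B0 A1 B1
    System.out.println(order(dataDimsThenDocId(0, 1, 1, 2))); // prints A0 A1 B0 B1
  }
}
```

With the docID-only tie-break, the triangles are grouped per document; with the data dimensions consulted first, equal triangles end up adjacent, which is the grouping the duplicate-point leaf storage can benefit from.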