Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8888

Improve distribution of points with data dimension in BKD tree leaves

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 9.0, 8.2
    • None
    • None
    • New

    Description

      In LUCENE-8688 it was introduce a new storing strategy for leaves contains duplicated points. This works well with indexed dimension as the process of partition the space and the final sorting of leaves groups points with equal indexed dimensions.

      This is not the case all the time if the point contain data dimensions. It might happen that if two points have the same indexed dimensions but different data dimensions, the distribution on the leaves is not the most optimal.

      A good example is if a user tries to index a bounding box using LatLonShape. The resulting tessellation of a bounding box is two triangles with the same indexed dimensions but different data dimensions. If there are two documents indexing the same bounding box, the result in the leaf is the triangles from one document followed by the triangles of the second document. This is because the current sorting/selection algorithms use one indexed dimension and tie-break on the
      docID.

      The most optimal distribution in the case above is two group together the equal triangles. Therefore what it is propose here is to update the selection/ sorting algorithms to use the data dimensions when they exist as tie-breakers before using the docID.

      Attachments

        Issue Links

          Activity

            People

              ivera Ignacio Vera
              ivera Ignacio Vera
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 10m
                  2h 10m