  1. Lucene - Core
  2. LUCENE-9507

Custom order for leaves in DirectoryReader, IndexWriter and searcher

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.9
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Now that we're able to skip documents efficiently when sorting by a numeric field, I was wondering if we could optimize sorted queries further by also sorting the leaf readers based on the primary sort.

      For time-based indices in Elasticsearch, we've implemented an optimization that does this at query time. If the query is sorted by a numeric doc-value field, we sort the leaves according to the query sort before searching. When sorting by timestamp, this small optimization can have a big impact, since early termination is reached much sooner if the sort values in the segments don't overlap too much. Applying this optimization at query time is challenging, though: it has the benefit of working with any numeric sort field and order, but it requires a multi-reader that reorganizes the segments. It can also be counterintuitive that sorted queries may become slower after a force merge to one segment, since there are no longer any leaves to sort.
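      To make the effect concrete, here is a minimal, dependency-free sketch of the idea. It is not Lucene or Elasticsearch code: `Segment`, `newestFirst` and `segmentsScanned` are hypothetical names, and the skip rule (skip a leaf whose maximum value cannot beat the current k-th best hit) is a simplified model of early termination for a top-k query sorted by timestamp descending.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class LeafOrderSketch {
    /** Hypothetical stand-in for a segment: just the timestamps of its documents. */
    static final class Segment {
        final String name;
        final long[] timestamps;

        Segment(String name, long... timestamps) {
            this.name = name;
            this.timestamps = timestamps;
        }

        long maxTimestamp() {
            return Arrays.stream(timestamps).max().orElse(Long.MIN_VALUE);
        }
    }

    /** Returns a copy of the leaves ordered so that the newest data comes first. */
    static List<Segment> newestFirst(List<Segment> leaves) {
        List<Segment> sorted = new ArrayList<>(leaves);
        sorted.sort(Comparator.comparingLong(Segment::maxTimestamp).reversed());
        return sorted;
    }

    /** Counts the leaves that must be scanned to collect the top k timestamps, descending. */
    static int segmentsScanned(List<Segment> leaves, int k) {
        // Min-heap: the head is the weakest of the current top-k hits.
        PriorityQueue<Long> topK = new PriorityQueue<>();
        int scanned = 0;
        for (Segment leaf : leaves) {
            // Once k hits are collected, a leaf whose max cannot beat the weakest hit is skipped.
            if (topK.size() == k && leaf.maxTimestamp() <= topK.peek()) {
                continue;
            }
            scanned++;
            for (long ts : leaf.timestamps) {
                if (topK.size() < k) {
                    topK.add(ts);
                } else if (ts > topK.peek()) {
                    topK.poll();
                    topK.add(ts);
                }
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        List<Segment> indexOrder = Arrays.asList(
            new Segment("old", 1, 2, 3),
            new Segment("mid", 4, 5, 6),
            new Segment("recent", 7, 8, 9));
        System.out.println(segmentsScanned(indexOrder, 2));              // prints 3
        System.out.println(segmentsScanned(newestFirst(indexOrder), 2)); // prints 1
    }
}
```

      With non-overlapping segments, visiting the newest leaf first lets every other leaf be skipped; in index order, each leaf still contains competitive documents and has to be scanned.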

      So another option I'm looking at is to add the ability to provide a leaf order directly in the IndexWriter and DirectoryReader. That could be similar to an index sort, or even complementary to it, since sorting segments based on the index sort could also help at query time. For time-based indices that cannot afford index sorting but run many queries sorted on timestamp, forcing the order of segments could speed up sorted queries significantly.

      The advantage of forcing a single leaf sort in the writer/reader is that we can also use it to influence merges by putting the segments with the highest values first. That would help indices that are merged to a single segment but still want fast sorted queries, and it would also help the multi-segment case, since big segments would be more likely to have the highest values first too.
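      The merge case can be illustrated with a similarly hedged sketch. It assumes a merge simply concatenates the source segments in leaf order (a simplification), and it counts how many documents of the merged segment enter the top-k queue for a descending sort. `MergeOrderSketch`, `highestFirst`, `merge` and `competitiveHits` are hypothetical names, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class MergeOrderSketch {
    /** Orders source segments so that the one holding the highest value comes first. */
    static List<long[]> highestFirst(List<long[]> segments) {
        List<long[]> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingLong(
            (long[] s) -> Arrays.stream(s).max().orElse(Long.MIN_VALUE)).reversed());
        return sorted;
    }

    /** Simplified merge: concatenates the segments' documents in the given order. */
    static long[] merge(List<long[]> segments) {
        return segments.stream().flatMapToLong(Arrays::stream).toArray();
    }

    /** Counts documents that enter the top-k queue when collecting in doc order, descending sort. */
    static int competitiveHits(long[] docs, int k) {
        PriorityQueue<Long> topK = new PriorityQueue<>(); // head = weakest current hit
        int hits = 0;
        for (long value : docs) {
            if (topK.size() < k) {
                topK.add(value);
                hits++;
            } else if (value > topK.peek()) {
                topK.poll();
                topK.add(value);
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<long[]> segments = Arrays.asList(
            new long[] {1, 2, 3}, new long[] {4, 5, 6}, new long[] {7, 8, 9});
        System.out.println(competitiveHits(merge(segments), 2));               // prints 9
        System.out.println(competitiveHits(merge(highestFirst(segments)), 2)); // prints 3
    }
}
```

      In this model, the merged segment that starts with the highest values reaches its final top-k after the first few documents, so everything after them is non-competitive and becomes a candidate for skipping.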

          People

            Assignee: Unassigned
            Reporter: Jim Ferenczi

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 5h 50m