Description
Now that we're able to skip documents efficiently when sorting by a numeric field, I was wondering if we could optimize sorted queries further by also sorting the leaf readers based on the primary sort.
For time-based indices in Elasticsearch, we've implemented an optimization that does this at query time. If the query is sorted by a numeric doc values field, we sort the leaves according to the query sort before searching. When sorting by timestamp, this small optimization can have a big impact, since early termination is reached much faster if the sort values in the segments don't overlap too much. Applying this optimization at query time is a trade-off: it has the benefit of working with any numeric sort field and order, but it requires a multi-reader that reorganizes the segments. It can also be surprising that, after a force merge to a single segment, sorted queries may become slower, since there are no longer any leaves to sort.
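To make the effect concrete, here is a minimal sketch in plain Java of why visiting leaves in sort order helps a descending top-k query. It is not Lucene or Elasticsearch code: `Leaf`, `sortForDescending` and `leavesCollected` are hypothetical names, each leaf is modeled as an array of values with a precomputed min and max, and the skip check is simplified (it ignores tie-breaking on doc ID).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public final class LeafSortSketch {

    /** Hypothetical stand-in for a segment: the sort field values plus their min and max. */
    public static final class Leaf {
        public final String name;
        public final long[] values;
        public final long min;
        public final long max;

        public Leaf(String name, long... values) {
            this.name = name;
            this.values = values.clone();
            this.min = Arrays.stream(values).min().getAsLong();
            this.max = Arrays.stream(values).max().getAsLong();
        }
    }

    /** Returns a copy of the leaves ordered by max value, highest first. */
    public static List<Leaf> sortForDescending(List<Leaf> leaves) {
        List<Leaf> sorted = new ArrayList<>(leaves);
        sorted.sort(Comparator.comparingLong((Leaf l) -> l.max).reversed());
        return sorted;
    }

    /**
     * Simulates a descending top-k search and counts the leaves that had to be
     * collected. A leaf is skipped once the queue is full and the leaf's max
     * cannot beat the weakest value currently in the top k.
     */
    public static int leavesCollected(List<Leaf> leaves, int k) {
        PriorityQueue<Long> topK = new PriorityQueue<>(); // min-heap: head is the weakest hit
        int collected = 0;
        for (Leaf leaf : leaves) {
            if (topK.size() == k && leaf.max <= topK.peek()) {
                continue; // no value in this leaf is competitive
            }
            collected++;
            for (long v : leaf.values) {
                if (topK.size() < k) {
                    topK.add(v);
                } else if (v > topK.peek()) {
                    topK.poll();
                    topK.add(v);
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Leaf> arrivalOrder = Arrays.asList(
            new Leaf("old", 1L, 2L, 3L),
            new Leaf("mid", 4L, 5L, 6L),
            new Leaf("new", 7L, 8L, 9L));
        System.out.println(leavesCollected(arrivalOrder, 2));
        System.out.println(leavesCollected(sortForDescending(arrivalOrder), 2));
    }
}
```

With this toy data, the arrival order collects all three leaves, while the sorted order collects only the first one, because the other two are skipped on their max value alone. The benefit shrinks as the value ranges of the leaves overlap more.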
So another option I'm looking at is adding the ability to provide a leaf order directly in the IndexWriter and DirectoryReader. That could be similar to an index sort, or even complementary to it, since sorting segments based on the index sort could also help at query time. For time-based indices that cannot afford index sorting but run lots of queries sorted on timestamp, forcing the order of segments could speed up sorted queries significantly.
The advantage of forcing a single leaf sort in the writer/reader is that we can also use it to influence merges by putting the segments with the highest values first. That would help indices that are merged down to a single segment but still want fast sorted queries. It would also help the multi-segment case, since big segments would be more likely to have their highest values first too.
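The merge idea can be illustrated with another small plain-Java sketch. It assumes, purely for illustration, that a merge concatenates the documents of its input segments in the order the segments are supplied; `MergeOrderSketch` and `mergeHighestFirst` are hypothetical names, not Lucene APIs, and each segment is modeled as an array of sort field values in doc ID order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public final class MergeOrderSketch {

    /** Largest sort value in a (non-empty) segment. */
    static long max(long[] segment) {
        return Arrays.stream(segment).max().getAsLong();
    }

    /**
     * Orders the input segments by max value, highest first, then concatenates
     * them. Documents inside each segment keep their original relative order,
     * so this is not an index sort; it only controls which segment comes first.
     */
    public static long[] mergeHighestFirst(List<long[]> segments) {
        List<long[]> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong((long[] s) -> max(s)).reversed());
        return ordered.stream()
            .flatMapToLong(s -> Arrays.stream(s))
            .toArray();
    }

    public static void main(String[] args) {
        long[] merged = mergeHighestFirst(Arrays.asList(
            new long[] {1, 2, 3},
            new long[] {7, 8, 9},
            new long[] {4, 5, 6}));
        System.out.println(Arrays.toString(merged));
    }
}
```

In the merged result, the documents with the highest values occupy the lowest doc IDs, so a descending sorted query over the single merged segment meets its most competitive documents first, much as it would when visiting separate leaves in sorted order.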