[LUCENE-9507] Custom order for leaves in DirectoryReader, IndexWriter and searcher - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.9
Component/s: None
Labels:
None

Lucene Fields:

New

Description

Now that we're able to skip documents efficiently when sorting by a numeric field, I was wondering if we could optimize sorted queries further by also sorting the leaf readers based on the primary sort.

For time-based indices in Elasticsearch, we've implemented an optimization that does this at query time. If the query is sorted by a numeric doc-values field, we sort the leaves according to the query sort before searching. When sorting by timestamp, this small optimization can have a big impact, since early termination is reached much faster if the sort values in the segments don't overlap too much. Applying this optimization at query time is challenging, though: it has the benefit of working with any numeric sort field and order, but it requires a multi-reader that reorganizes the segments. It can also be surprising that sorted queries may become slower after a force merge to 1 segment, since there is nothing left to sort.
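To illustrate why the leaf order matters for a sorted query, here is a minimal sketch in plain Java. It does not use the Lucene or Elasticsearch APIs: `Leaf`, `LeafSortSketch`, `sortForDescendingQuery` and `leavesVisited` are hypothetical names, and each segment is modeled as just an array of timestamps. It assumes a top-k query sorted by timestamp descending, where a segment can be skipped once its maximum value cannot beat the current k-th best hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a segment (leaf): only the timestamps it holds.
final class Leaf {
    final String name;
    final long[] timestamps;

    Leaf(String name, long... timestamps) {
        this.name = name;
        this.timestamps = timestamps;
    }

    long max() {
        return Arrays.stream(timestamps).max().orElse(Long.MIN_VALUE);
    }
}

final class LeafSortSketch {
    // Return a copy of the leaves ordered by maximum timestamp, highest first,
    // which matches a query sorted by timestamp descending.
    static List<Leaf> sortForDescendingQuery(List<Leaf> leaves) {
        List<Leaf> sorted = new ArrayList<>(leaves);
        sorted.sort((a, b) -> Long.compare(b.max(), a.max()));
        return sorted;
    }

    // Count how many leaves have to be scanned to collect the k highest timestamps.
    // A leaf is skipped when the top-k queue is full and the leaf's maximum
    // cannot beat the current k-th best value.
    static int leavesVisited(List<Leaf> leaves, int k) {
        PriorityQueue<Long> topK = new PriorityQueue<>(); // min-heap: head is the k-th best
        int visited = 0;
        for (Leaf leaf : leaves) {
            if (topK.size() == k && leaf.max() <= topK.peek()) {
                continue; // no document in this leaf can be competitive
            }
            visited++;
            for (long ts : leaf.timestamps) {
                if (topK.size() < k) {
                    topK.add(ts);
                } else if (ts > topK.peek()) {
                    topK.poll();
                    topK.add(ts);
                }
            }
        }
        return visited;
    }
}
```

With three non-overlapping segments written in increasing time order, a top-2 query scans all three in index order, but only one when the leaves are sorted first. The more the segments' value ranges overlap, the smaller the gain.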

So, another option that I'm looking at is to add the ability to provide a leaf order directly in the IndexWriter and DirectoryReader. That could be similar to an index sort, or even complementary to it, since sorting segments based on the index sort could also help at query time. For time-based indices that cannot afford index sorting but have lots of sorted queries on timestamp, forcing the order of segments could speed up sorted queries significantly.

The advantage of forcing a single leaf sort in the writer/reader is that we can also use it to influence merges by putting the segments with the highest values first. That would help indices that are merged to a single segment but still need fast sorted queries, and it would also help the multi-segment case, since big segments would be more likely to have the highest values first too.
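The effect on merges can be sketched the same way. This is again a plain-Java model rather than Lucene's merge code: `MergeOrderSketch`, `mergeHighestFirst` and `lastTopKDoc` are hypothetical names, and a merge is assumed to simply concatenate the documents of the source segments in leaf order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeOrderSketch {
    static long max(long[] segment) {
        return Arrays.stream(segment).max().orElse(Long.MIN_VALUE);
    }

    // Model of a merge: concatenate the segments, highest maximum first,
    // keeping the document order inside each segment unchanged.
    static long[] mergeHighestFirst(List<long[]> segments) {
        List<long[]> ordered = new ArrayList<>(segments);
        ordered.sort((a, b) -> Long.compare(max(b), max(a)));
        return ordered.stream().flatMapToLong(s -> Arrays.stream(s)).toArray();
    }

    // Doc ID (0-based) of the last document whose value is among the k highest.
    // Assumes 1 <= k <= merged.length.
    static int lastTopKDoc(long[] merged, int k) {
        long[] sorted = merged.clone();
        Arrays.sort(sorted);
        long threshold = sorted[sorted.length - k]; // k-th highest value
        int last = -1;
        for (int doc = 0; doc < merged.length; doc++) {
            if (merged[doc] >= threshold) {
                last = doc;
            }
        }
        return last;
    }
}
```

For segments holding timestamps 1-3, 4-6 and 7-9, merging in index order leaves the two highest values at the very end of the merged segment (last competitive doc ID 8), while merging highest-first places them within the first three documents (last competitive doc ID 2).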

Issue Links

links to

GitHub Pull Request #32

GitHub Pull Request #2256

GitHub Pull Request #2473

People

Assignee: Unassigned

Reporter: Jim Ferenczi

Votes: 1

Watchers: 6

Dates

Created: 04/Sep/20 13:07

Updated: 28/Aug/22 16:07

Resolved: 27/May/21 21:04

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

5h 50m