  Solr / SOLR-14608

Faster sorting for the /export handler


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 9.0
    • Fix Version/s: 9.0
    • Component/s: None
    • Labels: None

    Description

      The largest cost of the export handler is the sorting. This ticket will implement an improved algorithm for sorting that should greatly increase overall throughput for the export handler.

      The current algorithm is as follows:

      Collect a bitset of matching docs. Iterate over that bitset, materialize the top level ordinals for the sort fields in each document, and add them to a priority queue of size 30,000. Then export the top 30,000 docs, turn off their bits in the bitset, and iterate again until all docs are sorted and sent.
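      The multi-pass structure described above can be sketched as follows. This is a minimal illustration, not Solr's actual code: the class name, the tiny batch size, and the use of a plain int array in place of materialized top level ordinals are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the current export sort: repeated passes over a bitset of
// matching docs, each pass sorting through one large priority queue.
public class CurrentExportSort {
    static final int BATCH = 4; // stands in for the real batch size of 30,000

    // sortValues[doc] stands in for the doc's materialized top level ordinal.
    static List<Integer> export(BitSet matches, int[] sortValues) {
        List<Integer> exported = new ArrayList<>();
        while (!matches.isEmpty()) {
            // Keep the BATCH docs with the smallest sort values by using a
            // max-heap and evicting the current worst doc once the queue is full.
            PriorityQueue<Integer> queue =
                new PriorityQueue<>((a, b) -> sortValues[b] - sortValues[a]);
            for (int doc = matches.nextSetBit(0); doc >= 0;
                     doc = matches.nextSetBit(doc + 1)) {
                queue.offer(doc);
                if (queue.size() > BATCH) {
                    queue.poll(); // evict the doc with the largest sort value
                }
            }
            // Drain the queue (worst-first), reverse into sorted order,
            // export, and clear the exported docs' bits for the next pass.
            List<Integer> batch = new ArrayList<>();
            while (!queue.isEmpty()) batch.add(queue.poll());
            for (int i = batch.size() - 1; i >= 0; i--) {
                int doc = batch.get(i);
                exported.add(doc);
                matches.clear(doc);
            }
        }
        return exported;
    }
}
```

      Note that every pass re-materializes the ordinals for every still-set doc, which is exactly the overhead the new algorithm targets.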

      There are two performance bottlenecks with this approach:

      1) Materializing the top level ordinals adds a huge amount of overhead to the sorting process.

      2) The size of the priority queue, 30,000, adds significant overhead to the sorting operations.

      The new algorithm:

      Has a top level merge sort iterator that wraps segment level iterators that perform segment level priority queue sorts.

      Segment level:

      The segment level docset will be iterated and the segment level ordinals for the sort fields will be materialized and added to a segment level priority queue. As the segment level iterator pops docs from the priority queue the top level ordinals for the sort fields are materialized. Because the top level ordinals are materialized AFTER the sort, they only need to be looked up when the segment level ordinal changes. This takes advantage of the sort to limit the lookups into the top level ordinal structures. This also eliminates redundant lookups of top level ordinals that occur during the multiple passes over the matching docset.
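      The deferred lookup described above can be sketched as follows. This is an illustrative assumption, not Solr's implementation: the class name, the lookup counter, and the `IntUnaryOperator` standing in for the segment-to-top-level ordinal map are all invented for the example.

```java
import java.util.PriorityQueue;
import java.util.function.IntUnaryOperator;

// Sketch of the segment level pass: because docs pop from the queue in
// segment-ordinal order, the (expensive) top level ordinal lookup happens
// only when the segment ordinal changes between consecutive popped docs.
public class SegmentSortSketch {
    static int lookups = 0; // counts top level ordinal materializations

    static int[] popGlobalOrds(int[] segOrds, IntUnaryOperator segToGlobal) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(); // min-heap of segment ordinals
        for (int ord : segOrds) queue.offer(ord);

        int[] globalOrds = new int[segOrds.length];
        int lastSegOrd = -1, lastGlobal = -1;
        for (int i = 0; !queue.isEmpty(); i++) {
            int segOrd = queue.poll();
            if (segOrd != lastSegOrd) { // re-materialize only on a change
                lastGlobal = segToGlobal.applyAsInt(segOrd);
                lastSegOrd = segOrd;
                lookups++;
            }
            globalOrds[i] = lastGlobal;
        }
        return globalOrds;
    }
}
```

      With repeated segment ordinals, the number of top level lookups drops from one per doc to one per distinct ordinal value in the popped run.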

      The segment level priority queues can be kept smaller than 30,000 to improve the performance of the sorting operations, because the overall batch size will still be 30,000 or greater when all the segment priority queue sizes are added up. This allows for batch sizes much larger than 30,000 without using a single large priority queue. The increased batch size means fewer iterations over the matching docset, and the decreased priority queue size means faster sorting operations.

      Top level:

      A top level iterator does a merge sort over the segment level iterators by comparing the top level ordinals materialized when the segment level docs are popped from the segment level priority queues. This requires no extra memory and will be very performant.
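      The top level merge can be sketched as a small heap of per-segment cursors. Again this is a simplified assumption, not Solr's code: each segment iterator is modeled as an int array of top level ordinals already in ascending order, as produced by the segment level pass above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the top level merge sort: each segment stream is already sorted
// by top level ordinal, so a heap holding one cursor per segment merges the
// streams with no per-doc re-sorting and no extra buffering.
public class TopLevelMergeSketch {
    static List<Integer> mergeSort(List<int[]> segmentStreams) {
        // Heap entries are {segmentIndex, position}, ordered by the top level
        // ordinal currently pointed at; at most one entry per segment.
        PriorityQueue<int[]> cursors = new PriorityQueue<>(
            Comparator.comparingInt((int[] c) -> segmentStreams.get(c[0])[c[1]]));
        for (int s = 0; s < segmentStreams.size(); s++) {
            if (segmentStreams.get(s).length > 0) cursors.offer(new int[]{s, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!cursors.isEmpty()) {
            int[] c = cursors.poll();
            int[] stream = segmentStreams.get(c[0]);
            merged.add(stream[c[1]]);
            if (++c[1] < stream.length) cursors.offer(c); // advance this segment
        }
        return merged;
    }
}
```

      The heap never grows beyond the number of segments, which is why the merge needs no memory beyond the segment iterators themselves.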

       


          People

            Assignee: Joel Bernstein (jbernste)
            Reporter: Joel Bernstein (jbernste)
            Votes: 1
            Watchers: 9
