Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.10.0
-
ghx-label-5
Description
IMPALA-3200 offers an opportunity to improve the spilling sort algorithm:
- Use the reliability of reservations to select the most efficient order to conduct merges in (rather than greedily trying to maximise the fan-in of the current merge). We want to minimise the depth of the merge tree, then structure the tree based on the preferred fan-in.
- Do multiple-buffering of the stream being written (this happens automatically if there are free buffers in the BufferPool client).
- Do multiple-buffering of the streams being read, instead of blocking on read I/O frequently.
More concretely, the idea is to implement double-buffering of spilled input runs by calling BufferPool::Pin() early to prefetch the second page in each input Run. Currently only one page per input run is pinned, which means that the sorter frequently blocks on I/O.
I'd suggest doing this in two steps.
The first step is to change how the fan-in of each merge run is selected. We know the number of runs to be merged and the buffer reservation that is available, so we can compute the maximum possible fan-in of each merge step (assuming 1 buffer for the output run and 1 buffer for each input run to the merge). We can then calculate the minimum number of rounds of merging required and, based on that, decide how the runs should be merged (you could think about it as a tree of merge operations). I think we want to reduce the number of bytes written to disk. E.g. if we have 5 buffers and 8 input runs, we should merge input runs (1,2,3,4) then merge that intermediate runs with runs (5,6,7). It's reasonable to assume that the input runs are all approximate the same size.
ee53ddb389549247f5bfe760d446dc7b3b963a29 actually removed some logic along those lines because it didn't work with the old buffer management scheme. The logic before that commit might provide some ideas. There are also some related TODOs in Sorter::MergeIntermediateRuns() and Sorter::CreateMerger() to simplify how the number of input runs is decided and how the merger is set up:
// TODO: once we have reliable reservations (IMPALA-3200), we should calculate this // based on the available reservations. .... // TODO: this isn't optimal: we could defer creating the merged run if we have // reliable reservations (IMPALA-3200). ... // TODO: IMPALA-3200: we should not need this logic once we have reliable // reservations (IMPALA-3200).
The second step would be to adjust the logic from the first step to reserve 2 buffers per input and output run and then implement the logic to call Pin() earlier to prefetch the page after the current page.