When the query server needs to handle millions of records, CubeTupleConverter can become a performance bottleneck.
An experiment shows that converting 5 million records takes ~11s, which accounts for about 50% of the total query time.
Records returned from each storage partition are guaranteed to be ordered. Therefore we can reduce the number of records passed to CubeTupleConverter by
- merging sorted records from all partitions, similar to what we have done in
- using a stream aggregate algorithm on the merged stream to aggregate records with the same key
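To illustrate the second point, here is a minimal sketch of stream aggregation over a key-sorted input. The class and method names are hypothetical, not Kylin's actual API; the point is that because equal keys are contiguous in a sorted stream, each group can be emitted as soon as the key changes, using O(1) extra memory:

```java
import java.util.*;

// Minimal sketch of stream aggregation over an input sorted by key.
// Names are illustrative only, not Kylin's actual API.
public class StreamAggregator {

    // Sums the values of consecutive records that share the same key.
    // Since the input is sorted, each group is contiguous, so a group
    // can be emitted the moment its key changes.
    public static List<Map.Entry<String, Long>> aggregate(
            Iterator<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String curKey = null;
        long curSum = 0;
        while (sorted.hasNext()) {
            Map.Entry<String, Long> rec = sorted.next();
            if (curKey != null && !curKey.equals(rec.getKey())) {
                out.add(new AbstractMap.SimpleEntry<>(curKey, curSum));
                curSum = 0;
            }
            curKey = rec.getKey();
            curSum += rec.getValue();
        }
        if (curKey != null) {
            out.add(new AbstractMap.SimpleEntry<>(curKey, curSum));
        }
        return out;
    }
}
```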
To achieve this, the plan is to:
- Add a new physical operator, GTStreamAggregateScanner, which implements the stream aggregate algorithm
- Refine SortedIteratorMergerWithLimit, which is used to merge sorted records from different partitions. The previous implementation has performance issues (KYLIN-2483) due to expensive record cloning
- Leverage GTStreamAggregateScanner to aggregate records on the merged stream
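For reference, a k-way merge of pre-sorted partition streams can be sketched with a priority queue that tracks only each stream's current head record, so no per-record clone is needed. The class and method names below are hypothetical, not the actual SortedIteratorMergerWithLimit implementation:

```java
import java.util.*;

// Sketch of merging pre-sorted per-partition streams with a priority
// queue keyed on each stream's head record. Names are illustrative,
// not Kylin's actual SortedIteratorMergerWithLimit.
public class SortedMerger {

    public static <T> List<T> merge(List<Iterator<T>> partitions, Comparator<T> cmp) {
        // Each heap entry pairs a stream's current head record with the
        // stream itself; only one head record per stream is ever held,
        // so records never need to be cloned.
        PriorityQueue<AbstractMap.SimpleEntry<T, Iterator<T>>> heap =
                new PriorityQueue<>((a, b) -> cmp.compare(a.getKey(), b.getKey()));
        for (Iterator<T> it : partitions) {
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            AbstractMap.SimpleEntry<T, Iterator<T>> entry = heap.poll();
            out.add(entry.getKey());
            Iterator<T> it = entry.getValue();
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        return out;
    }
}
```

The merged output is itself sorted, which is exactly the precondition the stream aggregate step relies on.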
Stream aggregation has desirable properties such as low memory usage and streamable, ordered output, making it preferable to hash- or sort-based alternatives when the input is already sorted. I expect the new GTStreamAggregateScanner operator can also be used to accelerate cubing and coprocessor aggregation in certain cases. I'll focus on the query server in this jira and leave those optimizations as future work.