Details
-
Improvement
-
Status: Patch Available
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
In most cases, partial distinct aggregation is not needed in map-side groupby. The exception is that with sorted bucketized tables partial distinct aggregation can be done by the mappers in some scenarios, as what is done by GroupByOptimzer.
Currently, partial distinct aggregation is done in the map-side GroupBy and then shuffle of the partial result is done in the following ReduceSink operator, in cases where they are not needed. This wastes CPU cycles, memory and network bandwidth.
This optimization eliminates un-needed partial distinct aggregations, which improves performance and reduces memory usage.
For example,
EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
Before optimization:
Group By Operator aggregations: expr: count(DISTINCT value) bucketGroup: false keys: expr: key type: int expr: value type: string mode: hash outputColumnNames: _col0, _col1, _col2 Reduce Output Operator key expressions: expr: _col0 type: int expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: int tag: -1 value expressions: expr: _col2 type: bigint
After optimization:
Group By Operator bucketGroup: false keys: expr: key type: int expr: value type: string mode: hash outputColumnNames: _col0, _col1 Reduce Output Operator key expressions: expr: _col0 type: int expr: _col1 type: string sort order: ++ Map-reduce partition columns: expr: _col0 type: int tag: -1