Details
Improvement
Status: Open
Minor
Resolution: Unresolved
Impala 2.2, Impala 2.3.0
None
Description
We perform a two-phased aggregation for evaluating distinct aggregate expressions like count(distinct). However, if the distinct aggregate expression is not referenced in enclosing query blocks, then the two-phased aggregation is unnecessary and should be skipped.
Example:
explain select x from
(select count(int_col) x, count(distinct bigint_col) y from functional.alltypes) v;
+-----------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=170.00MB VCores=2 |
| |
| 06:AGGREGATE [FINALIZE] |
| | output: count:merge(bigint_col), count:merge(int_col) |
| | |
| 05:EXCHANGE [UNPARTITIONED] |
| | |
| 02:AGGREGATE |
| | output: count(bigint_col), count:merge(int_col) |
| | |
| 04:AGGREGATE |
| | output: count:merge(int_col) |
| | group by: bigint_col |
| | |
| 03:EXCHANGE [HASH(bigint_col)] |
| | |
| 01:AGGREGATE |
| | output: count(int_col) |
| | group by: bigint_col |
| | |
| 00:SCAN HDFS [functional.alltypes] |
| partitions=24/24 files=24 size=478.45KB |
+-----------------------------------------------------------+
In the query above, the enclosing query block selects only x, so it is unnecessary to compute the "count(distinct bigint_col)" aggregate expression (y), and a single-phased aggregation would be sufficient.
One way to fix this issue would be to defer creation of the AggregateInfo to the planning phase, where the materialization of aggregate expressions is known. Currently, we create the AggregateInfo during analysis. Retroactively "fixing" an AggregateInfo during planning to remove the two phases seems complicated.
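The idea of deciding the aggregation strategy only after materialization is known can be sketched roughly as follows. This is an illustrative Python model, not Impala's frontend code: AggExpr, plan_aggregation and the "single-phase"/"two-phase" labels are hypothetical names invented for this sketch, and the real AggregateInfo carries far more state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggExpr:
    """Hypothetical stand-in for one aggregate expression in a query block."""
    alias: str              # output column name, e.g. "x"
    func: str               # aggregate function, e.g. "count"
    arg: str                # input column, e.g. "int_col"
    distinct: bool = False  # True for count(distinct ...)


def plan_aggregation(agg_exprs, referenced_aliases):
    """Drop aggregates no enclosing block references, then pick a strategy.

    Assumption for this sketch: the set of referenced aliases is already
    known, which is the case at planning time but not during analysis.
    """
    materialized = [e for e in agg_exprs if e.alias in referenced_aliases]
    needs_two_phases = any(e.distinct for e in materialized)
    strategy = "two-phase" if needs_two_phases else "single-phase"
    return materialized, strategy


# The inline view from the example: count(int_col) x, count(distinct bigint_col) y
exprs = [
    AggExpr("x", "count", "int_col"),
    AggExpr("y", "count", "bigint_col", distinct=True),
]

# The outer block selects only x, so y is never materialized.
materialized, strategy = plan_aggregation(exprs, {"x"})
```

With only x referenced, the sketch keeps a single non-distinct aggregate and selects the single-phase strategy; if the outer block also referenced y, the distinct aggregate would survive pruning and the two-phase strategy would still be chosen.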
This limitation inhibits other optimizations such as IMPALA-2499.