[IMPALA-2791] Avoid unnecessary two-phased aggregation. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: Impala 2.2, Impala 2.3.0
Fix Version/s: None
Component/s: Frontend
Labels:
- performance
- planner

Target Version:

Product Backlog

Description

We perform a two-phased aggregation for evaluating distinct aggregate expressions like count(distinct). However, if the distinct aggregate expression is not referenced in enclosing query blocks, then the two-phased aggregation is unnecessary and should be skipped.

Example:

 explain select x from
  (select count(int_col) x, count(distinct bigint_col) y from functional.alltypes) v;
+-----------------------------------------------------------+
| Explain String                                            |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=170.00MB VCores=2 |
|                                                           |
| 06:AGGREGATE [FINALIZE]                                   |
| |  output: count:merge(bigint_col), count:merge(int_col)  |
| |                                                         |
| 05:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 02:AGGREGATE                                              |
| |  output: count(bigint_col), count:merge(int_col)        |
| |                                                         |
| 04:AGGREGATE                                              |
| |  output: count:merge(int_col)                           |
| |  group by: bigint_col                                   |
| |                                                         |
| 03:EXCHANGE [HASH(bigint_col)]                            |
| |                                                         |
| 01:AGGREGATE                                              |
| |  output: count(int_col)                                 |
| |  group by: bigint_col                                   |
| |                                                         |
| 00:SCAN HDFS [functional.alltypes]                        |
|    partitions=24/24 files=24 size=478.45KB                |
+-----------------------------------------------------------+

In the query above, it is unnecessary to compute the "count(distinct bigint_col)" aggregate expression, so a single-phased aggregation would be sufficient.

One way to fix this issue would be to defer creation of the AggregateInfo to the planning phase where the materialization of aggregate expressions is known. Currently, we create the AggregateInfo during analysis. Retroactively "fixing" an AggregateInfo during planning to remover the two phases seems complicated.

This limitation inhibits other optimizations such as ~~IMPALA-2499~~.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Alexander Behm

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 21/Dec/15 22:37

Updated:: 11/Jul/18 17:28