Currently, distinct aggregation queries are executed as follows:
- the first stage: it just shuffles tuples by hashing grouping keys.
- the second stage: it sorts them and executes sort aggregation.
This way executes queries including distinct aggregation functions with only two stages. But, it leads to large intermediate data during shuffle phase.
This kind of query can be rewritten as two queries:
I'm expecting that this rewrite will significantly reduce the intermediate data volume and query response time in most cases.