When an aggregation requires a shuffle, Spark SQL performs separate partial and final aggregations:
However, consider what happens if the dataset being aggregated is already pre-partitioned by the aggregate's grouping columns:
Here, we end up with back-to-back HashAggregate operators which are performed as part of the same stage.
For certain aggregates (e.g. sum, count), this duplication is unnecessary: we could have just performed a total aggregation instead (since we already have all of the data co-located)!
The duplicate aggregate is problematic in cases where the aggregate inputs and outputs are the same order of magnitude (e.g.counting the number of duplicate records in a dataset where duplicates are extremely rare).
My motivation for this optimization is similar to
SPARK-1412: I know that partial aggregation doesn't help for my workload, so I wanted to somehow coerce Spark into skipping the ineffective partial aggregation and jumping directly to total aggregation. I thought that pre-partitioning would accomplish this, but doing so didn't achieve my goal due to the missing aggregation-collapsing (or partial-aggregate skipping) optimization.