Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 4.4.0
-
None
-
ghx-label-1
Description
If an aggregation node has multiple grouping expressions that originate from the same tuple, then their combined NDV must not exceed output cardinality of PlanNode producing that tuple. Take example of this PARALLELPLANS from Q31.
| 11:AGGREGATE [STREAMING] | | output: sum(ss_ext_sales_price) | | group by: ca_county, d_qoy, d_year | | mem-estimate=84.55MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0 | | tuple-ids=8 row-size=50B cardinality=1.43M cost=1948896250 | | in pipelines: 06(GETNEXT) .... | | 07:SCAN HDFS [tpcds_partitioned_parquet_snap.date_dim, RANDOM] | | HDFS partitions=1/1 files=1 size=2.17MB | | predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT) | | stored statistics: | | table: rows=73.05K size=2.17MB | | columns: all | | extrapolated-rows=disabled max-scan-range-rows=73.05K | | parquet statistics predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT) | | parquet dictionary predicates: tpcds_partitioned_parquet_snap.date_dim.d_year = CAST(1998 AS INT), tpcds_partitioned_parquet_snap.date_dim.d_qoy = CAST(2 AS INT) | | mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=0 | | tuple-ids=6 row-size=12B cardinality=186 cost=16728 | | in pipelines: 07(GETNEXT)
Cardinality estimate of 11:AGGREGATE comes from this calculation:
est_cardinality(11:AGG) = NDV(ca_county) * NDV(d_qoy) * NDV (d_year) = 1825 * 4 * 196 = 1430800
However, d_qoy and d_year belong to the same TupleId 6 coming out from 07:SCAN, so its cardinality can be estimated lower to this:
est_cardinality(11:AGG) = NDV(ca_county) * est_cardinality(07:SCAN) = 1825 * 186 = 339450