Description
With query below the planner checks the NDV and row count for catalog_returns and decides that the join between store_sales and catalog_returns is a many to many join and estimates that the join will produce 17 Billion rows.
The join column cr_item_sk is actually unique from the subquery since there is a group by on cr_item_sk and the cardinality estimation algorithm used for the join should be a one to many.
WITH cs_ui AS (SELECT cr_item_sk FROM catalog_returns GROUP BY cr_item_sk) SELECT Count(*) cnt FROM store_sales, cs_ui WHERE ss_item_sk = cs_ui.cr_item_sk
+--------------------------------------------------------------+
| Explain String |
+--------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=101.07MB VCores=2 |
| |
| PLAN-ROOT SINK |
| | |
| 09:AGGREGATE [FINALIZE] |
| | output: count:merge(*) |
| | hosts=20 per-host-mem=unavailable |
| | tuple-ids=4 row-size=8B cardinality=1 |
| | |
| 08:EXCHANGE [UNPARTITIONED] |
| | hosts=20 per-host-mem=unavailable |
| | tuple-ids=4 row-size=8B cardinality=1 |
| | |
| 04:AGGREGATE |
| | output: count(*) |
| | hosts=20 per-host-mem=10.00MB |
| | tuple-ids=4 row-size=8B cardinality=1 |
| | |
| 03:HASH JOIN [INNER JOIN, BROADCAST] |
| | hash predicates: ss_item_sk = cr_item_sk |
| | runtime filters: RF000 <- cr_item_sk |
| | hosts=20 per-host-mem=3.07MB |
| | tuple-ids=0,2 row-size=16B cardinality=17,966,906,063 |
| | |
| |--07:EXCHANGE [BROADCAST] |
| | | hosts=20 per-host-mem=0B |
| | | tuple-ids=2 row-size=8B cardinality=365989 |
| | | |
| | 06:AGGREGATE [FINALIZE] |
| | | group by: cr_item_sk |
| | | hosts=20 per-host-mem=10.00MB |
| | | tuple-ids=2 row-size=8B cardinality=365989 |
| | | |
| | 05:EXCHANGE [HASH(cr_item_sk)] |
| | | hosts=20 per-host-mem=0B |
| | | tuple-ids=2 row-size=8B cardinality=365989 |
| | | |
| | 02:AGGREGATE [STREAMING] |
| | | group by: cr_item_sk |
| | | hosts=20 per-host-mem=10.00MB |
| | | tuple-ids=2 row-size=8B cardinality=365989 |
| | | |
| | 01:SCAN HDFS [tpcds_3000_parquet.catalog_returns, RANDOM] |
| | partitions=2104/2104 files=2104 size=33.03GB |
| | table stats: 432018033 rows total |
| | column stats: all |
| | hosts=20 per-host-mem=32.00MB |
| | tuple-ids=1 row-size=8B cardinality=432,018,033 |
| | |
| 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM] |
| partitions=1824/1824 files=2649 size=376.48GB |
| runtime filters: RF000 -> ss_item_sk |
| table stats: 8639936081 rows total |
| column stats: all |
| hosts=20 per-host-mem=88.00MB |
| tuple-ids=0 row-size=8B cardinality=8,639,936,081 |
+--------------------------------------------------------------+
Attachments
Issue Links
- blocks
-
IMPALA-4174 Planner incorrectly estimates cardinality for many to many joins
- Resolved