Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.8.0
Description
When working on IMPALA-4174 realized that the planner doesn't handle cardinality estimation for multi column joins.
Another variation of the same problem is when the primary key is consistent of multiple columns, for TPC-DS store_sales the primary key is (ss_ticket_number, ss_item_sk) and for store_returns the primary key is (sr_ticket_number, sr_item_sk), so the join condition below is a PK/PK join and the current algorithm doesn't handle that case and the estimate for the join uses the many to many calculation which is incorrect
SELECT count(*) FROM store_sales, store_returns WHERE ss_customer_sk = sr_customer_sk AND ss_item_sk = sr_item_sk AND ss_ticket_number = sr_ticket_number
+-------------------------------------------------------------------------------------------------------------------+
| Explain String |
+-------------------------------------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.15GB VCores=2 |
| |
| PLAN-ROOT SINK |
| | |
| 07:AGGREGATE [FINALIZE] |
| | output: count:merge(*) |
| | hosts=20 per-host-mem=unavailable |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| 06:EXCHANGE [UNPARTITIONED] |
| | hosts=20 per-host-mem=unavailable |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| 03:AGGREGATE |
| | output: count(*) |
| | hosts=20 per-host-mem=10.00MB |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| 02:HASH JOIN [INNER JOIN, PARTITIONED] |
| | hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk, ss_ticket_number = sr_ticket_number |
| | runtime filters: RF000 <- sr_customer_sk, RF001 <- sr_item_sk, RF002 <- sr_ticket_number |
| | hosts=20 per-host-mem=906.36MB |
| | tuple-ids=0,1 row-size=40B cardinality=15477599674736 |
| | |
| |--05:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)] |
| | | hosts=20 per-host-mem=0B |
| | | tuple-ids=1 row-size=20B cardinality=863989652 |
| | | |
| | 01:SCAN HDFS [tpcds_3000_parquet.store_returns, RANDOM] |
| | partitions=2004/2004 files=2008 size=48.92GB |
| | table stats: 863989652 rows total |
| | column stats: all |
| | hosts=20 per-host-mem=120.00MB |
| | tuple-ids=1 row-size=20B cardinality=863989652 |
| | |
| 04:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)] |
| | hosts=20 per-host-mem=0B |
| | tuple-ids=0 row-size=20B cardinality=8639936081 |
| | |
| 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM] |
| partitions=1824/1824 files=2649 size=376.48GB |
| runtime filters: RF000 -> ss_customer_sk, RF001 -> ss_item_sk, RF002 -> ss_ticket_number |
| table stats: 8639936081 rows total |
| column stats: all |
| hosts=20 per-host-mem=264.00MB |
| tuple-ids=0 row-size=20B cardinality=8639936081 |
+-------------------------------------------------------------------------------------------------------------------+
Attachments
Issue Links
- blocks
-
IMPALA-4174 Planner incorrectly estimates cardinality for many to many joins
- Resolved