Details
Description
Selectivity of join predicates are not taken into account when estimating the number of out rows for a join only predicates on the RHS are taken into account.
This can result in in-efficient join order.
Current implementation has a limitation that cardinality estimation for joins is based on base tables rather than expression trees or subqueries.
if (lhsStats.ndv <= rhsStats.ndv * (1.0 + FK_PK_MAX_STATS_DELTA_PERC) || Math.abs(rhsStats.numRows - rhsStats.ndv) / (double) rhsStats.numRows <= FK_PK_MAX_STATS_DELTA_PERC) { // Adjust the join selectivity based on the NDV ratio to avoid underestimating // the cardinality if the PK side has a higher NDV than the FK side. double ndvRatio = (double) rhsStats.ndv / (double) lhsStats.ndv; double rhsSelectivity = (double) rhsCard / (double) rhsStats.numRows; long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio); // FK/PK join cardinality must be <= the lhs cardinality. joinCard = Math.min(lhsCard, joinCard); if (fkPkJoinCard == -1) { fkPkJoinCard = joinCard; } else { fkPkJoinCard = Math.min(fkPkJoinCard, joinCard); }
Query
select sum (ss_quantity) from store_sales,customer_demographics where ( ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'M' and cd_education_status = '4 yr Degree' and ss_sales_price between 100.00 and 150.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'N' and cd_education_status = '8 yr Degree' and ss_sales_price between 50.00 and 100.00 ) or ( cd_demo_sk = ss_cdemo_sk and cd_marital_status = 'S' and cd_education_status = '6 yr Degree' and ss_sales_price between 150.00 and 200.00 ) )
Plan
| PLAN-ROOT SINK | | | | | 06:AGGREGATE [FINALIZE] | | | output: sum:merge(ss_quantity) | | | hosts=7 per-host-mem=unavailable | | | tuple-ids=2 row-size=8B cardinality=1 | | | | | 05:EXCHANGE [UNPARTITIONED] | | | hosts=7 per-host-mem=unavailable | | | tuple-ids=2 row-size=8B cardinality=1 | | | | | 03:AGGREGATE | | | output: sum(ss_quantity) | | | hosts=7 per-host-mem=10.00MB | | | tuple-ids=2 row-size=8B cardinality=1 | | | | | 02:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: ss_cdemo_sk = cd_demo_sk | | | other predicates: ((cd_marital_status = 'M' AND cd_education_status = '4 yr Degree' AND ss_sales_price >= 100.00 AND ss_sales_price <= 150.00) OR (cd_marital_status = 'N' AND cd_education_status = '8 yr Degree' AND ss_sales_price >= 50.00 AND ss_sales_price <= 100.00) OR (cd_marital_status = 'S' AND cd_education_status = '6 yr Degree' AND ss_sales_price >= 150.00 AND ss_sales_price <= 200.00)) | | | runtime filters: RF000 <- cd_demo_sk | | | hosts=7 per-host-mem=101.90MB | | | tuple-ids=0,1 row-size=67B cardinality=2847058876 | | | | | |--04:EXCHANGE [BROADCAST] | | | | hosts=7 per-host-mem=0B | | | | tuple-ids=1 row-size=51B cardinality=1920800 | | | | | | | 01:SCAN HDFS [tpcds_1000_parquet.customer_demographics, RANDOM] | | | partitions=1/1 files=7 size=8.02MB | | | table stats: 1920800 rows total | | | column stats: all | | | hosts=7 per-host-mem=48.00MB | | | tuple-ids=1 row-size=51B cardinality=1920800 | | | | | 00:SCAN HDFS [tpcds_1000_parquet.store_sales, RANDOM] | | partitions=1824/1824 files=1843 size=127.57GB | | runtime filters: RF000 -> ss_cdemo_sk | | table stats: 2879987999 rows total | | column stats: all | | hosts=7 per-host-mem=240.00MB | | tuple-ids=0 row-size=16B cardinality=2879987999