Details
Description
The cardinality of an inner join may be significantly underestimated if (1) an equivalent predicate exists on both join inputs, (2) the join condition involves the same column as that predicate, and (3) Impala believes the join to be FK/PK.
The reason for this underestimation is that the planner double-counts the selectivity of predicates on the join input:
- First, the selectivity reduces the cardinality of the join input
- Second, since the join is FK/PK, the build-side selectivity is applied to the join cardinality
This second adjustment is not correct in this specific situation because the predicate selectivity has already been applied to the probe-side join input.
Example:
explain select count(*) from functional.alltypes a join functional.alltypes b on a.id = b.id and a.id < 10 and b.id < 10;
+------------------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=4.00MB |
| Per-Host Resource Estimates: Memory=279.94MB |
| Codegen disabled by planner |
| |
| F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 |
| | Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B |
| PLAN-ROOT SINK |
| | mem-estimate=0B mem-reservation=0B |
| | |
| 07:AGGREGATE [FINALIZE] |
| | output: count:merge(*) |
| | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| 06:EXCHANGE [UNPARTITIONED] |
| | mem-estimate=0B mem-reservation=0B |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| F02:PLAN FRAGMENT [HASH(a.id)] hosts=3 instances=3 |
| Per-Host Resources: mem-estimate=12.94MB mem-reservation=2.94MB runtime-filters-memory=1.00MB |
| 03:AGGREGATE |
| | output: count(*) |
| | mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB |
| | tuple-ids=2 row-size=8B cardinality=1 |
| | |
| 02:HASH JOIN [INNER JOIN, PARTITIONED] |
| | hash predicates: a.id = b.id |
| | fk/pk conjuncts: a.id = b.id |
| | runtime filters: RF000[bloom] <- b.id |
| | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB |
| | tuple-ids=0,1 row-size=8B cardinality=73 <--- should be 730 |
| | |
| |--05:EXCHANGE [HASH(b.id)] |
| | | mem-estimate=0B mem-reservation=0B |
| | | tuple-ids=1 row-size=4B cardinality=730 |
| | | |
| | F01:PLAN FRAGMENT [RANDOM] hosts=3 instances=3 |
| | Per-Host Resources: mem-estimate=128.00MB mem-reservation=32.00KB |
| | 01:SCAN HDFS [functional.alltypes b, RANDOM] |
| | partitions=24/24 files=24 size=478.45KB |
| | predicates: b.id < 10 |
| | stored statistics: |
| | table: rows=7300 size=478.45KB |
| | partitions: 24/24 rows=7300 |
| | columns: all |
| | extrapolated-rows=disabled |
| | parquet dictionary predicates: b.id < 10 |
| | mem-estimate=128.00MB mem-reservation=32.00KB |
| | tuple-ids=1 row-size=4B cardinality=730 |
| | |
| 04:EXCHANGE [HASH(a.id)] |
| | mem-estimate=0B mem-reservation=0B |
| | tuple-ids=0 row-size=4B cardinality=730 |
| | |
| F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3 |
| Per-Host Resources: mem-estimate=129.00MB mem-reservation=1.03MB runtime-filters-memory=1.00MB |
| 00:SCAN HDFS [functional.alltypes a, RANDOM] |
| partitions=24/24 files=24 size=478.45KB |
| predicates: a.id < 10 |
| runtime filters: RF000[bloom] -> a.id |
| stored statistics: |
| table: rows=7300 size=478.45KB |
| partitions: 24/24 rows=7300 |
| columns: all |
| extrapolated-rows=disabled |
| parquet dictionary predicates: a.id < 10 |
| mem-estimate=128.00MB mem-reservation=32.00KB |
| tuple-ids=0 row-size=4B cardinality=730 |
+------------------------------------------------------------------------------------------------+