[IMPALA-8040] Incorrect double-accounting for WHERE on join column - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.1.0
Fix Version/s: None
Component/s: Frontend
Labels:
None

Epic Color:
ghx-label-4

Description

When a column appears as both a WHERE clause entry and a join condition, the planner incorrectly reduces estimated join cardinality twice for the same column.

To see this, start with the simple case, join and WHERE on different columns.

select m.c_custkey, d.o_custkey
from tpch.customer m,
     tpch.orders d
where m.c_custkey = d.o_custkey
  and m.c_name = 'foo'

Here, c_name is unique, so only one row matches the filter. There are, on average, 10 rows per customer, so joining the one customer with its orders should return 10 rows.

Plan:

02:HASH JOIN [INNER JOIN]
|  hash predicates: d.o_custkey = m.c_custkey
|  fk/pk conjuncts: d.o_custkey = m.c_custkey
|  tuple-ids=1,0 row-size=46B cardinality=16
|
|--00:SCAN HDFS [tpch.customer m]
|     partitions=1/1 files=1 size=23.08MB
|     predicates: m.c_name = 'foo'
|     tuple-ids=0 row-size=38B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
   partitions=1/1 files=1 size=162.56MB
   runtime filters: RF000[bloom] -> d.o_custkey
   tuple-ids=1 row-size=8B cardinality=1500000

Notice the join cardinality of 16. (Should be 10, see IMPALA-8014).

Now, do basically the same thing, but select that one customer base on ID, not name.

select m.c_custkey, d.o_custkey
from tpch.customer m,
     tpch.orders d
where m.c_custkey = d.o_custkey
  and m.c_custkey = 10

Logically, we still get one customer and that customer still joins with its 10 orders. But this time, we get the following plan:

02:HASH JOIN [INNER JOIN]
|  hash predicates: d.o_custkey = m.c_custkey
|  fk/pk conjuncts: d.o_custkey = m.c_custkey
|  tuple-ids=1,0 row-size=16B cardinality=1
|
|--00:SCAN HDFS [tpch.customer m]
|     partitions=1/1 files=1 size=23.08MB
|     predicates: m.c_custkey = CAST(10 AS BIGINT)
|     tuple-ids=0 row-size=8B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
   partitions=1/1 files=1 size=162.56MB
   predicates: d.o_custkey = CAST(10 AS BIGINT)
   tuple-ids=1 row-size=8B cardinality=15

Notice that, in the orders scan, the planner has applied the m.c_custkey = 10 to reduce the scan count, which is good. (Though, the cardinality estimate is wrong, it should be 10 for TPC-H.)

Then, we use the normal join calcs to assume that those 15 rows are matched against the set of 150K customers. Since there is a very low probability of a match, we guess about 1 row.

The problem is, we are using the same column twice. If we already reduced the orders scan by the one ID we are looking for, we can't then assume, in the join, that we have the full range of IDs.

The correct join cardinality would still be 10 if we handled this case correctly.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Paul Rogers

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 31/Dec/18 23:20

Updated:: 21/Feb/19 23:59