[IMPALA-8262] Join cardinality not decreased by join filter selectivity - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.1.0
Fix Version/s: None
Component/s: Frontend
Labels: None
Epic Link: Impala 5

Description

Consider a subset of the plan for TPC-H query 7. (See tpch-all.test for details.)

11:AGGREGATE [FINALIZE]
|  output: sum(l_extendedprice * (1 - l_discount))
|  group by: n1.n_name, n2.n_name, year(l_shipdate)
|  row-size=58B cardinality=575.77K
|
10:HASH JOIN [INNER JOIN]
|  hash predicates: c_nationkey = n2.n_nationkey
|  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
|  row-size=132B cardinality=575.77K
|
|--05:SCAN HDFS [tpch.nation n2]
|     row-size=21B cardinality=25
|
09:HASH JOIN [INNER JOIN]
|  hash predicates: s_nationkey = n1.n_nationkey
|  row-size=111B cardinality=575.77K

Here, join 09 feeds 576K rows into join 10, and all 576K rows pass along to aggregate 11. Notice, however, that join 10 has a predicate (listed under "other predicates") that picks out 2 of the 25 countries in each of two paths. The selectivity of that predicate should be something like 2 * 2/25 = 0.16. Thus, the output cardinality of join 10 should be about 576K * 0.16 = 92K, not 576K.
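The arithmetic above can be checked with a short sketch. This is only an illustration of the estimate described in this report, not Impala's planner code; the function name `estimate_join_output` is hypothetical, and the 0.16 selectivity is the rough guess given above rather than a value the planner computes.

```python
def estimate_join_output(input_rows: float, selectivity: float) -> float:
    """Scale the join's input cardinality by the selectivity of its other predicates."""
    return input_rows * selectivity


# Cardinality of join 09 as shown in the plan (575.77K rows).
input_rows = 575_770

# Rough selectivity from the description: 2 of 25 countries on each of two paths.
# This is the report's approximation, labeled here as an assumption.
other_pred_selectivity = 2 * 2 / 25

estimated = estimate_join_output(input_rows, other_pred_selectivity)
print(f"selectivity = {other_pred_selectivity:.2f}")
print(f"estimated join 10 output = {estimated / 1000:.1f}K rows")
```

Under these assumptions the estimate comes out near 92K rows, roughly one sixth of the 575.77K that the plan currently reports for join 10.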

The problem is that the join cardinality calculations don't consider join filter selectivity.

It may be that this was done to handle the outer join case, in which filters applied in the outer-side scan must be re-applied on the join. Omitting the filters avoids duplicate accounting for the selectivity.

But, that case is special and should be handled specially as part of IMPALA-8213. Except for correlated filters, the planner should apply join filter selectivity to the join output cardinality calculations.

This error has consequences. The filter should reduce the number of rows passing through the join. Because it does so, the join should come early in the join tree to reduce the set of rows processed. But because selectivity is ignored, the planner does not see the join as a filter and ends up putting join 10 at the top of the join tree. (See the test file for the full plan.) The result is that Impala schleps around many more rows than necessary, only to discard them near the top of the DAG.

People

Assignee: Unassigned

Reporter: Paul Rogers

Votes: 0

Watchers: 2

Dates

Created: 28/Feb/19 06:10

Updated: 20/Feb/24 18:59