IMPALA-8213

Planner does not adjust join cardinality for correlated filters


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: Impala 3.1.0
    • Component: Frontend
    • Label: ghx-label-6

    Description

      The planner does not consider column join-equivalence when working out join cardinality on scans with filtering, leading to unrealistically low estimates.

      Here is a snippet from a real query:

      WHERE ...
         table_a.part_id = 123 
      OR table_a.part_id = 234
      

      The query has four other tables in addition to table_a. The WHERE clause equates the part_id column in all of them (in addition to other FK/PK pairs).

      The planner correctly works out join equivalence: it notices, via the join expressions, that the part_id columns all take the same values. The planner then pushes the above predicate into all the scans:

      |  |  04:SCAN HDFS [table_i, RANDOM]                                     
      |  |     partition predicates: table_i.part_id = CAST(123 AS BIGINT) OR table_i.part_id = CAST(234 AS BIGINT)
      
      06:SCAN HDFS [table_u, RANDOM]                                                  
         partition predicates: table_u.part_id = CAST(123 AS BIGINT) OR table_u.part_id = CAST(234 AS BIGINT)
      

      So far, so good. The problem is that the join cardinality calculation does not realize that all tables are filtered (in part) on the same column. That is, the join calculations assume that each table is filtered independently, when in fact the filtering is strongly correlated.

      Problem Detail

      To see this, imagine three tables, A, B, and C. They all have a part_id column. A has 10K rows; B and C have 1K rows each. Let A be a fact table, and B and C be dimension tables. We create a simplified form of the original query:

      SELECT count(*) FROM A, B, C
      WHERE A.part_id = 123
        AND A.part_id = B.part_id
        AND A.b_id = B.id
        AND A.part_id = C.part_id
        AND A.c_id = C.id
      

      Assume that B(part_id, id) and C(part_id, id) are primary keys. Assume |B.part_id| = 10 and |C.part_id| = 10. (That is, there are 10 partitions, each with 100 unique id values, so that |part_id| * |id| = |B|, etc.)

      If we use the (corrected) FK/PK join calculations we get:

      |A’| = |A| * sel(part_id = 123) = |A| * (1 / NDV(part_id)) = |A| / 10 = 10000 / 10 = 1000
      |B’| = as above, but for B
      |C’| = as above, but for C
      |A’ >< B’| = |A’| * |B’| / |B| = 1000 * 100 / 1000 = 100
      |(A’ >< B’) >< C’| = |A’ >< B’| * |C’| / |C| = 100 * 100 / 1000 = 10
      

      Is this the correct answer? In general, if the filtering of tables A and B were independent, it would be. But here the filtering is correlated: we want to account for that filtering once, not twice. We then compound the error on the next join by again assuming the filtering is independent. By the second join, we are off by a factor of 100.
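To make the double-counting concrete, here is the per-scan calculation above as a small numeric sketch (plain Python, not Impala code; the variable names are illustrative):

```python
# Sketch of today's estimate, which treats the part_id filter on each
# scan as if the filters were independent.

A, B, C = 10_000, 1_000, 1_000   # base table cardinalities
NDV = 10                         # NDV(part_id); sel(part_id = 123) = 1/NDV

# Filtered scan cardinalities: |T'| = |T| * sel
A1, B1, C1 = A // NDV, B // NDV, C // NDV   # 1000, 100, 100

# FK/PK join estimates, with the filter already applied on both inputs
AB = A1 * B1 // B                # |A' >< B'| = 1000 * 100 / 1000 = 100
ABC = AB * C1 // C               # |(A' >< B') >< C'| = 100 * 100 / 1000 = 10
print(AB, ABC)                   # 100 10
```

Each join step re-applies the same 1/10 selectivity, which is how the estimate drifts from 1,000 down to 10.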

      This issue is discussed in Swami & Schiefer, On the Estimation of Join Result Sizes (S&S), which makes this useful observation to simplify our analysis: “For the estimation of the result size, it does not matter in which order the predicates are applied.” (That is, we get the same answer if we filter then join, or join then filter — this is why query optimizers are complex: they must consider all such options.)

      Using this idea here, let’s join first, then filter second:

      |A >< B| = |A| * |B| / |B.pk| = 10,000 * 1000 / 1000 = 10,000
      |(A >< B) >< C| = |A >< B| * |C| / |C.pk| = 10,000 * 1000 / 1000 = 10,000
      sel(part_id = 123) = 0.1
      |((A >< B) >< C)’| = |(A >< B) >< C| * sel(part_id = 123) = 10,000 * 0.1 = 1000
      

      Here B.pk is shorthand for the compound primary key (part_id, id). Being a primary key, we know |B.pk| == |B|.

      We join the fact table A with the two dimension tables B and C, with no filtering. Intuitively, we end up with all the fact rows “enriched” with columns from B and C. Then we apply filtering on the common part_id column to get 1/10 of the rows, or 1,000.
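Under the same assumed cardinalities, the join-first ordering can be sketched as:

```python
# Sketch of the S&S ordering: join with no filters, then apply
# sel(part_id = 123) once at the end.

A, B, C = 10_000, 1_000, 1_000
B_PK, C_PK = B, C                # |B.pk| == |B|, |C.pk| == |C|

AB = A * B // B_PK               # |A >< B| = 10,000
ABC = AB * C // C_PK             # |(A >< B) >< C| = 10,000
final = ABC // 10                # apply sel(part_id = 123) = 0.1 once
print(final)                     # 1000
```

The shared filter contributes its selectivity exactly once, giving the 1,000-row answer.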

      Backing Out Duplicated Filters

      The calculations above use only multiplication and division and thus are commutative (the fancy way of repeating the observation that order does not matter). So, we can use a hybrid solution: apply filtering at each scan, then back out the redundant filtering at the join level. Suppose we have a predicate f (for filter) applied to both tables, with selectivity sel(f). Then:

      |A'| = |A| * sel(f)
      |B'| = |B| * sel(f)
      

      Now we can see the problem with the incorrect rule that Impala uses today:

      |A' >< B'| = |A'| * |B'| / |B|
                 = |A| * sel(f) * |B| * sel(f) / |B|
                 = |A| * sel(f) * sel(f)
      

      Notice that we've applied the filter twice rather than once as per our earlier example. To correct this, we have to remove one of the applications of the filter. That is, we need a correction, dividing out one of the sel(f) terms:

      |A' >< B'| = |A'| * |B'| / |B| / sel(f)
                 = |A| * sel(f) * |B| * sel(f) / |B| / sel(f)
                 = ( |A| * |B| / |B| ) * sel(f)
                 = |A| * sel(f)
      

      Intuitively, both the master and detail tables are filtered by f. The number of filtered detail rows is |A| * sel(f). Since the filter is applied to both tables, by the principle of containment, each filtered detail row from A still finds a matching master row in (the filtered subset of) B.
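The correction can be checked numerically (a sketch under the same assumed cardinalities; sel_f is an illustrative name):

```python
# Sketch of the correction: apply the shared filter f at both scans,
# then divide one application back out at the join.

A, B = 10_000, 1_000
sel_f = 0.1                      # sel(f), the shared filter

A1 = A * sel_f                   # |A'| = 1,000
B1 = B * sel_f                   # |B'| = 100

naive = A1 * B1 / B              # filter counted twice: 100 rows
corrected = naive / sel_f        # back out one sel(f): 1,000 rows
assert corrected == A * sel_f    # matches the join-then-filter answer
```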

      Note that, in practice, each side will have additional non-correlated filters, so the real-world case is not as simple as the above case.

      Proposed Solution

      The solution is straightforward: when the same filter is applied on both sides of a join, remove its selectivity from the final join estimate. To do this, we must track which predicates are common and which occur on only one side or the other. We can see this by extending the above overly simple query with an additional predicate on the B table:

      WHERE ...
        B.state = 'CA'
      

      Now, the A and B tables are filtered by part_id as before, but B is additionally filtered by state. We have to account for this.

      In the join, consider all filters applied on the inputs:

      • part_id = 123 - on both A' and B'
      • state = 'CA' - on B' only

      Back out from the RHS cardinality the selectivity of any duplicated predicate:

      rhs_adj = ∏ sel(shared predicate i)
      
      |RHS''| = |RHS'| / rhs_adj
      
      |LHS' >< RHS'| = |LHS'| * |RHS''| / |RHS|
      
                     = |LHS'| * |RHS'| / |RHS| / rhs_adj
      

      Fortunately, most of the pieces needed exist in the code; we just need to wire them up to allow the above computation:

      • Create set PR as the set of all predicates that were previously "assigned" to the RHS.
      • Similarly, create set PL for the LHS.
      • The LHS can itself be a join, so for each join, track all predicates assigned to either side of the join, or to the join itself.
      • Compute the set CP, the set of common predicates, as CP = PR ⋂ PL: the intersection of the two sets.

      With that, we can compute rhs_adj from above.
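The bookkeeping above might be sketched as follows (hypothetical names and selectivities, not Impala's actual classes):

```python
# Hypothetical sketch of the predicate-set bookkeeping. The predicate
# strings, selectivity values, and helper name are illustrative only.

def rhs_adjustment(pl, pr, selectivity):
    """rhs_adj = product of sel(p) over predicates assigned to BOTH sides."""
    cp = pl & pr                        # CP = PL ∩ PR
    adj = 1.0
    for p in cp:
        adj *= selectivity[p]
    return adj

selectivity = {"part_id = 123": 0.1, "state = 'CA'": 0.02}
PL = {"part_id = 123"}                        # predicates assigned to A'
PR = {"part_id = 123", "state = 'CA'"}        # predicates assigned to B'

rhs_adj = rhs_adjustment(PL, PR, selectivity)
print(rhs_adj)                                # 0.1 (only the shared filter)
```

The state = 'CA' predicate appears on only one side, so it is untouched; only the shared part_id filter is divided back out of the join estimate.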

      Existing Code

      As shown in IMPALA-8014, the code appears to attempt the above adjustment. But the code performs the adjustment in all cases, not just when predicates are redundant. Unfortunately, the result turns out not to work in either case.

People

    Assignee: Unassigned
    Reporter: Paul Rogers
    Votes: 0
    Watchers: 3