  IMPALA-8045

Rollup of Smaller Join Cardinality Issues


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: Impala 3.1.0
    • Component: Frontend

    Description

      The work to review join cardinality has found some major issues recorded as JIRA tickets. This ticket records a number of smaller issues. Some of these issues are a bit tricky because they appear only when some of the other issues are resolved. Reporting them directly could be misleading.

      ScanNode confusion between table and scan input cardinality

      The ScanNode class contains an inputCardinality_ field that join calculations use as a proxy for the table size. However, the actual scan node implementations set inputCardinality_ to the estimated number of rows read by the scan, which is useful for understanding the physical scan structure. But for joins, we need the base table cardinality.

      For example, the join may use the input cardinality to understand the reduction in rows due to filters in order to adjust the NDV of key columns. But, since the input cardinality is the scan count, not the table row count, the math does not work out.

      The solution is to clarify the code so that the scan row count and the base table row count are represented separately.
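      As a rough sketch of that separation (the field and method names below are hypothetical, not the current Impala API), the scan node could carry both values and expose them separately:

        // Hypothetical sketch: keep the base table row count separate from the
        // number of rows the scan is expected to read, so that join planning can
        // use the former while scan costing uses the latter.
        abstract class ScanNodeSketch {
          // Rows in the underlying table, taken from table stats.
          protected long tableNumRows_ = -1;
          // Rows this scan is expected to read after partition/key-range pruning.
          protected long inputCardinality_ = -1;

          /** Base table cardinality, for join NDV and selectivity math. */
          public long getTableNumRows() { return tableNumRows_; }

          /** Estimated rows actually read, for costing the scan itself. */
          public long getInputCardinality() { return inputCardinality_; }
        }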

      Selectivity Confusion

      Similarly, each node computes its selectivity. However, that selectivity covers only the predicates that will be evaluated against scanned rows. Predicates that are applied through partition pruning (HDFS), key range pruning (HBase), and so on do not "count". While this produces accurate execution estimates, it is not helpful for join planning.

      In join planning, we need to know the number of filtered rows relative to the total table cardinality. This allows us to adjust the NDV of the key columns in order to estimate the number of rows produced by the join.

      Using the partial selectivity, or the partial input cardinality (the issue above), causes inaccurate key cardinality adjustments and incorrect join cardinality estimates.
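      A minimal sketch of the intended math, assuming we have both the base table row count and the row count surviving all predicates (the names below are hypothetical):

        // Hypothetical helper, not Impala code: adjust a join key's NDV for the
        // filtering applied to its table. Assumes filteredRows counts rows that
        // survive ALL predicates, including partition and key-range pruning.
        final class NdvAdjustmentSketch {
          static double adjustKeyNdv(double keyNdv, double tableRows, double filteredRows) {
            if (keyNdv <= 0 || tableRows <= 0 || filteredRows < 0) return keyNdv;
            // Overall selectivity relative to the whole table, not just the scan.
            double sel = Math.min(1.0, filteredRows / tableRows);
            // Simple proportional heuristic, capped by the surviving row count,
            // since an NDV can never exceed the number of remaining rows.
            return Math.max(1.0, Math.min(keyNdv * sel, filteredRows));
          }
        }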

      Join Node Does not Apply Selectivity from Its Predicates

      A join node can have "additional predicates" that are applied after a join row is created. Accurate estimation of join cardinality must include the selectivity of those predicates, but this is not currently done, perhaps because such predicates always produce an estimated selectivity of 0.1 in the current estimation scheme. This will become more important as we add more realistic estimates.
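      A hedged sketch of what folding in the join's own predicates could look like (the class and method names are illustrative, not the current JoinNode code):

        // Illustrative only: after estimating the raw join output from the hash
        // (equi-join) predicates, apply the combined selectivity of the "other"
        // predicates that are evaluated on each joined row.
        final class JoinPredicateSelectivitySketch {
          static double applyOtherPredicates(double joinCardinality,
              double[] otherPredicateSelectivities) {
            double card = joinCardinality;
            for (double sel : otherPredicateSelectivities) {
              // An unknown selectivity (< 0) falls back to a default such as 0.1.
              card *= (sel >= 0) ? sel : 0.1;
            }
            return card;
          }
        }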

      Use Double, not Long for Cardinality Values

      In scan nodes, row counts are reasonable numbers and a Java long is fine. But once we start computing join cardinalities, values can grow fast, especially for cross joins. The code currently has special checks to limit products to Long.MAX_VALUE. While this solves the overflow issue, it has undesirable downstream effects. First, it throws off selectivity calculations, since the reported cardinality is not the real cardinality. Second, it requires special math calls whenever we multiply cardinalities.

      It is much simpler to work with a double. When values get large, the extra precision of an integer value is completely lost in the noise of assumptions and estimations.
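      A small comparison of the two approaches; the saturating helper below is a stand-in for the kind of overflow-checked multiply the code uses today, not a specific Impala API:

        // Contrast: saturating long multiplication vs. plain double multiplication.
        final class CardinalityMathSketch {
          // Stand-in for an overflow-checked helper that clamps at Long.MAX_VALUE.
          static long saturatingMultiply(long a, long b) {
            try {
              return Math.multiplyExact(a, b);
            } catch (ArithmeticException e) {
              return Long.MAX_VALUE;  // Clamped: the reported value is no longer real.
            }
          }

          public static void main(String[] args) {
            long bigScan = 5_000_000_000L;
            // A cross join of two large inputs overflows a long and gets clamped,
            // which then distorts any selectivity computed from the result.
            System.out.println(saturatingMultiply(bigScan, bigScan));  // 9223372036854775807
            // A double keeps the magnitude; the lost integer precision is noise.
            System.out.println((double) bigScan * (double) bigScan);   // 2.5E19
          }
        }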

      Revisit Cardinality Calcs for Join Nodes

      The method JoinNode.computeStats() is a bit muddled. It starts by computing cardinality based on the major type family (semi join, inner/outer join, cross join), then revises those calculations based on the specific join type. This makes the logic very hard to follow because we have to trace two distinct blocks of code. There is also redundancy; the cross join cardinality is calculated twice, for example.

      Refactor to have a cardinality/selectivity calculation per join type.
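      One possible shape for that refactoring, as a rough sketch; the enum values and formulas below are placeholders rather than Impala's actual JoinOperator handling:

        // Placeholder sketch: one cardinality estimate per join type instead of a
        // type-family pass followed by per-type corrections.
        final class JoinCardinalityEstimatorSketch {
          enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER, LEFT_SEMI, CROSS }

          static double estimate(JoinType type, double lhsCard, double rhsCard, double keyNdv) {
            switch (type) {
              case CROSS:       return lhsCard * rhsCard;  // computed in exactly one place
              case INNER:       return equiJoinCard(lhsCard, rhsCard, keyNdv);
              case LEFT_OUTER:  return Math.max(lhsCard, equiJoinCard(lhsCard, rhsCard, keyNdv));
              case RIGHT_OUTER: return Math.max(rhsCard, equiJoinCard(lhsCard, rhsCard, keyNdv));
              case FULL_OUTER:  return Math.max(lhsCard + rhsCard,
                  equiJoinCard(lhsCard, rhsCard, keyNdv));
              case LEFT_SEMI:   return Math.min(lhsCard, equiJoinCard(lhsCard, rhsCard, keyNdv));
              default:          throw new IllegalStateException("unhandled join type: " + type);
            }
          }

          // Textbook equi-join estimate: |L| * |R| / NDV of the join key.
          private static double equiJoinCard(double lhsCard, double rhsCard, double keyNdv) {
            return keyNdv > 0 ? lhsCard * rhsCard / keyNdv : lhsCard * rhsCard;
          }
        }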

      Disallow Unknown Cardinality

      Multiple nodes can produce a cardinality of -1 (unknown). Since it is impossible to plan based on an unknown cardinality, we must have an estimate, however good or bad. For cases where we have no stats, estimate cardinality based on other factors. If we have no column NDV, perhaps guesstimate something, or use an alternative join calculation that avoids the need for NDV (while producing much cruder estimates). However, refusing to play the game at all is not helpful unless we choose to fail the query for lack of stats.
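      A hedged example of such a fallback chain; the row-width divisor and the last-resort constant below are invented for illustration:

        // Illustrative fallback: never return -1 (unknown); degrade gracefully instead.
        final class CardinalityFallbackSketch {
          static double estimateTableRows(long statsNumRows, long totalFileBytes,
              int estimatedRowWidthBytes) {
            if (statsNumRows >= 0) return statsNumRows;  // best case: real table stats
            if (totalFileBytes > 0 && estimatedRowWidthBytes > 0) {
              // Next best: guess the row count from on-disk size and an estimated row width.
              return (double) totalFileBytes / estimatedRowWidthBytes;
            }
            return 1_000_000;  // last resort: an arbitrary, clearly labeled guess
          }
        }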

      Revisit Join Cardinality Limit

      The JoinNode has several methods that limit cardinality:

        public void computeStats(Analyzer analyzer) {
          ...
              cardinality_ = capCardinalityAtLimit(cardinality_);
          ...
        }
      
        public boolean hasLimit() { return limit_ > -1; }
      
        protected long capCardinalityAtLimit(long cardinality) {
          if (hasLimit()) {
            return capCardinalityAtLimit(cardinality, limit_);
          }
          return cardinality;
        }
      

      It is not clear when or why we apply a limit. Perhaps as part of LIMIT x processing? Revisit whether the limit is helpful, and remove it if not. Imposing a limit throws off downstream joins, which is probably not what is wanted here.

      Properly Handle Duplicated Filters in Outer Joins

      Consider the following query on TPC-H, which picks 1/3 (newer version) or 1/10 (older version) of the orders and then joins them with their customers:
      select c.c_custkey, o.o_orderkey
      from tpch.customer c
      left outer join tpch.orders o on c.c_custkey = o.o_custkey
      where o.o_clerk < 'foo'
      

      The plan produced places the WHERE clause predicate on both the scan and join nodes:

      PLAN-ROOT SINK
      |
      02:HASH JOIN [RIGHT OUTER JOIN]
      |  hash predicates: o.o_custkey = c.c_custkey
      |  other predicates: o.o_clerk < 'foo'           <== Huh?
      |  row-size=51B cardinality=163.35K
      |
      |--00:SCAN HDFS [tpch.customer c]
      |     partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
      |
      01:SCAN HDFS [tpch.orders o]
         partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=495.00K
         predicates: o.o_clerk < 'foo'                  <== Obvious location
      

      The filter must be applied twice because the outer join will produce null order rows. The filter won't match if the clerk field is null, so the predicate is applied again.

      However, the meaning of the predicate in the join is "remove null rows" which kind of means to undo the "outer". Given that, such a query is rather meaningless.

      The point here is to properly account for the predicate. Simply applying the sel(predicate) value twice is clearly wrong. What we actually want is to apply the join-level copy of the predicate only to the null rows generated by the outer join, to avoid double-accounting.
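      A hedged sketch of that accounting, using the numbers from the plan above plus a made-up count of null-extended rows:

        // Illustrative accounting for a null-intolerant WHERE predicate that is
        // duplicated onto an outer join.
        final class OuterJoinDuplicateFilterSketch {
          public static void main(String[] args) {
            double predicateSel = 0.33;        // sel(o_clerk < 'foo'), already charged at the scan
            double matchedRows = 495_000;      // surviving orders, each assumed to find its customer
            double nullExtendedRows = 10_000;  // customers whose orders were all filtered out (made up)

            // Wrong: charging sel(predicate) a second time on the whole join output.
            double doubleCounted = (matchedRows + nullExtendedRows) * predicateSel;

            // Intended: the join-level copy of the predicate only eliminates the
            // null-extended rows, where o_clerk is NULL and "< 'foo'" is never true.
            double intended = matchedRows;

            System.out.printf("double-counted=%.0f intended=%.0f%n", doubleCounted, intended);
          }
        }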

      This is an obscure corner-case; the question is whether we need to account for this kind of issue in multiple places. If we do, we need a more sophisticated set of data models to account for predicates than is currently used.

      See test case in card-joins.test that references this ticket number for an example.

      Better Handling of OUTER JOIN with Column Filter

      Consider:

      SELECT c.c_name, o.o_orderkey
      FROM customers c RIGHT OUTER JOIN orders o ON c.c_custkey = o.o_custkey
      WHERE c.c_name = 'Bob'
      

      The query runs the "c.c_name = 'Bob'" predicate twice: once in the scan and a second time in the join. Why in the join? To check whether the null left-side (customer) columns match the predicate (which they don't). The revised code handles this more-or-less correctly using the NDV of "c_name". But doing so is subtly wrong.

      We want to know the effect of doing the filtering after the join. We know that all the rows that match have the name "Bob", so the NDV should be 1. The only non-"Bob" rows are the null rows inserted by the join. So we should use logic that is aware of this case to provide a better estimate.

      On the other hand, if the predicate were "WHERE c.c_name IS NULL", the results would be far different. So the outer join logic should also be aware of the meaning (null handling) of the predicate. In this case, all the inserted null rows match, so we should not use the null count or NDV to guess the filtering.

      Point is, this area is subtle and can't be brute-forced.
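      Still, as a rough sketch, the estimate needs roughly this kind of case analysis (the null-intolerance flag and names below are placeholders):

        // Placeholder sketch: estimate how a predicate duplicated onto an outer join
        // filters the join output, depending on how it treats the null-extended rows.
        final class OuterJoinPredicateSketch {
          static double estimateSurvivors(double matchedRows, double nullExtendedRows,
              boolean predicateIsNullIntolerant, double selOnMatchedRows) {
            if (predicateIsNullIntolerant) {
              // e.g. c_name = 'Bob': already applied at the scan, so every matched row
              // qualifies (the NDV of c_name is effectively 1 there); only the rows the
              // join null-extended are removed.
              return matchedRows;
            }
            // e.g. c_name IS NULL: every null-extended row qualifies, and whatever
            // selectivity the predicate has on the matched rows applies only to them.
            return matchedRows * selOnMatchedRows + nullExtendedRows;
          }
        }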

          People

            Assignee: Unassigned
            Reporter: Paul Rogers