[IMPALA-8217] Remove scan operator exponential backoff in cardinality calcs - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.1.0
Fix Version/s: None
Component/s: Frontend
Labels:
None

Epic Color:
ghx-label-3

Description

The planner's scan nodes use an exponential back-off to compute the joint filter selectivity:

    double result = 1.0;
    for (int i = 0; i < selectivities.size(); ++i) {
      // Exponential backoff for each selectivity multiplied into the final result.
      result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
    }

The reasoning is:

  /*
   * 2. Two selectivities, whether known or unknown, could be correlated. Assuming
   *    independence can lead to significant underestimation.
   *
   * The second issue is addressed by an exponential backoff when multiplying each
   * additional selectivity into the final result.
   */

The logic makes some kind of sense, especially since Impala calculates selectivity only for one kind of predicate: col = const, and ignores selectivity for all others. (Actually, replaces all others with a single combined selectivity estimate of 0.1.}}

The problem, however, is that this violates the commutative and linear models that underly relational theory. In general, relational theory says it does not matter if we join then filter, or filter, then join. If we have two filters, f1 and f2, it does not matter if we apply f1 in the scan and f2 after the join: the resulting result set is the same. (Performance differs, but we ignore that when computing cardinality.)

With exponential back-off, this no longer applies. It becomes difficult, say, to address the issue in IMPALA-XXXX in which we need to apply filter selectivity at the scan level, then remove it again at the join level. The reason is that the cardinality applied at the scan level may have been backed off so simply using the expression's report of its cardinality may not be the value used by the scan.

Also, oddly, the same expression could apply different selectivity in different scans. The code sorts selectivities in ascending order. So, in scan 1, filter f1 might apply its full selectivity, while in scan 2 (assuming a correlated filter), it might apply the square or cube of the selectivity.

Solution

The solution here is simply to:

Compute the selectivity for all predicates so we don't have to fudge.
Remove correlated predicates from join calcs (See IMPALA-XXXX).
Remove exponential back-off calculations so that filter selectivity can be properly decomposed for the above task.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Paul Rogers

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 19/Feb/19 18:44

Updated:: 21/Feb/19 23:57