IMPALA-8048

Improve join cardinality estimation: urn model, NDV tracking, etc.


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels: None

    Description

      Work is underway on a number of JIRA tickets to improve cardinality estimates. That work is constrained by the possible need to back-port to prior releases. As a result, the changes are made within the existing context to minimize the impact.

      The current model, however, makes a number of naive assumptions that should be addressed in a second batch of changes, which will entail a wider code impact.

      Adopt the Urn Model for NDV Estimation

      Suppose we have a table alumni(name, sex, class) with values such as:

      John Smith, M, 2008
      Jane Doe, F, 1993
      ...
      

      We have 50 years of data, 1000 rows per year, or 50K rows. We have these stats:

      |alumni| = 50K
      |name| = 49K
      |sex| = 2
      |class| = 50
      

      We have the following query which fills in the graduation date for each class:

      select * from alumni, grad_dates where sex='F' and alumni.class = grad_dates.class
      

      Focusing just on the alumni table, how many classes will be available to match? That is, what is |class'|, the NDV of the class field after accounting for the effect of the predicate sex='F'?

      Today we work it out with a linear model as follows:

      sel(sex = 'F') = 1/|sex| = 1/2 = 0.5
      |sex'| = |sex| * sel(sex = 'F') = 2 * 0.5 = 1
      |class'| = |class| * sel(sex = 'F') = 50 * 0.5 = 25
      

      The math works for the sex field: the correct adjusted NDV is 1.

      What about class? The linear model says that, since the predicate eliminated half the rows, it eliminated half the class values. But this can't be right: surely women graduated in every class. What went wrong?

      The problem is the linear assumption. As shown in the Swami and Schiefer paper, Section 5, the correct estimation technique is the urn model. See the paper for details. Using that model:

      |x'| = |x| * (1 - (1 - 1/|x|)^|T'|)
      
      |alumni'| = |alumni| * sel(sex = 'F') = 50K * .5 = 25K
      |class'| = |class| * (1 - (1 - 1/50) ^ 25K) = 50
      

      That is, as the cardinality of the filtered table grows larger, the probability approaches 1 that other, non-correlated values will still appear. Thus, though we remove half the rows, all the classes are still represented.
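      As a sanity check, here is a minimal standalone sketch of the two estimators applied to the numbers above. The class and method names are illustrative only; this is not Impala planner code.

      // Sketch: linear vs. urn-model NDV adjustment (illustrative names).
      public class NdvEstimates {

        // Linear model: NDV shrinks in proportion to the rows removed.
        static double linearAdjustedNdv(double ndv, double selectivity) {
          return ndv * selectivity;
        }

        // Urn model (Swami & Schiefer): a distinct value survives |T'| draws
        // with probability 1 - (1 - 1/|x|)^|T'|.
        static double urnAdjustedNdv(double ndv, double filteredRowCount) {
          return ndv * (1.0 - Math.pow(1.0 - 1.0 / ndv, filteredRowCount));
        }

        public static void main(String[] args) {
          double classNdv = 50;
          double selectivity = 0.5;                    // sel(sex = 'F')
          double filteredRows = 50_000 * selectivity;  // |alumni'| = 25K

          System.out.println(linearAdjustedNdv(classNdv, selectivity)); // 25.0
          System.out.println(urnAdjustedNdv(classNdv, filteredRows));   // ~50.0
        }
      }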

      Per-Tuple Column NDV Tracking

      At present, after the current round of changes, we use a linear model to estimate column NDV after filtering, and we use the same model for all columns. If we adopt the urn model, then we must treat columns separately. In the above, we do not want to apply the urn model to the sex column. Why? We already know its cardinality from the filter predicate; we don't want to replace it with an estimated urn-model value. This problem is more acute if you consider a range predicate, such as those used on partitions: class > 2009.

      To make the above work, we have to track NDV per column. That is, the scan node must provide a list of columns and their NDVs after scanning. Columns mentioned in a predicate have their NDVs estimated from selectivity. All other columns have their NDVs estimated from the urn model. (There are several ways to implement this; the point is that some columns must be singled out for special treatment.)
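      One possible shape for this bookkeeping is sketched below with hypothetical names (the real change would live in the scan-node planning code): a per-column adjusted-NDV map, populated from selectivity for predicate columns and from the urn model for everything else.

      // Hypothetical per-column NDV tracking at a scan node; names are
      // illustrative, not Impala's actual classes.
      import java.util.HashMap;
      import java.util.Map;

      class ScanNdvTracker {
        // Adjusted NDV for each output column of the scan.
        private final Map<String, Double> columnNdv = new HashMap<>();

        // Columns bound by a predicate: NDV is known from selectivity.
        void setPredicateColumnNdv(String column, double ndvFromPredicate) {
          columnNdv.put(column, ndvFromPredicate);
        }

        // All other columns: NDV is estimated with the urn model.
        void setUrnColumnNdv(String column, double baseNdv, double filteredRows) {
          double adjusted =
              baseNdv * (1.0 - Math.pow(1.0 - 1.0 / baseNdv, filteredRows));
          columnNdv.put(column, adjusted);
        }

        double ndv(String column) {
          return columnNdv.get(column);
        }
      }

      For the example, setPredicateColumnNdv("sex", 1) records the NDV implied by the predicate, while setUrnColumnNdv("class", 50, 25000) yields roughly 50.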

      Proper Join-to-Table Join Column NDVs

      The NDV adjustment model says that, to compute the join cardinality, we need the adjusted column cardinality (NDV). When joining one table to another, it is clear how to adjust the column NDVs for each table: each is done according to the rules spelled out above.

      A complexity arises, however, when we want to join three tables: we have ((A ⋈ B) ⋈ C). How do we adjust the NDVs for the columns produced by the (A ⋈ B) join? If we simply adjust the NDV of table columns using a common selectivity (as done in the simple linear model), then we are correct for the columns from one table, but wrong for columns from the other. Why? The two tables had different selectivities applied; we can't reduce them to a common number.

      The solution is per-column adjusted NDV tracking: we'd know to apply one set of adjustments for columns from the left table, and another for the right.

      This requires additional data structures in each plan node.
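      As a sketch (again with hypothetical structure and names), each join output could carry the merged per-column NDV map of its inputs, so a later join such as ((A ⋈ B) ⋈ C) can look up the side-specific NDV of whatever column it joins on. The join-key NDV is taken here as the min of the two sides' adjusted NDVs, the usual convention for an equi-join.

      // Hypothetical sketch: build a join's per-column NDV map from its
      // inputs' maps. Column names are assumed unique across inputs.
      import java.util.HashMap;
      import java.util.Map;

      class JoinNdvPropagation {
        static Map<String, Double> joinOutputNdvs(
            Map<String, Double> leftNdvs, Map<String, Double> rightNdvs,
            String leftJoinCol, String rightJoinCol) {
          // Output NDV of the join key: min of the two adjusted input NDVs.
          double joinKeyNdv =
              Math.min(leftNdvs.get(leftJoinCol), rightNdvs.get(rightJoinCol));

          Map<String, Double> out = new HashMap<>(leftNdvs);
          out.putAll(rightNdvs);
          out.put(leftJoinCol, joinKeyNdv);
          out.put(rightJoinCol, joinKeyNdv);
          return out;
        }
      }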

      People

            Assignee: Unassigned
            Reporter: Paul Rogers