
IMPALA-7601: Improve cardinality and selectivity estimates


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: Impala 3.0
    • Fix Version: None
    • Component: Frontend
    • Labels: None

    Description

      Impala makes extensive use of table stats during query planning. For example, the NDV (number of distinct values) is used to compute selectivity: the degree of reduction (also called the reduction factor) provided by a predicate. Consider:

      SELECT * FROM t WHERE t.a = 10
      

      If we know that t.a has NDV=100, then we can predict (assuming a uniform distribution of values) that the above query will pick out one of those 100 values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity of the predicate t.a = 10 is 0.01.

      Selectivity is then used to compute cardinality: the number of rows in some relation (set of tuples). Cardinality is in turn used to pick among plan options, such as which relation to put on the probe vs. build side of a join.
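      To make the arithmetic concrete, here is a minimal sketch (hypothetical names, not Impala's actual planner API):

        // Hypothetical sketch of the arithmetic above, not Impala's
        // actual planner code.
        long ndv = 100;                    // NDV(t.a) from table stats
        double selectivity = 1.0 / ndv;    // p(t.a = 10) = 0.01
        long inputRows = 1_000_000;        // rows in t
        long outputRows = Math.round(inputRows * selectivity); // ~10,000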

      This ticket explains several cases in which Impala fails to produce a cardinality estimate, and proposes solutions.

      Problem Summary

      The goal of a planner is to estimate the cardinality of various relations (sets of tuples) in order to choose a good plan (which relation to put on the build side of a join?) and to make memory estimates (about how much memory is needed for a sort?). When planning, even crude estimates are fine, as the planner only chooses among a discrete set of options: the smaller table goes on the build side of a join regardless of how much smaller it is.

      It turns out, however, that Impala is quite finicky: it refuses to render a cardinality estimate if certain information is missing. That is, rather than use a poor estimate, Impala uses no estimate at all. Rather than produce a crude attempt at a plan, Impala flies blind in these cases.

      You can see this in PlanNode.computeCombinedSelectivity().

      The result is that Impala is a bit more strict than classic DB optimizers. If stats are present, they are used. If stats are not present, Impala flies blind.

      This ticket proposes a number of interrelated changes to add a priori (before observation) defaults for selectivity and NDV based on classic DB practice.

      Also, some of our existing estimates are overly simplified as discussed below.

      A Bit of Math

      Selectivity is a probability: the probability that a Boolean expression evaluates to true. That is:

      selectivity(c = x) = p((c = x) = true) = p(c = x)
      

      Where c is a column and x is a constant.

      Knowing this can help pick good selectivity values, and makes the following discussion a bit simpler.

      Use NDV to Compute p(c != x)

      Impala uses NDV to compute p(c = x):

      p(c = x) = 1 / NDV(c)
      

      As described in IMPALA-7560, Impala uses a selectivity of 0.1 for p(c != x). There is, however, a mathematical relationship between these operators we can exploit:

      p(c != x) = 1 - p(c = x)
      

      That is, if we have NDV(c) and can compute p(c = x), we can also compute p(c != x).
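      A minimal sketch of exploiting this identity (hypothetical helper, not the actual PlanNode code):

        // Hypothetical helper: derive the selectivity of "c != x" from
        // the NDV-based selectivity of "c = x".
        static double neSelectivity(long ndv) {
          double eqSelectivity = 1.0 / ndv;  // p(c = x) = 1 / NDV(c)
          return 1.0 - eqSelectivity;        // p(c != x) = 1 - p(c = x)
        }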

      Refine Default Selectivity Values

      Further, Impala assumes a selectivity of 0.1 for all relational operators other than equality. See the Impala source:

        // To be used where we cannot come up with a
        // better estimate (selectivity_ is -1).
        public static double DEFAULT_SELECTIVITY = 0.1;
      

      There is another mathematical relationship between these operators we can exploit to refine some of these estimates:

      p(c != x) = p(c < x) + p(c > x)
      

      As it turns out, we don't need the NDV to estimate an inequality. We can simply observe that an inequality roughly divides the data into two sets, of which the user picks one. The current default of 0.1 is probably too low, and 0.5 is probably too high. The Ramakrishnan and Gehrke book suggests rule-of-thumb estimates (sketched as constants below):

      • p(c < x), p(c <= x), p(c > x), p(c >= x) - 0.3
      • p(c BETWEEN x AND y) - 0.25
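      As constants, these defaults might look like this (hypothetical names and placement; the values are the R&G suggestions above, not Impala's current code):

        // Hypothetical constants based on the Ramakrishnan & Gehrke
        // rule-of-thumb values; not Impala's current defaults.
        public static final double DEFAULT_COMPARISON_SELECTIVITY = 0.3;  // <, <=, >, >=
        public static final double DEFAULT_BETWEEN_SELECTIVITY = 0.25;    // BETWEEN x AND y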

      Default Selectivity for p(c = x) and p(c != x) Without Stats

      All this is good. But what happens if statistics are not available for table t? How are we to know the selectivity of the predicate?

      Today, Impala simply refuses to pick a selectivity for p(c = x) if there are no stats. See BinaryPredicate.analyzeImpl().

      However, even without stats, Impala continues to use its defaults for all other relational operators. While it is true that, without NDV, we can't be certain of the selectivity of p(c = x), it is also true that we are happy to guess for all other operators.

      The Ramakrishnan and Gehrke book shows that the standard choice is to assume a selectivity of 0.1.

      Over in the Drill project, DRILL-5254 attempted to work out better estimates for other operators based on math and probability. However, the conclusion there was that, without NDV and histograms, there is more information in the user's intent than in the math. That is, if the user writes WHERE t.a != 10, there is a conditional probability that the user believes that this is a highly restrictive predicate, especially on big data. So, the reduction factor (which is a probability) is the same for = and != in the absence of information. The same reasoning probably led to the rule-of-thumb values in the R&G book.

      So, if no stats are available, use a default number for p(c != x) such as 0.1 or 0.7. (The larger number is probably better.)

      As a result of this change, we will always have a selectivity, even if just a rule-of-thumb one. That is, some information is better than none.
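      Put together, equality selectivity might be sketched as follows (hypothetical code; DEFAULT_SELECTIVITY is the existing 0.1 constant shown earlier, and -1 marks an unknown NDV):

        // Hypothetical sketch: prefer the NDV-based estimate; fall back
        // to a rule-of-thumb default rather than refusing to estimate.
        static double eqSelectivity(long ndv /* -1 if unknown */) {
          if (ndv < 0) return DEFAULT_SELECTIVITY; // no stats: guess, don't fly blind
          return 1.0 / Math.max(ndv, 1);           // guard against NDV = 0
        }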

      Special Case Boolean Columns

      There is one special case it may be worth exploiting. If a column is Boolean, then we know that it can have at most three values (true, false, null). So, even without stats, we can estimate NDV(c) = 3. This allows us to get better estimates for p(c = x) and p(c != x) than we'd get just using the defaults.
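      A sketch of the Boolean special case (hypothetical helper):

        // Hypothetical sketch: a Boolean column can hold at most three
        // values (true, false, NULL), so assume NDV = 3 when stats are
        // missing (-1 marks an unknown NDV).
        static long estimateNdv(boolean isBooleanColumn, long statsNdv) {
          if (statsNdv < 0 && isBooleanColumn) return 3;
          return statsNdv;
        }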

      Separate the NDV=0 and Unknown NDV Cases

      As described in IMPALA-7310, Impala does not currently handle the case of a table with stats, but a column in the table contains only NULL values. The stats mechanism defines NDV as "number of distinct non-null values". So, a table full of nulls has NDV = 0. Unfortunately, if no stats are available at all, we also have NDV = 0.

      IMPALA-7310 will fix this issue, removing one more case in which Impala did not produce a cardinality estimate.
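      As a sketch, once the two cases are distinguishable (say, via an explicit has-stats flag rather than an overloaded zero), the planner can treat them differently (hypothetical code):

        // Hypothetical sketch: distinguish "no stats" from "all NULLs".
        static double eqSelectivity(boolean hasStats, long ndv) {
          if (!hasStats) return DEFAULT_SELECTIVITY; // unknown: rule of thumb
          if (ndv == 0) return 0.0;  // entirely NULL: c = x never matches
          return 1.0 / ndv;
        }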

      Rationalize the Planner's and Stats' NDV Definitions

      Related to the above issue is the fact that the planner and the stats mechanisms use different definitions for NDV.

      • Planner: NDV includes nulls. This is seen when computing the NDV for constant expressions.
      • NDV function and thus stats: NDV does not include nulls.

      Currently, the planner takes the stats' "without nulls" NDV and mixes it with the planner's "with nulls" definition. IMPALA-7310 proposes a way to convert the stats' definition to the planner's definition.
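      One plausible conversion (a sketch of the idea, not necessarily the exact IMPALA-7310 patch): if the column contains any nulls, the planner's "with nulls" NDV is one more than the stats' "without nulls" NDV.

        // Hypothetical helper: convert the stats' "distinct non-null
        // values" NDV to the planner's "with nulls" NDV.
        static long plannerNdv(long statsNdv, long numNulls) {
          return (numNulls > 0) ? statsNdv + 1 : statsNdv; // NULL adds one value
        }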

      Estimate Row Counts

      The above will go a long way toward ensuring that the planner always renders a cardinality estimate. But one hole remains: IMPALA-7608 notes that cardinality estimates are undefined if stats are unavailable, because we don't know row counts. That ticket also explains a useful, common rule of thumb. Add that, and we will have covered all situations in which Impala currently refuses to produce a cardinality estimate.
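      As an illustration only (IMPALA-7608 has the actual proposal; this may not be the same formula), one common rule of thumb derives a row count from the scanned bytes and an estimated row width:

        // Illustrative sketch: estimate row count from file size when
        // stats are missing. Not necessarily the IMPALA-7608 formula.
        static long estimateRowCount(long totalBytes, long estRowWidthBytes) {
          return totalBytes / Math.max(estRowWidthBytes, 1);
        }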

      Handle Large Cardinality

      IMPALA-7604 notes that we use a long to hold a cardinality estimate, which can overflow, causing incorrect estimates. As we add the changes above, we will put more confidence in cardinality estimates, so we need to fix the overflow: wrong numbers in extreme cases reduce the value of the whole exercise.
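      A minimal sketch of overflow-safe arithmetic (hypothetical helper):

        // Hypothetical helper: multiply two cardinality estimates,
        // saturating at Long.MAX_VALUE instead of wrapping on overflow.
        static long saturatingMultiply(long a, long b) {
          try {
            return Math.multiplyExact(a, b);
          } catch (ArithmeticException e) {
            return Long.MAX_VALUE;
          }
        }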

      Remove Special Handling for Undefined Selectivity and Cardinality

      Once all of the above are available, every plan node will compute a cardinality. The existing code to handle cardinality = -1 (undefined) can be removed. Tests can be updated to enforce that all plans will have an estimated cardinality.

      References

      • Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, 3rd Edition, McGraw-Hill, 2003.

People

    • Assignee: Unassigned
    • Reporter: Paul Rogers