Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8014

Incorrect M:1 (FK/PK) cardinality estimation

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 3.1.0
    • None
    • Frontend
    • None
    • ghx-label-1

    Description

      The math is wrong when we estimate the cardinality of a join we've labeled as "FK/PK" (commonly known as many-to-one or M:1.)

      Join Estimation

      The standard calculation for joins is explained in Swami & Schiefer, On the Estimation of Join Result Sizes (S&S). Note especially section 3, Background. Also see How Good Are Query Optimizers, Really? by Leis et al.

                      |L'| * |R'| 
      |L' ⋈ R'| = ------------------
                  max(|L.k'|, |R.k'|)
      

      Here:

      • |x| is the row cardinality if x is a relation, the NDV (domain cardinality) if x is a column.
      • L, R are left and right input relations.
      • L.k and R.k are the (possibly compound) join keys.
      • x’ is the result of applying a filter to relation or column x.

      Intuitively, the cardinality of the join is the Cartesian product reduced by the largest column domain. Since the relations fed into the join are filtered, we are concerned with the new, filtered relations created by the scan nodes.

      Impala, like most planners, makes three basic assumptions (see the first paper above):

      • Uniformity: keys are evenly distributed.
      • Independence: the filtering on the two tables is independent of keys.
      • Containment: that all detail (FK) rows have a corresponding master (FK) row.

      (These are traditional assumptions, but they turn out to be unrealistic in many cases; see the second paper above.)

      M:1 Case

      The planner (unnecessarily) divides join planning into two cases. (IMPALA-8018 describes how the two steps are unnecessary. But, to minimize code change, we live with this process here.)

      • M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK) in the detail table matches at most one "primary key" (PK) in the master table. ("At most one" because of filtering which may have removed master keys.) Impala assumes that the detail ("FK") table is on the left (probe) side, and the master ("PK") table is on the right ("build") side.
      • M:N (AKA "generic", "many-to-many"). Every key on the left matches potentially many keys on the right.

      If we focus on just the M:1 case we can rename the relations for convenience and observe a simplification:

      L = D
      R = M
      L.k = M.pk
      R.k = D.fk
      

      By definition of M:1:

      |M.pk'| = |M'|
      

      Which gives a revised join expression:

                      |D'| * |M'| 
      |D' ⋈ M'| = ------------------
                  max(|D.fk'|, |M'|)
      

      The expression above is handy: it shows that we only need three values to compute the join cardinality:

      • |D'| the output cardinality from the left plan node, which is already available.
      • |M'| the output cardinality from the right scan node, which is already available. (In Impala, the right node will always be a table scan.)
      • |D.fk'| the number of foreign key values in the filtered D' relation. (In relational theory terms, the cardinality of the domain of the foreign key after applying pushed-down selection operations.)

      See IMPALA-XXXX for a necessary adjustment to |M'| to handle predicates common to both sides.

      We only need to estimate |D.fk'|. We will work up to this step by step because the intermediate steps help explain the bug in the current code.

      Filtering of Master Table Only

      Let's start with the simplest case, filtering on only the master table:

      |D.fk'| = |D.fk| <= |M.pk|
      
      max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M|
      

      So:

         
      |D ⋈ M'| = |D| * |M'| / |M|
      

      Intuitively: the probability of any detail row finding a match is simply the selectivity of the master filter, or |M'| / |M|.

      Filtering on the Foreign Key Column

      The next case is also fairly easy. Suppose we know that the left (detail) scan applied a filter. Impala presently uses an incorrect, but simplified, model for this case. (See IMPALA-XXXX for the correct model.)

      The cardinality of the foreign key is simply the result of applying that filter:

      |D.fk'| = |D.fk| * sel(f)
      

      Let us assume we do not filter the master table in this case, so:

      |D.fk'| < |D.fk| <= |M.pk|
      max(|D.fk'|, |M|) = |M|
      

      And:

      |D' ⋈ M| = |D'| * |M| / |M| = |D'|
      

      We assume "containment" from the S&S paper above: all the foreign keys have a matching primary key if |F.fk| <= |M.pk|. So, if we filter only on the detail table, all foreign keys find a match and the cardinality of the join is just the cardinality of the left input relation.

      Combined Left and Right Filtering

      Now, let's combine the master and detail filtering cases. In general, we will have filtering on both sides. The max in the expression automatically combines the cases:

      |D.fk'| = |D.fk| * |D'| / |D|
      
                      |D'| * |M'| 
      |D' ⋈ M'| = ------------------
                  max(|D.fk'|, |M'|)
      

      Intuitively, the number of rows is the Cartesian product divided by the larger of the number of keys in either table after the scan. Said another way, each detail row finds a master, unless filtering has removed so many masters that some detail rows find no match, in which case the probability of a match is |M'| / |D.fk'|.

      Compound Primary Keys

      The final complexity is to consider a compound key. That is:

      (D.fk1, Dfk2) --> (M.pk1, M.pk2)
      

      The foreign key pair points to a matching primary key pair. Here we consider only pairs, but the logic is the same for a compound key with any number of columns. Obviously, by the definition of a join in SQL, the number of keys on each side must be the same.

      If we know (from HMS metadata) that the above pairs are, in fact, the keys, then we can make a simplifying assumption:

      |(M.pk1, M.pk2)| = |M|
      
      |(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M|
      

      If we don't know a-priori that a pair (k1, k1) is a foreign or primary key, then we can estimate its cardinality (assuming independence of values) as:

      |(k1, k2)| = |k1| * |k2|
      

      In the M:1 case, the primary key cardinality must be the same as table cardinality. If the two columns are completely independent, then:

      |(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M|
      

      More typically, there is some correlation between the columns so:

      |(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2|
      

      Indeed, Impala uses the above relation to decide we have the FK/PK case. Said another way, if Impala follows the FK/PK logic, then we can simply assume that |M.pk| = |M| even if the key is compound.

      Compound Foreign Keys

      Foreign keys require a bit more thought. We could have a detail table with only one row. If we have many detail rows, the assumption of containment says that each detail record points to some master record, so:

      |D.fk| <= |M.pk|
      

      This lets us estimate the cardinality of a compound foreign key as:

      |(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| )
      

      That is, if the detail table is small, the left term is a reasonable estimate (ignoring the urn model issue). But, as the table gets larger, and begins to include most primary keys, we know that the number of foreign keys can't be larger than the number of primary keys, so the right term is the better estimate.

      Filtering on Compound Keys

      Suppose that the join inputs have filtering applied to the left (detail) table. We discussed how to handle this fo a single column. For a compound key, we can observe that the cardinality of the key is the product of the cardinality of the columns, but (using the containment assumption), no larger than the cardinality of the primary key (which is the cardinality of the master table.)

      So:

      |D.fk| = min( ∏ |D.fki|, |M| )
                                 
      |D.fk'| = |D.fk| * |D'| / |D|
      
              = min( ∏ |D.fki|, |M| ) * |D'| / |D|
      

      So the final expression for join cardinality is:

      |D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D|
      
      
                      |D'| * |M'| 
      |D' ⋈ M'| = -------------------
                  max( |D.fk'|, |M'|)
      

      The first expression says that the compound foreign key cardinality is either the product of the columns that make up the key, or the cardinality of the primary key, whichever is less. We then adjust that amount (incorrectly) by the percentage of the detail table that is scanned.

      The second expression is just the Cartesian product divided by the largest key cardinality: either foreign key or primary key (which is, by definition, equal to the cardinality of the master table.)

      Complexity: Compound Joins

      The above expression works well if the left input to a join is a base table row which we know the original table cardinality that corresponds to the original column NDV. The above can be used as-is in a M:N ("generic") join for a right-side table. But, when when used on the left side, we must recall that Impala builds left-deep join plans, so the left side may be a join. In this case, there is no original left-side base table.

      See IMPALA-XXXX for discussion of the complexities in this case. A quick and dirty solution is to use the scan output cardinality in place of the left input cardinality. That is:

      • Determine the table that contains the join key.
      • Search the left subtree for the scan for that table.

      To do this, start with the left input:

      • If the node is a scan node, and the scan is for the target table, return the output cardinality of that scan.
      • If the node is a join, then apply the search to both sides of the join.

      This approach is cumbersome, and may run into complexities if the node is something other than a scan or join. It also may underestimate if a filter is applied at the join level.

      A cleaner, tough more involved, solution is to track adjusted NDV for each column through each operator as described in IMPALA-8220.

      Code Bug

      The commit "IMPALA-5547: Rework FK/PK join detection", ID 9f678a74269250bf5c7ae2c5e8afd93c5b3734de on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK and PK columns. This has two problems:

      • It is unnecessary, as we saw above.
      • The math is wrong and produces bogus estimates.

      Here is the code in JoinNode.java:

          long result = -1;
          for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
            // Adjust the join selectivity based on the NDV ratio to avoid underestimating
            // the cardinality if the PK side has a higher NDV than the FK side.
            double ndvRatio = 1.0;
            if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
            double rhsSelectivity = Double.MIN_VALUE;
            if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
            long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
            if (result == -1) {
              result = joinCard;
            } else {
              result = Math.min(result, joinCard);
            }
          }
          // FK/PK join cardinality must be <= the lhs cardinality.
          result = Math.min(result, lhsCard);
      

      The above is hard to follow, which may account for why the bug was not caught. Ignoring some corner cases, the logic is essentially:

      lhsCard = |D'|
      
      ndvRatioi = |D.fki| / |M.pki|
      
      rhsSelectivityi = |M'| / |M|
      
      joinCardi = lhsCard * ndvRatioi * rhsSelectivityi
      
                = |D'| * (|D.fki| / |M.pki|) * |M'| / |M|
      
      |join| = min(joinCardi)
      
             = (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|)
      

      Though hard to see, the above is not equivalent to the logic worked out in the previous section. Using the correct expression from earlier sections:

                                  |D'| * |M'| 
      |D' ⋈ M'| = ----------------------------------------------
                  max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
      
      

      We can factor out the common |D'| * |M'| terms and compare:

      min(|D.fki| / |M.pki|)                     1
      ----------------------  !=  ----------------------------------------------
              |M|                 max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
      
      
                   1                                  1
      ----------------------------  !=  ----------------------------------------------
      |M| * max(|M.pki| / |D.fki|)      max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
      
      
      |M| * max(|M.pki| / |D.fki|)  !=  max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
      

      There is no valid operation that can convert one side to the other, so they are unequal.

      It is likely that the code's version attempts to work around issues elsewhere in the calculations (such as ignoring some predicates, using exponential back-off for filters, not having a good estimate for |D|, etc.)

      Longer-Term Fix

      The above simple fix is the target of this ticket. Longer term, the code should evolve to use a single path for both the M:1 and M:N cases since as described in IMPALA-8018. (Both cases start with HMS data. Currently we use two paths to arrive at the same result. IMPALA-8018 suggests we need only one path.) We should also adopt the simple urn model as described in IMPALA-8218.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Paul.Rogers Paul Rogers
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated: