Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8217

Remove scan operator exponential backoff in cardinality calcs

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 3.1.0
    • None
    • Frontend
    • None
    • ghx-label-3

    Description

      The planner's scan nodes use an exponential back-off to compute the joint filter selectivity:

          double result = 1.0;
          for (int i = 0; i < selectivities.size(); ++i) {
            // Exponential backoff for each selectivity multiplied into the final result.
            result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
          }
      

      The reasoning is:

        /*
         * 2. Two selectivities, whether known or unknown, could be correlated. Assuming
         *    independence can lead to significant underestimation.
         *
         * The second issue is addressed by an exponential backoff when multiplying each
         * additional selectivity into the final result.
         */
      

      The logic makes some kind of sense, especially since Impala calculates selectivity only for one kind of predicate: col = const, and ignores selectivity for all others. (Actually, replaces all others with a single combined selectivity estimate of 0.1.}}

      The problem, however, is that this violates the commutative and linear models that underly relational theory. In general, relational theory says it does not matter if we join then filter, or filter, then join. If we have two filters, f1 and f2, it does not matter if we apply f1 in the scan and f2 after the join: the resulting result set is the same. (Performance differs, but we ignore that when computing cardinality.)

      With exponential back-off, this no longer applies. It becomes difficult, say, to address the issue in IMPALA-XXXX in which we need to apply filter selectivity at the scan level, then remove it again at the join level. The reason is that the cardinality applied at the scan level may have been backed off so simply using the expression's report of its cardinality may not be the value used by the scan.

      Also, oddly, the same expression could apply different selectivity in different scans. The code sorts selectivities in ascending order. So, in scan 1, filter f1 might apply its full selectivity, while in scan 2 (assuming a correlated filter), it might apply the square or cube of the selectivity.

      Solution

      The solution here is simply to:

      • Compute the selectivity for all predicates so we don't have to fudge.
      • Remove correlated predicates from join calcs (See IMPALA-XXXX).
      • Remove exponential back-off calculations so that filter selectivity can be properly decomposed for the above task.

      Attachments

        Activity

          People

            Unassigned Unassigned
            Paul.Rogers Paul Rogers
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: