IMPALA-8262

Join cardinality not decreased by join filter selectivity

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels: None

    Description

      Consider a subset of the plan for TPC-H query 7. (See tpch-all.test for details.)

      11:AGGREGATE [FINALIZE]
      |  output: sum(l_extendedprice * (1 - l_discount))
      |  group by: n1.n_name, n2.n_name, year(l_shipdate)
      |  row-size=58B cardinality=575.77K
      |
      10:HASH JOIN [INNER JOIN]
      |  hash predicates: c_nationkey = n2.n_nationkey
      |  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
      |  row-size=132B cardinality=575.77K
      |
      |--05:SCAN HDFS [tpch.nation n2]
      |     row-size=21B cardinality=25
      |
      09:HASH JOIN [INNER JOIN]
      |  hash predicates: s_nationkey = n1.n_nationkey
      |  row-size=111B cardinality=575.77K
      

      Here, join 09 feeds 576K rows into join 10, and all 576K rows pass along to aggregate 11. Notice, however, that join 10 has a predicate that picks out 2 of the 25 countries in each of two paths. The selectivity of that predicate should be something like 2 * 2/25 = 0.16. Thus, the output cardinality of join 10 should be about 576K * 0.16 = 92K.
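The arithmetic above can be reproduced with a short sketch. It uses only figures from the plan fragment and the rough 2 * 2/25 estimate from this description. The variable names are illustrative and do not come from the Impala code base.

```python
# Figures taken from the plan fragment above.
input_cardinality = 575_770   # rows produced by join 09 (575.77K)
nation_rows = 25              # cardinality of tpch.nation
nations_per_path = 2          # 'FRANCE' and 'GERMANY'
paths = 2                     # the two OR branches of the predicate

# Rough estimate used in this description: 2 * 2/25.
selectivity = paths * nations_per_path / nation_rows

# Expected output of join 10 if the predicate selectivity were applied.
estimated_output = round(input_cardinality * selectivity)

print(f"selectivity = {selectivity:.2f}")
print(f"estimated join 10 output = {estimated_output:,} rows")
```

This gives roughly 92K rows, versus the 575.77K the plan currently reports for join 10.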

      The problem is that the join cardinality calculations don't consider join filter selectivity.

      It may be that this was done to handle the outer join case, in which filters applied in the outer-side scan must be re-applied on the join. Omitting the filters avoids duplicate accounting for the selectivity.

      But, that case is special and should be handled specially as part of IMPALA-8213. Except for correlated filters, the planner should apply join filter selectivity to the join output cardinality calculations.
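As a minimal sketch of the proposed behavior, and not Impala's actual planner code, the function below scales a join's estimated cardinality by the combined selectivity of its "other predicates". The function name, its parameters, the independence assumption used to combine selectivities, and the `is_outer_join` switch for the special case are all hypothetical.

```python
from math import prod


def join_output_cardinality(join_cardinality, other_predicate_selectivities,
                            is_outer_join=False):
    """Hypothetical helper: apply join-filter selectivity to a join estimate.

    join_cardinality: estimate from the hash predicates alone.
    other_predicate_selectivities: one value in [0, 1] per non-equi predicate.
    is_outer_join: stand-in for the special case that this description
        defers to IMPALA-8213; selectivity is not applied there, to avoid
        counting the same filter twice.
    """
    if is_outer_join:
        return join_cardinality
    # Assumption: predicates are independent, so selectivities multiply.
    # prod() of an empty list is 1, which leaves the estimate unchanged.
    combined = prod(other_predicate_selectivities)
    # Keep at least one row so downstream estimates never collapse to zero.
    return max(1, round(join_cardinality * combined))


# Join 10 from the plan above, using the 0.16 estimate from this description.
print(join_output_cardinality(575_770, [0.16]))
```

With an estimate like this, join 10 would appear to the planner as a strongly reducing join rather than one that passes every row through.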

      This error has consequences. The filter should reduce the number of rows through the join. Because it does so, the join should come early in the join tree to reduce the set of rows processed. But, because selectivity is ignored, the planner does not see the join as a filter, and ends up putting join 10 at the top of the join tree. (See the test file for the full plan.) The result is that Impala schleps around many more rows than necessary, only to discard them near the top of the DAG.

          People

            Assignee: Unassigned
            Reporter: Paul Rogers
            Votes: 0
            Watchers: 2
