Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2796

Suboptimal join ordering due to greedy candidate selection.

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • Impala 2.2, Impala 2.3.0
    • None
    • Frontend

    Description

      Consider the join order that we generate for TPCH-Q8 after fixing IMPALA-976:

      hash join *region* on n1.n_regionkey = r_regionkey
      hash join nation n1 on c_nationkey = n1.n_nationkey
      hash join customer o_custkey = c_custkey
      hash join nation n2 s_nationkey = n2.c_nationkey
      hash join supplier on l_suppkey = s_suppkey
      hash join *order* l_orderkey = o_orderkey
      hash join *part* l_partkey = p_partkey
      lineitem
      

      In the pseudo-plan above, the tables in bold have selective predicates applied to the scans.
      Contrast the plan above with the following optimal join order:

      hash join nation n2 s_nationkey = n2.c_nationkey
      hash join supplier on l_suppkey = s_suppkey
      hash join *region* on n1.n_regionkey = r_regionkey
      hash join nation n1 on c_nationkey = n1.n_nationkey
      hash join customer o_custkey = c_custkey
      hash join *order* l_orderkey = o_orderkey
      hash join *part* l_partkey = p_partkey
      lineitem
      

      This plan is better because the number of intermediate results are reduced by executing the join on region first. The difference between the two plans is that the following series of joins are "swapped":

      This series of joins leads up to a selective join with region, and should come before the block with supplier and n2.

      hash join *region* on n1.n_regionkey = r_regionkey
      hash join nation n1 on c_nationkey = n1.n_nationkey
      hash join customer o_custkey = c_custkey
      

      These series of joins are not selective and should come last.

      hash join nation n2 s_nationkey = n2.c_nationkey
      hash join supplier on l_suppkey = s_suppkey
      

      Our current join-ordering algorithm is not able to produce the optimal plan because it greedily adds one join at a time. After it has constructed the partial plan that joins lineitem,part,order, the algorithm considers customer and supplier as candidates for the next join (only these tables are candidates due to the applicable join predicates). Since the resulting join cardinality for customer and supplier is estimated to be equal, we "arbitrarily" pick one and continue.

      We should improve our join ordering algorithm to generate the optimal join order for TPCH-Q8.

      Attachments

        Activity

          People

            Unassigned Unassigned
            alex.behm Alexander Behm
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: