Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-8040

Incorrect double-accounting for WHERE on join column

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 3.1.0
    • None
    • Frontend
    • None
    • ghx-label-4

    Description

      When a column appears as both a WHERE clause entry and a join condition, the planner incorrectly reduces estimated join cardinality twice for the same column.

      To see this, start with the simple case, join and WHERE on different columns.

      select m.c_custkey, d.o_custkey
      from tpch.customer m,
           tpch.orders d
      where m.c_custkey = d.o_custkey
        and m.c_name = 'foo'
      

      Here, c_name is unique, so only one row matches the filter. There are, on average, 10 rows per customer, so joining the one customer with its orders should return 10 rows.

      Plan:

      02:HASH JOIN [INNER JOIN]
      |  hash predicates: d.o_custkey = m.c_custkey
      |  fk/pk conjuncts: d.o_custkey = m.c_custkey
      |  tuple-ids=1,0 row-size=46B cardinality=16
      |
      |--00:SCAN HDFS [tpch.customer m]
      |     partitions=1/1 files=1 size=23.08MB
      |     predicates: m.c_name = 'foo'
      |     tuple-ids=0 row-size=38B cardinality=1
      |
      01:SCAN HDFS [tpch.orders d]
         partitions=1/1 files=1 size=162.56MB
         runtime filters: RF000[bloom] -> d.o_custkey
         tuple-ids=1 row-size=8B cardinality=1500000
      

      Notice the join cardinality of 16. (Should be 10, see IMPALA-8014).

      Now, do basically the same thing, but select that one customer base on ID, not name.

      select m.c_custkey, d.o_custkey
      from tpch.customer m,
           tpch.orders d
      where m.c_custkey = d.o_custkey
        and m.c_custkey = 10
      

      Logically, we still get one customer and that customer still joins with its 10 orders. But this time, we get the following plan:

      02:HASH JOIN [INNER JOIN]
      |  hash predicates: d.o_custkey = m.c_custkey
      |  fk/pk conjuncts: d.o_custkey = m.c_custkey
      |  tuple-ids=1,0 row-size=16B cardinality=1
      |
      |--00:SCAN HDFS [tpch.customer m]
      |     partitions=1/1 files=1 size=23.08MB
      |     predicates: m.c_custkey = CAST(10 AS BIGINT)
      |     tuple-ids=0 row-size=8B cardinality=1
      |
      01:SCAN HDFS [tpch.orders d]
         partitions=1/1 files=1 size=162.56MB
         predicates: d.o_custkey = CAST(10 AS BIGINT)
         tuple-ids=1 row-size=8B cardinality=15
      

      Notice that, in the orders scan, the planner has applied the m.c_custkey = 10 to reduce the scan count, which is good. (Though, the cardinality estimate is wrong, it should be 10 for TPC-H.)

      Then, we use the normal join calcs to assume that those 15 rows are matched against the set of 150K customers. Since there is a very low probability of a match, we guess about 1 row.

      The problem is, we are using the same column twice. If we already reduced the orders scan by the one ID we are looking for, we can't then assume, in the join, that we have the full range of IDs.

      The correct join cardinality would still be 10 if we handled this case correctly.

      Attachments

        Activity

          People

            Unassigned Unassigned
            Paul.Rogers Paul Rogers
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: