Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10769

Bad join order in TPC-DS Q78 caused by bad cardinality estimate in aggregation

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 4.0.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:
    • Epic Color:
      ghx-label-12

      Description

      The plan for TPC-DS Q78 uses a bad join order with the largest input on the RHS, caused by a bad underestimate of the cardinality of the aggregation node in that join input.

      Summary below (full profile including query and plan attached):

      F15:ROOT                            1      1    0.000ns    0.000ns                         4.01 MB        4.00 MB
      38:MERGING-EXCHANGE                 1      1    0.000ns    0.000ns      100         100    2.81 MB        1.15 MB  UNPARTITIONED
      F04:EXCHANGE SENDER                10    120   33.333us    3.999ms                         1.31 KB              0
      20:TOP-N                           10    120   32.900ms  264.000ms   12.00K         100   52.00 KB        9.38 KB
      19:HASH JOIN                       10    120    1s694ms    2s044ms  312.51K       8.80M   55.12 KB              0  RIGHT OUTER JOIN, PARTITIONED
      |--F16:JOIN BUILD                  10    120    3s735ms    5s052ms                         1.20 GB       17.00 MB
      |  37:EXCHANGE                     10    120  634.767ms    1s004ms    1.44B       8.80M   15.80 MB       25.00 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
      |  F09:EXCHANGE SENDER             10    120    4s736ms    5s179ms                       279.25 KB              0
      |  18:HASH JOIN                    10    120  979.068ms    1s140ms    1.44B       8.80M   43.12 KB              0  RIGHT OUTER JOIN, PARTITIONED
      |  |--F17:JOIN BUILD               10    120    5s370ms    6s560ms                         1.19 GB        8.50 MB
      |  |  36:EXCHANGE                  10    120  390.834ms  640.000ms    1.44B       8.80M    7.20 MB       17.50 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
      |  |  F14:EXCHANGE SENDER          10    120    3s330ms    4s148ms                       279.25 KB              0
      |  |  35:AGGREGATE                 10    120    6s362ms    9s508ms    1.44B       8.80M  960.04 MB        1.02 GB  FINALIZE
      |  |  34:EXCHANGE                  10    120  517.901ms  788.005ms    1.46B       1.72B   14.31 MB       17.50 MB  HASH(d_year,ss_item_sk,ss_customer_sk)
      |  |  F12:EXCHANGE SENDER          10    120    8s196ms    9s168ms                         3.53 MB              0
      |  |  05:AGGREGATE                 10    120   17s890ms   22s904ms    1.46B       1.72B  992.05 MB      780.95 MB  STREAMING
      |  |  04:HASH JOIN                 10    120  512.300ms  672.001ms    1.48B       1.72B   37.12 KB              0  INNER JOIN, BROADCAST
      |  |  |--F18:JOIN BUILD            10     10   32.400ms   72.000ms                        23.27 MB       23.25 MB
      |  |  |  33:EXCHANGE               10     10    0.000ns    0.000ns      365         373   16.00 KB       16.00 KB  BROADCAST
      |  |  |  F13:EXCHANGE SENDER        1      1    0.000ns    0.000ns                        98.44 KB              0
      |  |  |  02:SCAN S3                 1      1    8.000ms    8.000ms      365         373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
      |  |  03:HASH JOIN                 10    120    3s542ms    4s200ms    1.48B       8.64B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
      |  |  |--F19:JOIN BUILD            10    120    1s178ms    1s432ms                       384.03 MB      382.93 MB
      |  |  |  32:EXCHANGE               10    120  138.600ms  200.000ms  863.99M     863.99M    4.35 MB       12.34 MB  HASH(sr_item_sk,sr_ticket_number)
      |  |  |  F11:EXCHANGE SENDER       10    120    1s982ms    2s628ms                         4.16 MB              0
      |  |  |  01:SCAN S3                10    120  385.534ms  628.000ms  863.99M     863.99M   48.97 MB       40.00 MB  tpcds_3000_decimal_parquet.store_returns
      |  |  31:EXCHANGE                  10    120  787.434ms    2s316ms    1.64B       8.64B   14.32 MB       15.62 MB  HASH(ss_item_sk,ss_ticket_number)
      |  |  F10:EXCHANGE SENDER          10    120    7s394ms   11s732ms                         3.61 MB              0
      |  |  00:SCAN S3                   10    120    2s575ms    4s356ms    1.64B       8.64B   98.15 MB       88.00 MB  tpcds_3000_decimal_parquet.store_sales
      |  30:AGGREGATE                    10    120    8s492ms   15s428ms  386.62M     430.99M  324.04 MB      260.31 MB  FINALIZE
      |  29:EXCHANGE                     10    120  257.634ms    1s024ms  386.62M     430.99M   14.32 MB       17.50 MB  HASH(d_year,ws_item_sk,ws_bill_customer_sk)
      |  F07:EXCHANGE SENDER             10    120    4s837ms   10s996ms                         3.53 MB              0
      |  11:AGGREGATE                    10    120    3s005ms    4s536ms  386.62M     430.99M   82.05 MB      195.24 MB  STREAMING
      |  10:HASH JOIN                    10    120  158.866ms  252.000ms  386.62M     430.99M   37.12 KB              0  INNER JOIN, BROADCAST
      |  |--F20:JOIN BUILD               10     10   93.600ms  168.000ms                        23.27 MB       23.25 MB
      |  |  28:EXCHANGE                  10     10    0.000ns    0.000ns      365         373   16.00 KB       16.00 KB  BROADCAST
      |  |  F08:EXCHANGE SENDER           1      1    0.000ns    0.000ns                        98.44 KB              0
      |  |  08:SCAN S3                    1      1   10s172ms   10s172ms      365         373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
      |  09:HASH JOIN                    10    120  880.001ms    1s096ms  386.63M       2.16B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
      |  |--F21:JOIN BUILD               10    120  632.367ms    1s383ms                        96.03 MB       95.73 MB
      |  |  27:EXCHANGE                  10    120   34.266ms  108.000ms  216.00M     216.00M    1.09 MB       12.34 MB  HASH(wr_item_sk,wr_order_number)
      |  |  F06:EXCHANGE SENDER          10    120    1s331ms    2s823ms                         4.16 MB              0
      |  |  07:SCAN S3                   10    120   11s708ms   12s496ms  216.00M     216.00M   48.92 MB       24.00 MB  tpcds_3000_decimal_parquet.web_returns
      |  26:EXCHANGE                     10    120  348.834ms    2s176ms  429.58M       2.16B   14.38 MB       15.62 MB  HASH(ws_item_sk,ws_order_number)
      |  F05:EXCHANGE SENDER             10    120    2s616ms    6s356ms                         3.61 MB              0
      |  06:SCAN S3                      10    120   12s429ms   16s108ms  429.58M       2.16B   47.53 MB       88.00 MB  tpcds_3000_decimal_parquet.web_sales
      25:AGGREGATE                       10    120   10s950ms   16s904ms  767.74M     855.91M  648.04 MB      516.96 MB  FINALIZE
      24:EXCHANGE                        10    120  445.234ms    1s184ms  769.46M     855.91M   14.32 MB       17.50 MB  HASH(d_year,cs_item_sk,cs_bill_customer_sk)
      F02:EXCHANGE SENDER                10    120    5s297ms    8s900ms                         3.53 MB              0
      17:AGGREGATE                       10    120    5s660ms    8s160ms  769.46M     855.91M  128.05 MB      387.72 MB  STREAMING
      16:HASH JOIN                       10    120  319.367ms  460.000ms  769.49M     855.91M   37.12 KB              0  INNER JOIN, BROADCAST
      |--F22:JOIN BUILD                  10     10   67.600ms  179.999ms                        23.27 MB       23.25 MB
      |  23:EXCHANGE                     10     10    0.000ns    0.000ns      365         373   16.00 KB       16.00 KB  BROADCAST
      |  F03:EXCHANGE SENDER              1      1    0.000ns    0.000ns                        98.44 KB              0
      |  14:SCAN S3                       1      1   10s244ms   10s244ms      365         373    1.62 MB       16.00 MB  tpcds_3000_decimal_parquet.date_dim
      15:HASH JOIN                       10    120    1s920ms    2s264ms  769.50M       4.32B   42.12 KB              0  LEFT OUTER JOIN, PARTITIONED
      |--F23:JOIN BUILD                  10    120    1s448ms    1s832ms                       192.03 MB      191.47 MB
      |  22:EXCHANGE                     10    120   64.933ms  136.000ms  432.02M     432.02M    6.37 MB       12.34 MB  HASH(cr_item_sk,cr_order_number)
      |  F01:EXCHANGE SENDER             10    120    1s830ms    2s568ms                         4.16 MB              0
      |  13:SCAN S3                      10    120   11s753ms   12s600ms  432.02M     432.02M    8.92 MB       32.00 MB  tpcds_3000_decimal_parquet.catalog_returns
      21:EXCHANGE                        10    120  592.967ms    1s720ms  855.00M       4.32B   14.28 MB       15.62 MB  HASH(cs_item_sk,cs_order_number)
      F00:EXCHANGE SENDER                10    120    4s149ms    5s608ms                         3.61 MB              0
      12:SCAN S3                         10    120   13s398ms   18s564ms  855.00M       4.32B   57.61 MB       88.00 MB  tpcds_3000_decimal_parquet.catalog_sales
      

      The problematic aggregation is plan node 35 which has an estimated output of 8.8M rows and an actual output of 1.44B.

      AggregateNode.java computes the cardinality by multiplying the NDV of the grouping expressions, then capping this value at the number of input rows and then finally applying the conjuncts. The conjunct in this case is the filter d_year=2001 which has selectivity 1/196=0.0051. The fact that we first cap the cardinality at the input size and then apply a very selective predicate to the already capped value results in a bad underestimate in this case.

      It might be a better approach to apply the conjunct selectivity to the NDV product first and then cap the result of that based on the number of input rows, to reduce the risk of underestimation like we see in this case. Earlier versions of the code did did the operations in this order (apply conjuncts first, then cap based on input rows) but the order of these operations was reversed by this commit. It's not clear whether that commit intended to change the order of the conjunct application and the cap or if that was just an unintended side effect of code refactoring (maybe Tim Armstrong can clarify?).

      I ran the same query with a straight_join hint to force the correct join order and got a 40% reduction in execution time. I also ran the query (unmodified) on an older version of the code prior to the commit that reversed the order of the conjunct application/cap and the resulting plan has a much better cardinality (see plan node 25) and correct join order (note this is a different cluster, larger scale factor, no mt_dop so don't directly compare the numbers to the more recent profile):

      38:MERGING-EXCHANGE        1  771.965us  771.965us      100         100  456.00 KB              0  UNPARTITIONED
      20:TOP-N                  19   50.204ms   71.974ms    1.90K         100   52.00 KB        9.38 KB
      19:HASH JOIN              19       1m4s      1m33s    1.26M       5.75B   11.66 GB        9.23 GB  LEFT OUTER JOIN, PARTITIONED
      |--37:EXCHANGE            19    3s219ms    3s969ms    2.56B       2.85B    7.43 MB              0  HASH(d_year,cs_item_sk,cs_bill_customer_sk)
      |  36:AGGREGATE           19      1m22s      1m31s    2.56B       2.85B   11.66 GB      175.36 GB  FINALIZE
      |  35:EXCHANGE            19    3s347ms    3s845ms    2.56B       2.85B   13.53 MB              0  HASH(d_year,cs_item_sk,cs_bill_customer_sk)
      |  17:AGGREGATE           19   46s016ms   53s791ms    2.56B       2.85B  992.06 MB      175.36 GB  STREAMING
      |  16:HASH JOIN           19    3s512ms    3s836ms    2.56B       2.85B    1.99 MB        1.94 MB  INNER JOIN, BROADCAST
      |  |--34:EXCHANGE         19   14.477us   24.595us      365         373   18.97 KB              0  BROADCAST
      |  |  14:SCAN HDFS         1   51.493ms   51.493ms      365         373    1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
      |  15:HASH JOIN           19   53s103ms       1m8s    2.57B      14.40B    3.22 GB        1.24 GB  LEFT OUTER JOIN, PARTITIONED
      |  |--33:EXCHANGE         19    1s193ms    1s712ms    1.44B       1.44B   10.49 MB              0  HASH(cr_item_sk,cr_order_number)
      |  |  13:SCAN HDFS        19  325.397ms  359.082ms    1.44B       1.44B  127.32 MB      144.00 MB  tpcds_10000_decimal_parquet.catalog_returns
      |  32:EXCHANGE            19    3s019ms    3s927ms    2.85B      14.40B   13.22 MB              0  HASH(cs_item_sk,cs_order_number)
      |  12:SCAN HDFS           19    2s021ms    3s140ms    2.85B      14.40B  355.82 MB      528.00 MB  tpcds_10000_decimal_parquet.catalog_sales
      18:HASH JOIN              19   40s974ms   55s991ms    4.30B       5.75B    5.88 GB        4.65 GB  LEFT OUTER JOIN, PARTITIONED
      |--31:EXCHANGE            19    1s731ms    1s864ms    1.29B       1.44B    7.16 MB              0  HASH(d_year,ws_item_sk,ws_bill_customer_sk)
      |  30:AGGREGATE           19   41s146ms   49s094ms    1.29B       1.44B    5.88 GB       88.31 GB  FINALIZE
      |  29:EXCHANGE            19    1s743ms    1s989ms    1.29B       1.44B   13.53 MB              0  HASH(d_year,ws_item_sk,ws_bill_customer_sk)
      |  11:AGGREGATE           19   18s910ms   20s163ms    1.29B       1.44B  128.06 MB       88.31 GB  STREAMING
      |  10:HASH JOIN           19    1s742ms    1s801ms    1.29B       1.44B    1.99 MB        1.94 MB  INNER JOIN, BROADCAST
      |  |--28:EXCHANGE         19   15.043us   24.887us      365         373   18.97 KB              0  BROADCAST
      |  |  08:SCAN HDFS         1   53.015ms   53.015ms      365         373    1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
      |  09:HASH JOIN           19   22s658ms   26s425ms    1.29B       7.20B    1.63 GB      636.07 MB  LEFT OUTER JOIN, PARTITIONED
      |  |--27:EXCHANGE         19  597.705ms  796.106ms  720.02M     720.02M   10.50 MB              0  HASH(wr_item_sk,wr_order_number)
      |  |  07:SCAN HDFS        19  185.858ms  315.385ms  720.02M     720.02M   98.54 MB       80.00 MB  tpcds_10000_decimal_parquet.web_returns
      |  26:EXCHANGE            19    1s530ms    2s739ms    1.43B       7.20B   13.23 MB              0  HASH(ws_item_sk,ws_order_number)
      |  06:SCAN HDFS           19    1s589ms    2s720ms    1.43B       7.20B  369.88 MB      528.00 MB  tpcds_10000_decimal_parquet.web_sales
      25:AGGREGATE              19      1m35s      1m51s    4.30B       5.75B   21.04 GB      353.23 GB  FINALIZE
      24:EXCHANGE               19    4s893ms    5s539ms    4.31B       5.75B   13.53 MB              0  HASH(d_year,ss_item_sk,ss_customer_sk)
      05:AGGREGATE              19      2m40s       3m8s    4.31B       5.75B   20.91 GB      353.23 GB  STREAMING
      04:HASH JOIN              19    6s349ms    7s056ms    4.92B       5.75B    1.99 MB        1.94 MB  INNER JOIN, BROADCAST
      |--23:EXCHANGE            19   15.995us   38.006us      365         373   18.97 KB              0  BROADCAST
      |  02:SCAN HDFS            1   49.584ms   49.584ms      365         373    1.13 MB       32.00 MB  tpcds_10000_decimal_parquet.date_dim
      03:HASH JOIN              19      1m46s      2m20s    4.92B      28.80B    6.41 GB        2.48 GB  LEFT OUTER JOIN, PARTITIONED
      |--22:EXCHANGE            19    2s227ms    3s111ms    2.88B       2.88B   10.49 MB              0  HASH(sr_item_sk,sr_ticket_number)
      |  01:SCAN HDFS           19  611.931ms  686.818ms    2.88B       2.88B  205.33 MB      176.00 MB  tpcds_10000_decimal_parquet.store_returns
      21:EXCHANGE               19    5s953ms    7s325ms    5.47B      28.80B   13.40 MB              0  HASH(ss_item_sk,ss_ticket_number)
      00:SCAN HDFS              19    2s520ms    3s222ms    5.47B      28.80B  492.66 MB      528.00 MB  tpcds_10000_decimal_parquet.store_sales
      

        Attachments

        1. q78_partial_profile.txt
          48 kB
          David Rorke

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                drorke David Rorke
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated: