Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3167

Incorrect assignment of WHERE clause predicate through a grouping aggregation + outer join.

    XMLWordPrintableJSON

Details

    Description

      Thanks to sobrien05 from the user list for reporting this!
      Original post:
      http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Predict-push-down-bug/m-p/38516

      Repro tables and data:

      create table orders(order_id int, address_id int);
      insert into table orders values((1, 10), (2, 20), (3, 30));
      create table addresses(address_id int, state string);
      insert into table addresses values((10, "CA"), (20, "NY"), (30, "CA"));
      

      Repro query and bad plan:

      Select *
      FROM (
          Select
            shipping_addr.state,
            count(order_id) as num_orders
          FROM
          (
              Select
                order_id,
                address_id
              FROM orders
          ) order_data
          LEFT OUTER JOIN [shuffle]
          (
              Select address_id,
                    state
              FROM addresses
          ) shipping_addr
          ON order_data.address_id = shipping_addr.address_id
          GROUP BY
            shipping_addr.state
      ) unf
      WHERE
         lower(state) = lower('NY')
      
      +------------------------------------------------------------------------------------+
      | Explain String                                                                     |
      +------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=2.16GB VCores=2                            |
      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | debug.addresses, debug.orders                                                      |
      |                                                                                    |
      | 08:EXCHANGE [UNPARTITIONED]                                                        |
      | |                                                                                  |
      | 07:AGGREGATE [FINALIZE]                                                            |
      | |  output: count:merge(order_id)                                                   |
      | |  group by: shipping_addr.state                                                   |
      | |                                                                                  |
      | 06:EXCHANGE [HASH(shipping_addr.state)]                                            |
      | |                                                                                  |
      | 03:AGGREGATE                                                                       |
      | |  output: count(order_id)                                                         |
      | |  group by: state                                                                 |
      | |                                                                                  |
      | 02:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]                                        |
      | |  hash predicates: address_id = address_id                                        |
      | |                                                                                  |
      | |--05:EXCHANGE [HASH(address_id)]                                                  |
      | |  |                                                                               |
      | |  01:SCAN HDFS [debug.addresses]                                                  |
      | |     partitions=1/1 files=1 size=18B                                              |
      | |     predicates: lower(debug.addresses.state) = lower('NY')       <-- should also be assigned at node 02                |
      | |                                                                                  |
      | 04:EXCHANGE [HASH(address_id)]                                                     |
      | |                                                                                  |
      | 00:SCAN HDFS [debug.orders]                                                        |
      |    partitions=1/1 files=1 size=15B                                                 |
      +------------------------------------------------------------------------------------+
      

      Workaround
      Use an INNER JOIN instead of a LEFT OUTER JOIN. The query still has the same meaning.

      Attachments

        Activity

          People

            alex.behm Alexander Behm
            alex.behm Alexander Behm
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: