Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.10.0
    • Component/s: Frontend
    • Labels:

      Description

      When working on IMPALA-4174 realized that the planner doesn't handle cardinality estimation for multi column joins.

      Another variation of the same problem is when the primary key is consistent of multiple columns, for TPC-DS store_sales the primary key is (ss_ticket_number, ss_item_sk) and for store_returns the primary key is (sr_ticket_number, sr_item_sk), so the join condition below is a PK/PK join and the current algorithm doesn't handle that case and the estimate for the join uses the many to many calculation which is incorrect

      SELECT count(*) 
      FROM   store_sales, 
             store_returns 
      WHERE  ss_customer_sk = sr_customer_sk 
             AND ss_item_sk = sr_item_sk 
             AND ss_ticket_number = sr_ticket_number 
      
      +-------------------------------------------------------------------------------------------------------------------+
      | Explain String                                                                                                    |
      +-------------------------------------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=1.15GB VCores=2                                                           |
      |                                                                                                                   |
      | PLAN-ROOT SINK                                                                                                    |
      | |                                                                                                                 |
      | 07:AGGREGATE [FINALIZE]                                                                                           |
      | |  output: count:merge(*)                                                                                         |
      | |  hosts=20 per-host-mem=unavailable                                                                              |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 06:EXCHANGE [UNPARTITIONED]                                                                                       |
      | |  hosts=20 per-host-mem=unavailable                                                                              |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 03:AGGREGATE                                                                                                      |
      | |  output: count(*)                                                                                               |
      | |  hosts=20 per-host-mem=10.00MB                                                                                  |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 02:HASH JOIN [INNER JOIN, PARTITIONED]                                                                            |
      | |  hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk, ss_ticket_number = sr_ticket_number |
      | |  runtime filters: RF000 <- sr_customer_sk, RF001 <- sr_item_sk, RF002 <- sr_ticket_number                       |
      | |  hosts=20 per-host-mem=906.36MB                                                                                 |
      | |  tuple-ids=0,1 row-size=40B cardinality=15477599674736                                                          |
      | |                                                                                                                 |
      | |--05:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)]                                                 |
      | |  |  hosts=20 per-host-mem=0B                                                                                    |
      | |  |  tuple-ids=1 row-size=20B cardinality=863989652                                                              |
      | |  |                                                                                                              |
      | |  01:SCAN HDFS [tpcds_3000_parquet.store_returns, RANDOM]                                                        |
      | |     partitions=2004/2004 files=2008 size=48.92GB                                                                |
      | |     table stats: 863989652 rows total                                                                           |
      | |     column stats: all                                                                                           |
      | |     hosts=20 per-host-mem=120.00MB                                                                              |
      | |     tuple-ids=1 row-size=20B cardinality=863989652                                                              |
      | |                                                                                                                 |
      | 04:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)]                                                    |
      | |  hosts=20 per-host-mem=0B                                                                                       |
      | |  tuple-ids=0 row-size=20B cardinality=8639936081                                                                |
      | |                                                                                                                 |
      | 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM]                                                             |
      |    partitions=1824/1824 files=2649 size=376.48GB                                                                  |
      |    runtime filters: RF000 -> ss_customer_sk, RF001 -> ss_item_sk, RF002 -> ss_ticket_number                       |
      |    table stats: 8639936081 rows total                                                                             |
      |    column stats: all                                                                                              |
      |    hosts=20 per-host-mem=264.00MB                                                                                 |
      |    tuple-ids=0 row-size=20B cardinality=8639936081                                                                |
      +-------------------------------------------------------------------------------------------------------------------+
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                alex.behm Alexander Behm
                Reporter:
                mmokhtar Mostafa Mokhtar
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: