XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 2.8.0
    • Impala 2.10.0
    • Frontend

    Description

      When working on IMPALA-4174 realized that the planner doesn't handle cardinality estimation for multi column joins.

      Another variation of the same problem is when the primary key is consistent of multiple columns, for TPC-DS store_sales the primary key is (ss_ticket_number, ss_item_sk) and for store_returns the primary key is (sr_ticket_number, sr_item_sk), so the join condition below is a PK/PK join and the current algorithm doesn't handle that case and the estimate for the join uses the many to many calculation which is incorrect

      SELECT count(*) 
      FROM   store_sales, 
             store_returns 
      WHERE  ss_customer_sk = sr_customer_sk 
             AND ss_item_sk = sr_item_sk 
             AND ss_ticket_number = sr_ticket_number 
      
      +-------------------------------------------------------------------------------------------------------------------+
      | Explain String                                                                                                    |
      +-------------------------------------------------------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=1.15GB VCores=2                                                           |
      |                                                                                                                   |
      | PLAN-ROOT SINK                                                                                                    |
      | |                                                                                                                 |
      | 07:AGGREGATE [FINALIZE]                                                                                           |
      | |  output: count:merge(*)                                                                                         |
      | |  hosts=20 per-host-mem=unavailable                                                                              |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 06:EXCHANGE [UNPARTITIONED]                                                                                       |
      | |  hosts=20 per-host-mem=unavailable                                                                              |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 03:AGGREGATE                                                                                                      |
      | |  output: count(*)                                                                                               |
      | |  hosts=20 per-host-mem=10.00MB                                                                                  |
      | |  tuple-ids=2 row-size=8B cardinality=1                                                                          |
      | |                                                                                                                 |
      | 02:HASH JOIN [INNER JOIN, PARTITIONED]                                                                            |
      | |  hash predicates: ss_customer_sk = sr_customer_sk, ss_item_sk = sr_item_sk, ss_ticket_number = sr_ticket_number |
      | |  runtime filters: RF000 <- sr_customer_sk, RF001 <- sr_item_sk, RF002 <- sr_ticket_number                       |
      | |  hosts=20 per-host-mem=906.36MB                                                                                 |
      | |  tuple-ids=0,1 row-size=40B cardinality=15477599674736                                                          |
      | |                                                                                                                 |
      | |--05:EXCHANGE [HASH(sr_customer_sk,sr_item_sk,sr_ticket_number)]                                                 |
      | |  |  hosts=20 per-host-mem=0B                                                                                    |
      | |  |  tuple-ids=1 row-size=20B cardinality=863989652                                                              |
      | |  |                                                                                                              |
      | |  01:SCAN HDFS [tpcds_3000_parquet.store_returns, RANDOM]                                                        |
      | |     partitions=2004/2004 files=2008 size=48.92GB                                                                |
      | |     table stats: 863989652 rows total                                                                           |
      | |     column stats: all                                                                                           |
      | |     hosts=20 per-host-mem=120.00MB                                                                              |
      | |     tuple-ids=1 row-size=20B cardinality=863989652                                                              |
      | |                                                                                                                 |
      | 04:EXCHANGE [HASH(ss_customer_sk,ss_item_sk,ss_ticket_number)]                                                    |
      | |  hosts=20 per-host-mem=0B                                                                                       |
      | |  tuple-ids=0 row-size=20B cardinality=8639936081                                                                |
      | |                                                                                                                 |
      | 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM]                                                             |
      |    partitions=1824/1824 files=2649 size=376.48GB                                                                  |
      |    runtime filters: RF000 -> ss_customer_sk, RF001 -> ss_item_sk, RF002 -> ss_ticket_number                       |
      |    table stats: 8639936081 rows total                                                                             |
      |    column stats: all                                                                                              |
      |    hosts=20 per-host-mem=264.00MB                                                                                 |
      |    tuple-ids=0 row-size=20B cardinality=8639936081                                                                |
      +-------------------------------------------------------------------------------------------------------------------+
      

      Attachments

        Issue Links

          Activity

            People

              alex.behm Alexander Behm
              mmokhtar Mostafa Mokhtar
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: