Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4694

Cardinality estimation of joins is based on NDV and row count of base tables rather than sub-expression

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:

      Description

      With query below the planner checks the NDV and row count for catalog_returns and decides that the join between store_sales and catalog_returns is a many to many join and estimates that the join will produce 17 Billion rows.
      The join column cr_item_sk is actually unique from the subquery since there is a group by on cr_item_sk and the cardinality estimation algorithm used for the join should be a one to many.

      WITH cs_ui 
           AS (SELECT cr_item_sk 
               FROM   catalog_returns 
               GROUP  BY cr_item_sk) 
      SELECT Count(*) cnt 
      FROM   store_sales, 
             cs_ui 
      WHERE  ss_item_sk = cs_ui.cr_item_sk 
      
      +--------------------------------------------------------------+
      | Explain String                                               |
      +--------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=101.07MB VCores=2    |
      |                                                              |
      | PLAN-ROOT SINK                                               |
      | |                                                            |
      | 09:AGGREGATE [FINALIZE]                                      |
      | |  output: count:merge(*)                                    |
      | |  hosts=20 per-host-mem=unavailable                         |
      | |  tuple-ids=4 row-size=8B cardinality=1                     |
      | |                                                            |
      | 08:EXCHANGE [UNPARTITIONED]                                  |
      | |  hosts=20 per-host-mem=unavailable                         |
      | |  tuple-ids=4 row-size=8B cardinality=1                     |
      | |                                                            |
      | 04:AGGREGATE                                                 |
      | |  output: count(*)                                          |
      | |  hosts=20 per-host-mem=10.00MB                             |
      | |  tuple-ids=4 row-size=8B cardinality=1                     |
      | |                                                            |
      | 03:HASH JOIN [INNER JOIN, BROADCAST]                         |
      | |  hash predicates: ss_item_sk = cr_item_sk                  |
      | |  runtime filters: RF000 <- cr_item_sk                      |
      | |  hosts=20 per-host-mem=3.07MB                              |
      | |  tuple-ids=0,2 row-size=16B cardinality=17,966,906,063     |
      | |                                                            |
      | |--07:EXCHANGE [BROADCAST]                                   |
      | |  |  hosts=20 per-host-mem=0B                               |
      | |  |  tuple-ids=2 row-size=8B cardinality=365989             |
      | |  |                                                         |
      | |  06:AGGREGATE [FINALIZE]                                   |
      | |  |  group by: cr_item_sk                                   |
      | |  |  hosts=20 per-host-mem=10.00MB                          |
      | |  |  tuple-ids=2 row-size=8B cardinality=365989             |
      | |  |                                                         |
      | |  05:EXCHANGE [HASH(cr_item_sk)]                            |
      | |  |  hosts=20 per-host-mem=0B                               |
      | |  |  tuple-ids=2 row-size=8B cardinality=365989             |
      | |  |                                                         |
      | |  02:AGGREGATE [STREAMING]                                  |
      | |  |  group by: cr_item_sk                                   |
      | |  |  hosts=20 per-host-mem=10.00MB                          |
      | |  |  tuple-ids=2 row-size=8B cardinality=365989             |
      | |  |                                                         |
      | |  01:SCAN HDFS [tpcds_3000_parquet.catalog_returns, RANDOM] |
      | |     partitions=2104/2104 files=2104 size=33.03GB           |
      | |     table stats: 432018033 rows total                      |
      | |     column stats: all                                      |
      | |     hosts=20 per-host-mem=32.00MB                          |
      | |     tuple-ids=1 row-size=8B cardinality=432,018,033        |
      | |                                                            |
      | 00:SCAN HDFS [tpcds_3000_parquet.store_sales, RANDOM]        |
      |    partitions=1824/1824 files=2649 size=376.48GB             |
      |    runtime filters: RF000 -> ss_item_sk                      |
      |    table stats: 8639936081 rows total                        |
      |    column stats: all                                         |
      |    hosts=20 per-host-mem=88.00MB                             |
      |    tuple-ids=0 row-size=8B cardinality=8,639,936,081         |
      +--------------------------------------------------------------+
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                mmokhtar Mostafa Mokhtar
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: