IMPALA-5851

Estimated number of rows for sum_init_zero scans should be the number of files, not the table cardinality

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: Impala 4.3.0
    • Component/s: Frontend
    • Labels: None

    Description

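      IMPALA-5036 introduced an optimization that uses the data stored in the Parquet RowGroup.num_rows field to answer count queries.
      The estimated cardinality for the scan is still the number of rows in the base table, as opposed to the number of files or row groups.
      In the plan below, the scan node is estimated at cardinality=179999978268, while the summary shows it actually returned 28.98K rows, in line with files=28976.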

      +-------------------------------------------------------------------------------+
      | Explain String                                                                |
      +-------------------------------------------------------------------------------+
      | Max Per-Host Resource Reservation: Memory=0B                                  |
      | Per-Host Resource Estimates: Memory=108.00MB                                  |
      |                                                                               |
      | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         |
      | |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B                |
      | PLAN-ROOT SINK                                                                |
      | |  mem-estimate=0B mem-reservation=0B                                         |
      | |                                                                             |
      | 03:AGGREGATE [FINALIZE]                                                       |
      | |  output: count:merge(*)                                                     |
      | |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
      | |  tuple-ids=1 row-size=8B cardinality=1                                      |
      | |                                                                             |
      | 02:EXCHANGE [UNPARTITIONED]                                                   |
      | |  mem-estimate=0B mem-reservation=0B                                         |
      | |  tuple-ids=1 row-size=8B cardinality=1                                      |
      | |                                                                             |
      | F00:PLAN FRAGMENT [RANDOM] hosts=130 instances=130                            |
      | Per-Host Resources: mem-estimate=98.00MB mem-reservation=0B                   |
      | 01:AGGREGATE                                                                  |
      | |  output: sum_init_zero(tpch_30000_parquet.lineitem.parquet-stats: num_rows) |
      | |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
      | |  tuple-ids=1 row-size=8B cardinality=1                                      |
      | |                                                                             |
      | 00:SCAN HDFS [tpch_30000_parquet.lineitem, RANDOM]                            |
      |    partitions=2526/2526 files=28976 size=6.89TB                               |
      |    stats-rows=179999978268 extrapolated-rows=disabled                         |
      |    table stats: rows=179999978268 size=unavailable                            |
      |    column stats: all                                                          |
      |    mem-estimate=88.00MB mem-reservation=0B                                    |
      |    tuple-ids=0 row-size=8B cardinality=179999978268                           |
      +-------------------------------------------------------------------------------+
      
      +--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
      | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                      |
      +--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
      | 03:AGGREGATE | 1      | 1.28ms   | 1.28ms   | 1      | 1          | 532.00 KB | 10.00 MB      | FINALIZE                    |
      | 02:EXCHANGE  | 1      | 2.56s    | 2.56s    | 129    | 1          | 0 B       | 0 B           | UNPARTITIONED               |
      | 01:AGGREGATE | 129    | 4.89ms   | 62.84ms  | 129    | 1          | 20.00 KB  | 10.00 MB      |                             |
      | 00:SCAN HDFS | 129    | 62.44ms  | 341.03ms | 28.98K | 180.00B    | 1.75 MB   | 88.00 MB      | tpch_30000_parquet.lineitem |
      +--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
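
The mismatch in the summary above (28.98K actual rows against an estimate of 180.00B) can be illustrated with a minimal Python sketch of the proposed estimate. This is not Impala's planner code: `ScanStats`, its fields, and `estimate_scan_cardinality` are hypothetical names, and the only inputs are the `rows=` and `files=` values from the plan. It assumes the optimized scan emits roughly one row per file, as the title of this issue suggests.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    """Planner-side facts about a scan node (hypothetical structure)."""
    table_rows: int                # rows= from the table stats
    num_files: int                 # files= from the SCAN HDFS node
    uses_parquet_count_star: bool  # True when the IMPALA-5036 optimization applies


def estimate_scan_cardinality(stats: ScanStats) -> int:
    """Return the estimated number of rows the scan node emits."""
    if stats.uses_parquet_count_star:
        # Assumption: the scan emits about one row per file, carrying
        # num_rows values that sum_init_zero then adds up, so the file
        # count is a closer estimate than the table row count.
        return stats.num_files
    # Regular scans keep the table cardinality as their estimate.
    return stats.table_rows


# Values taken from the EXPLAIN output above.
lineitem = ScanStats(
    table_rows=179_999_978_268,
    num_files=28_976,
    uses_parquet_count_star=True,
)
print(estimate_scan_cardinality(lineitem))
```

With the values from the plan, the sketch yields 28976, which is the same order as the 28.98K rows the scan actually produced. Row groups could be used instead of files if that count is available at planning time.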
      

          People

            Assignee: Unassigned
            Reporter: Mostafa Mokhtar (mmokhtar)