IMPALA-7653

Improve accuracy of compute incremental stats cardinality estimation


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.0
    • Fix Version/s: None
    • Component/s: Frontend

    Description

      Currently, the operators in the subquery generated by compute [incremental] stats estimate cardinality from combined selectivities, as for any other query. This affects aggregations in particular. For example, note the estimated cardinality of the aggregation in this subquery's plan:

      F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=4
      Per-Host Resources: mem-estimate=305.20GB mem-reservation=136.00MB
      01:AGGREGATE [STREAMING]
      |  output: [...]
      |  group by: col_a, col_b, col_c
      |  mem-estimate=76.21GB mem-reservation=34.00MB spill-buffer=2.00MB
      |  tuple-ids=1 row-size=104.83KB cardinality=693000
      |
      00:SCAN HDFS [default.test, RANDOM]
         partitions=1/554 files=1 size=109.65MB
         stats-rows=1506374 extrapolated-rows=disabled
         table stats: rows=821958291 size=unavailable
         column stats: all
         mem-estimate=88.00MB mem-reservation=0B
         tuple-ids=0 row-size=2.06KB cardinality=1506374
      

      This plan was generated by compute incremental stats on a single partition, so the aggregation actually outputs a single row rather than the estimated 693000. Because the intermediate rows are wide (row-size=104.83KB here), such overestimates inflate the memory estimate (76.21GB for the aggregation above). Since the number of partitions to be updated is known at plan time, Impala could use it to set the aggregation's cardinality.
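
The proposal can be illustrated with a minimal sketch. It is not Impala's planner code: the function name `capped_agg_cardinality` and its parameters are hypothetical, and it assumes the aggregation groups by the partition columns, so each updated partition contributes at most one output row. The numbers are taken from the plan above.

```python
def capped_agg_cardinality(selectivity_estimate: int, partitions_to_update: int) -> int:
    """Hypothetical helper: bound the aggregation's estimated output rows
    by the number of partitions being updated (known at plan time)."""
    if selectivity_estimate < 0 or partitions_to_update < 0:
        raise ValueError("estimates must be non-negative")
    # Assumption: one output row per updated partition at most, so the
    # partition count is an upper bound on the aggregation's cardinality.
    return min(selectivity_estimate, partitions_to_update)


# Values from the plan: cardinality=693000, one partition scanned.
default_rows = 693000
capped_rows = capped_agg_cardinality(default_rows, partitions_to_update=1)
print(default_rows, "->", capped_rows)
```

Taking the minimum rather than overwriting the estimate keeps the existing selectivity-based value whenever it is already the smaller of the two.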

          People

            Assignee: Unassigned
            Reporter: Balazs Jeszenszky (jeszyb)
            Votes: 0
            Watchers: 3
