[IMPALA-7653] Improve accuracy of compute incremental stats cardinality estimation - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.0
Fix Version/s: None
Component/s: Frontend
Labels: resource-management

Description

Currently, the operators in the subquery of a compute [incremental] stats statement estimate cardinality from combined selectivities, as for any other query, e.g. for the aggregation. For example, note the estimated cardinality of the aggregation in this subquery's plan:

F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=4
Per-Host Resources: mem-estimate=305.20GB mem-reservation=136.00MB
01:AGGREGATE [STREAMING]
|  output: [...]
|  group by: col_a, col_b, col_c
|  mem-estimate=76.21GB mem-reservation=34.00MB spill-buffer=2.00MB
|  tuple-ids=1 row-size=104.83KB cardinality=693000
|
00:SCAN HDFS [default.test, RANDOM]
   partitions=1/554 files=1 size=109.65MB
   stats-rows=1506374 extrapolated-rows=disabled
   table stats: rows=821958291 size=unavailable
   column stats: all
   mem-estimate=88.00MB mem-reservation=0B
   tuple-ids=0 row-size=2.06KB cardinality=1506374

This plan was generated by compute incremental stats on a single partition, so the aggregation outputs a single row, yet its estimated cardinality is 693000. Because the intermediate rows are very wide (row-size=104.83KB here), such overestimates lead to bloated memory estimates (mem-estimate=76.21GB for the aggregation). Since the number of partitions to be updated is known at plan time, Impala could use it to set the aggregation's cardinality.
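
The proposal can be illustrated with a minimal sketch. This is not Impala's planner code (the frontend is written in Java); the function name and parameters below are hypothetical and only show the idea of capping the selectivity-based estimate by the partition count known at plan time:

```python
def capped_agg_cardinality(selectivity_estimate: int,
                           num_partitions_to_update: int) -> int:
    """Hypothetical helper: bound the aggregation's cardinality estimate.

    Assumption (from the issue text): the stats aggregation outputs one
    row per partition being updated, so the partition count is an upper
    bound on the number of output rows.
    """
    if num_partitions_to_update < 0:
        raise ValueError("partition count must be non-negative")
    # Assumption: a negative estimate means "unknown", in which case the
    # partition count is the only information available.
    if selectivity_estimate < 0:
        return num_partitions_to_update
    return min(selectivity_estimate, num_partitions_to_update)


# Values taken from the plan above: estimate of 693000 rows, 1 partition.
print(capped_agg_cardinality(693000, 1))
```

With the values from the plan above, the capped estimate would be 1 row instead of 693000, which is what any memory estimate derived from cardinality and row width would then be based on.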

People

Assignee: Unassigned

Reporter: Balazs Jeszenszky

Votes: 0

Watchers: 3

Dates

Created: 04/Oct/18 14:56

Updated: 05/Oct/19 21:12