[IMPALA-12395] Planner overestimates scan cardinality for queries using count star optimization - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 4.3.0
Component/s: fe
Labels: None

Description

The scan cardinality estimate for count queries doesn't account for the fact that the count optimization only scans metadata and not the actual columns.

Scan for a count query on Parquet store_sales:

Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------------------------------------
00:SCAN S3  6       72     8s131ms   8s496ms   2.71K  8.64B       128.00 KB  88.00 MB       tpcds_3000_string_parquet_managed.store_sales

This is a problem with all file/table formats that implement count optimizations (Parquet and also probably ORC and Iceberg).

This problem is more serious than it was in the past: since IMPALA-12091, we rely on scan cardinality estimates for executor group assignment, so count queries are likely to be assigned to a larger executor group than needed.
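
To make the size of the gap concrete, here is a minimal Python sketch of the two estimates. It is not Impala's planner code or the actual fix. The names ScanStats, metadata_units and estimate_scan_cardinality are hypothetical, and the sketch assumes that a count-optimized scan emits roughly one row per metadata unit read (for example a file or row group). The numbers are the rounded values from the profile above.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    table_rows: int      # total row count from table statistics
    metadata_units: int  # hypothetical: files or row groups whose metadata is read


def estimate_scan_cardinality(stats: ScanStats, count_star_optimized: bool) -> int:
    """Return an illustrative estimate of the rows the scan node produces."""
    if count_star_optimized:
        # Assumption: the optimized scan reads only metadata and emits
        # about one row per metadata unit, never fewer than one row.
        return max(1, stats.metadata_units)
    # Without the optimization the scan materializes every row.
    return stats.table_rows


# Rounded values from the profile: 8.64B estimated rows, 2.71K actual rows.
store_sales = ScanStats(table_rows=8_640_000_000, metadata_units=2_710)

naive = estimate_scan_cardinality(store_sales, count_star_optimized=False)
adjusted = estimate_scan_cardinality(store_sales, count_star_optimized=True)
print(f"naive={naive} adjusted={adjusted} ratio={naive / adjusted:.0f}x")
```

Under these assumptions the unadjusted estimate is several million times larger than the number of rows the scan actually returns, which is the kind of gap that can push a cheap count query into an oversized executor group.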

People

Assignee: Riza Suminto

Reporter: David Rorke

Votes: 0

Watchers: 4

Dates

Created: 22/Aug/23 21:48

Updated: 29/Aug/23 03:34

Resolved: 29/Aug/23 03:34