Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Description
The scan cardinality estimate for count queries doesn't account for the fact that the count
optimization only scans metadata and not the actual columns.
Scan node from the summary of a count query on Parquet store_sales (2.71K rows actually returned versus 8.64B estimated):
Operator    #Hosts  #Inst  Avg Time  Max Time  #Rows  Est. #Rows  Peak Mem   Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------------------------------------------------
00:SCAN S3  6       72     8s131ms   8s496ms   2.71K  8.64B       128.00 KB  88.00 MB       tpcds_3000_string_parquet_managed.store_sales
This affects all file and table formats that implement count optimizations: Parquet, and probably also ORC and Iceberg.
The problem is more serious than it was in the past: with IMPALA-12091 we now rely on scan cardinality estimates for executor group assignment, so count queries are likely to be assigned to a larger executor group than needed.
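
A minimal sketch of the kind of adjustment this implies. It is not Impala's planner code. It assumes the optimized scan emits roughly one row per metadata unit (for example, one per file or row group). The function and parameter names are hypothetical, and the figures are the rounded values from the summary above.

```python
def estimate_scan_cardinality(total_rows, num_metadata_units, count_optimized):
    """Return an estimated number of rows produced by a scan node.

    total_rows:         table row count from stats (what the estimate uses today)
    num_metadata_units: ASSUMPTION - number of files/row groups whose metadata
                        the optimized scan reads, one output row per unit
    count_optimized:    True when the count optimization applies to this scan
    """
    if count_optimized:
        # The scan reads only metadata, so it cannot emit more rows than
        # there are metadata units (and never more than the table holds).
        return min(total_rows, num_metadata_units)
    return total_rows


# Rounded values from the summary: 8.64B estimated, 2.71K actually returned.
naive = estimate_scan_cardinality(8_640_000_000, 2_710, count_optimized=False)
adjusted = estimate_scan_cardinality(8_640_000_000, 2_710, count_optimized=True)
print(naive, adjusted)
```

With an estimate on the order of the metadata unit count rather than the table row count, executor group selection would see a scan several orders of magnitude smaller.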