Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Description
select count(*) from parquet_table; select count(*) from parquet_table group by partition_col;
Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows:
Execution change
Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups
Plan change
The count local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
The final distributed plan will be:
scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))
This optimization is applicable where is only a count and there are no scan predicates.
Attachments
Issue Links
- breaks
-
IMPALA-5650 COUNT(*) optimization causes a crash when legacy aggregation is enabled
- Resolved
-
IMPALA-5679 Count star optimization gives incorrect result for parquet table partitioned by STRING column
- Resolved
-
IMPALA-5648 Count star optimisation regressed Parquet memory estimate accuracy
- Resolved
- is related to
-
IMPALA-6501 Optimize count(*) for Kudu scans
- Resolved