[IMPALA-5036] Improve COUNT(*) performance of Parquet scans. - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Fix Version/s: Impala 2.10.0
Component/s: Backend
Labels:

Description

select count(*) from parquet_table;
select count(*) from parquet_table group by partition_col;

Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows:

Execution change
Instead of returning empty batches until num_rows rows have been returned, the Parquet scanner can populate a single slot with the num_rows value taken from the Parquet row groups.
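The difference between the two scan behaviours can be sketched roughly as follows. This is an illustrative Python model, not Impala's C++ scanner: `RowGroupMeta`, `BATCH_SIZE`, and both function names are hypothetical, and emitting one output row per row group is an assumption about the granularity.

```python
from dataclasses import dataclass
from typing import Iterator, List

BATCH_SIZE = 1024  # hypothetical batch capacity, chosen only for illustration


@dataclass
class RowGroupMeta:
    """Hypothetical stand-in for the metadata of one Parquet row group."""
    num_rows: int


def scan_empty_batches(row_groups: List[RowGroupMeta]) -> Iterator[int]:
    """Existing behaviour: emit empty batches until num_rows rows are covered.

    Each yielded value is the row count of one (column-less) batch.
    """
    for rg in row_groups:
        remaining = rg.num_rows
        while remaining > 0:
            n = min(BATCH_SIZE, remaining)
            yield n
            remaining -= n


def scan_num_rows_slot(row_groups: List[RowGroupMeta]) -> Iterator[int]:
    """Proposed behaviour: emit one row per row group; its slot holds num_rows."""
    for rg in row_groups:
        yield rg.num_rows


row_groups = [RowGroupMeta(5000), RowGroupMeta(300)]
old_batches = list(scan_empty_batches(row_groups))  # many batches to count
new_rows = list(scan_num_rows_slot(row_groups))     # one value per row group
```

In this model both paths account for the same total number of rows, but the second hands far fewer items to the aggregation above the scan.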

Plan change
The count local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
The final distributed plan will be:
scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))
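A minimal sketch of the two aggregation stages, again as a Python model rather than Impala code. The function names, the node layout, and the partition values are made up for illustration; the grouping key mirrors the `group by partition_col` query above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One scan output row: (partition_col value, num_rows_slot). Values are invented.
ScanRow = Tuple[str, int]


def local_agg(scan_rows: Iterable[ScanRow]) -> Dict[str, int]:
    """Per-node stage: sum(num_rows_slot) grouped by partition_col."""
    sums: Dict[str, int] = defaultdict(int)
    for key, num_rows in scan_rows:
        sums[key] += num_rows
    return dict(sums)


def merge_agg(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Merge stage: sum the per-node partial sums for each group."""
    merged: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, partial_sum in partial.items():
            merged[key] += partial_sum
    return dict(merged)


# Two hypothetical nodes, each scanning a few row groups.
node_a = [("2017-01", 5000), ("2017-01", 300), ("2017-02", 1200)]
node_b = [("2017-02", 800)]

result = merge_agg([local_agg(node_a), local_agg(node_b)])
```

The merge stage is a plain sum of sums, which is why the local count has to become a sum once the scan emits row counts instead of rows.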

This optimization is applicable where there is only a count(*) and there are no scan predicates.

Issue Links

breaks

IMPALA-5650 COUNT(*) optimization causes a crash when legacy aggregation is enabled (Resolved)
IMPALA-5679 Count star optimization gives incorrect result for parquet table partitioned by STRING column (Resolved)
IMPALA-5648 Count star optimisation regressed Parquet memory estimate accuracy (Resolved)

is related to

IMPALA-6501 Optimize count(*) for Kudu scans (Resolved)

People

Assignee: Taras Bobrovytsky

Reporter: Alexander Behm

Votes: 0

Watchers: 8

Dates

Created: 06/Mar/17 23:52

Updated: 05/Apr/22 13:44

Resolved: 06/Jul/17 18:36