Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Impala 2.8.0
None
Description
Example queries:
select min(int_col), max(bigint_col) from parquet_table;
select min(int_col), max(bigint_col) from parquet_table group by partition_col;
select min(int_col), max(int_col) from parquet_table; -- slightly trickier because int_col is referenced twice
The slot values for int_col and bigint_col can be directly filled in from the parquet::Statistics, assuming stats are available for both columns. No columns need to be scanned/materialized.
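As a rough illustration of the idea above, the sketch below folds per-row-group min/max values into a single result for one column. It is not Impala code: `ColumnChunkStats` and `min_max_from_stats` are hypothetical names, and the class only stands in for the min/max fields of parquet::Statistics.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ColumnChunkStats:
    """Hypothetical stand-in for the min/max fields of parquet::Statistics."""
    min_value: Optional[int]
    max_value: Optional[int]


def min_max_from_stats(
    chunks: Sequence[ColumnChunkStats],
) -> Optional[Tuple[int, int]]:
    """Fold per-row-group stats into one (min, max) pair for a column.

    Returns None when there are no row groups or when any chunk lacks
    stats; in that case the caller would have to fall back to a scan.
    """
    if not chunks:
        return None
    if any(c.min_value is None or c.max_value is None for c in chunks):
        return None
    # The column-wide min is the smallest per-chunk min, and likewise for max.
    return (
        min(c.min_value for c in chunks),
        max(c.max_value for c in chunks),
    )
```

For the query that references int_col twice, one such pair would supply both slots. For the group by partition_col query, the same fold could presumably be run once per partition, assuming every row in a file shares the same partition value.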
This JIRA focuses on implementing this optimization in the simple case where all scanned columns feed into min/max aggregates and where all columns have parquet::Statistics. Those conditions can be relaxed, but that work should be tracked separately.
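This optimization opportunity must be detected by the planner. It is not applicable when there are scan predicates, because the statistics describe all rows in a row group, including rows a predicate would filter out.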
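The conditions described for the simple case could be checked along the following lines. This is a minimal sketch under stated assumptions, not the planner's actual logic: `AggExpr`, `can_use_stats_only`, and all parameter names are hypothetical, and partition columns are assumed not to count as scanned columns because they are not read from the Parquet files.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class AggExpr:
    """Hypothetical aggregate expression: a function applied to one column."""
    func: str    # e.g. "min", "max", "count"
    column: str  # column the aggregate is applied to


def can_use_stats_only(
    agg_exprs: Sequence[AggExpr],
    scanned_columns: Set[str],
    columns_with_stats: Set[str],
    has_scan_predicates: bool,
) -> bool:
    """Return True if the simple case described in this ticket applies."""
    # Scan predicates rule the optimization out entirely.
    if has_scan_predicates:
        return False
    # Only min/max aggregates are covered by the simple case.
    if not agg_exprs or any(a.func not in ("min", "max") for a in agg_exprs):
        return False
    agg_columns = {a.column for a in agg_exprs}
    # Every scanned column must feed into a min/max aggregate.
    if not scanned_columns <= agg_columns:
        return False
    # Every aggregated column must have statistics available.
    return agg_columns <= columns_with_stats
```

Because the check works on the set of aggregated columns, a column referenced by both min and max is handled the same way as one referenced once.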