Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Versions: Impala 2.3.0, Impala 2.5.0
Description
Improve Parquet scanner performance by materialising many values of each column at a time. This would give tighter loops and better memory access patterns, and could avoid a virtual function call to ReadValue() in the inner loop. Currently the scanner essentially does:
for (row = 0; row < num_rows; ++row) {
  start a new row
  for (col = 0; col < num_cols; ++col) {
    materialise next value for column
    evaluate probe filter for column
  }
  if (probe filters passed && EvalConjuncts(row)) {
    add row to output batch
  }
}
This would change to something like:
initialise buffer of num_rows values for each column
initialise bitmap with num_rows bits. Bit = 1 means filter row out.
for (col = 0; col < num_cols; ++col) {
  materialise num_rows values into buffer
  during materialisation, set bits in bitmap where probe filter returns false
}
for (row = 0; row < num_rows; ++row) {
  if (bitmap[row] == 1) continue
  materialise row from column buffer
  if (EvalConjuncts(row)) {
    add row to output batch
  }
}
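To make the two loop structures concrete, here is a minimal C++ sketch of both, written so their outputs can be compared. It is not Impala code: `Column`, `Row`, `ProbeFilter`, `Conjuncts`, `NumRows`, `ScanRowWise` and `ScanColumnWise` are hypothetical names, columns are assumed to be already-decoded `int64_t` vectors of equal length, and `std::function` is used only for brevity (it does not show the virtual-call savings the proposal is after).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names come from Impala.
using Column = std::vector<int64_t>;                // decoded values of one column
using Row = std::vector<int64_t>;                   // one materialised row
using ProbeFilter = std::function<bool(int64_t)>;   // per-column filter, true = keep
using Conjuncts = std::function<bool(const Row&)>;  // row-level predicate

// Returns the common row count, asserting that all columns have the same length.
size_t NumRows(const std::vector<Column>& cols) {
  const size_t n = cols.empty() ? 0 : cols[0].size();
  for (const Column& c : cols) assert(c.size() == n);
  return n;
}

// Row-at-a-time: every column is visited once per row in the inner loop.
std::vector<Row> ScanRowWise(const std::vector<Column>& cols,
                             const std::vector<ProbeFilter>& filters,
                             const Conjuncts& eval_conjuncts) {
  assert(cols.size() == filters.size());
  const size_t num_rows = NumRows(cols);
  std::vector<Row> out;
  for (size_t row = 0; row < num_rows; ++row) {
    Row r(cols.size());
    bool passed = true;
    for (size_t col = 0; col < cols.size(); ++col) {
      r[col] = cols[col][row];                          // materialise next value
      passed = passed && filters[col](r[col]);          // evaluate probe filter
    }
    if (passed && eval_conjuncts(r)) out.push_back(r);  // add row to output batch
  }
  return out;
}

// Column-at-a-time: fill one buffer per column, recording filter misses in a bitmap.
std::vector<Row> ScanColumnWise(const std::vector<Column>& cols,
                                const std::vector<ProbeFilter>& filters,
                                const Conjuncts& eval_conjuncts) {
  assert(cols.size() == filters.size());
  const size_t num_rows = NumRows(cols);
  std::vector<Column> buffers(cols.size(), Column(num_rows));
  std::vector<bool> filtered_out(num_rows, false);  // true means filter row out
  for (size_t col = 0; col < cols.size(); ++col) {
    for (size_t row = 0; row < num_rows; ++row) {
      buffers[col][row] = cols[col][row];
      if (!filters[col](buffers[col][row])) filtered_out[row] = true;
    }
  }
  std::vector<Row> out;
  for (size_t row = 0; row < num_rows; ++row) {
    if (filtered_out[row]) continue;
    Row r(cols.size());
    for (size_t col = 0; col < cols.size(); ++col) r[col] = buffers[col][row];
    if (eval_conjuncts(r)) out.push_back(r);
  }
  return out;
}
```

Given the same columns, filters and conjuncts, both functions should return the same rows in the same order; only the loop nesting differs. In the column-wise version the inner loop runs over a single column's values, which is what would allow a tight, type-specialised loop in a real scanner.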
Issue Links
- blocks: IMPALA-2017 Lazy materialization of Parquet columns during query (Open)
- is blocked by: IMPALA-2735 Push down conjunct evaluation into Parquet column readers (Open)