Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
https://issues.apache.org/jira/browse/IMPALA-9228 introduced vectorization for primitive types and structs. This Jira covers the same for collections (array, map) and for structs containing collections.
Prerequisites:
1) Check how IMPALA-9228 introduces scratch batches to hold a batch of rows, and how a scratch batch is populated with primitive values or struct fields.
2) Read the following document to understand the difference between materialising and non-materialising collection readers: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A
3) Check how the Parquet scanner handles collections when populating its scratch batch.
Implementation details:
1) Materialising collection readers should be handled similarly to primitive types. In this case each collection reader writes one slot into the outgoing RowBatch for each collection it reads. In other words, one collection is represented as one CollectionValue in the RowBatch.
2) The other case is when the top-level collection reader doesn't materialise directly into the RowBatch and instead delegates the materialisation to its children. In this case it's not guaranteed that the number of required slots in the RowBatch equals the number of collections in the collection reader.
E.g.: Let's assume a table with one column: a list of integers. If the top-level ListColumnReader is not materialising, then its child, the IntColumnReader, will be. The number of required slots is then the number of int values within the collections, not the number of collections as it would be if the ListColumnReader were materialising directly.
As a result, while the scratch batch is being populated we might get into a situation where a whole collection doesn't fit into the scratch batch. Check how Parquet handles this case.
3) Once populating the scratch batch works for collections, it has to be verified that codegen is also run in these cases. It should work out of the box, but let's make sure.
4) Currently the ORC scanner chooses between row-by-row processing of the rows read by the ORC reader and scratch batch reading. Once this Jira is implemented, the row-by-row approach is no longer needed.
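To make the slot-count difference in items 1) and 2) concrete, here is a minimal, self-contained C++ sketch. It is not Impala or ORC code: `ListColumn`, `Cursor`, `SlotsIfListMaterialises`, `SlotsIfChildMaterialises` and `FillScratch` are hypothetical names invented for illustration, and a plain `std::vector` stands in for the scratch batch. The resume-cursor approach shown for a collection that doesn't fit is only one possible way to handle the overflow; it is not a description of what the Parquet scanner actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a list<int> column: each inner vector is one collection.
using ListColumn = std::vector<std::vector<int32_t>>;

// Materialising list reader: one slot (one CollectionValue-like entry) per collection.
inline std::size_t SlotsIfListMaterialises(const ListColumn& col) {
  return col.size();
}

// Non-materialising list reader: the child int reader writes one slot per element.
inline std::size_t SlotsIfChildMaterialises(const ListColumn& col) {
  std::size_t total = 0;
  for (const auto& items : col) total += items.size();
  return total;
}

// Position to resume from when a collection did not fit into the previous batch.
struct Cursor {
  std::size_t collection = 0;  // index of the collection being read
  std::size_t element = 0;     // index of the next element inside that collection
};

// Fills 'scratch' with at most 'capacity' values starting at 'cur' and advances
// 'cur'. A collection larger than the remaining capacity is split across calls.
// Returns the number of values written. Assumes capacity > 0 for progress.
inline std::size_t FillScratch(const ListColumn& col, std::size_t capacity,
                               Cursor* cur, std::vector<int32_t>* scratch) {
  assert(cur != nullptr && scratch != nullptr);
  scratch->clear();
  while (cur->collection < col.size() && scratch->size() < capacity) {
    const auto& items = col[cur->collection];
    if (cur->element < items.size()) {
      scratch->push_back(items[cur->element++]);
    }
    // Move on once the current collection is exhausted (also skips empty ones).
    if (cur->element >= items.size()) {
      ++cur->collection;
      cur->element = 0;
    }
  }
  return scratch->size();
}
```

For a column holding the collections {1, 2, 3}, {4}, {} and {5, 6}, the materialising case needs 4 slots while the non-materialising case needs 6. With a scratch capacity of 2, the first collection is split across two `FillScratch` calls, which is the situation item 2) warns about.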