[IMPALA-9228] ORC scanner could be vectorized - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Impala 4.0.0, Impala 3.4.0
Component/s: None
Labels:
- orc

Epic Link: Basic ORC Support

Description

The ORC scanner uses an external library to read ORC files. The library reads the file contents into its own in-memory representation, which is vectorized (columnar) and similar to the Arrow format.

Impala needs to convert the ORC row batch to an Impala row batch. Currently the conversion happens row-wise via virtual function calls:

https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/hdfs-orc-scanner.cc#L671

https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L352
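To illustrate the cost of the row-wise pattern, here is a minimal sketch. It is not the code behind the links above: `ColumnVector`, `Row`, `ColumnReader`, `Int64Reader` and `ConvertRowWise` are hypothetical stand-ins, and only 64-bit integer columns are modelled. The point it shows is that the number of virtual calls grows as rows times columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one decoded column of an ORC-style batch.
struct ColumnVector {
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an output row: one slot per column.
struct Row {
  std::vector<int64_t> slots;
};

// Row-wise reader interface: one virtual call per (row, column) value.
class ColumnReader {
 public:
  virtual ~ColumnReader() = default;
  virtual void ReadValue(std::size_t row_idx, Row* row) const = 0;
};

class Int64Reader : public ColumnReader {
 public:
  Int64Reader(const ColumnVector* col, std::size_t slot)
      : col_(col), slot_(slot) {}

  void ReadValue(std::size_t row_idx, Row* row) const override {
    assert(row_idx < col_->values.size());
    row->slots[slot_] = col_->values[row_idx];
  }

 private:
  const ColumnVector* col_;  // source column, not owned
  std::size_t slot_;         // destination slot in the row
};

// Converts num_rows rows and returns how many virtual calls were made.
std::size_t ConvertRowWise(
    const std::vector<std::unique_ptr<ColumnReader>>& readers,
    std::size_t num_rows, std::vector<Row>* out) {
  std::size_t virtual_calls = 0;
  out->assign(num_rows, Row{std::vector<int64_t>(readers.size())});
  for (std::size_t r = 0; r < num_rows; ++r) {
    // The inner loop jumps between columns for every single row.
    for (const auto& reader : readers) {
      reader->ReadValue(r, &(*out)[r]);
      ++virtual_calls;
    }
  }
  return virtual_calls;
}
```

With two columns and three rows, this sketch makes six virtual calls, each touching a different source column than the previous one.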

Instead, it could work like the Parquet scanner, which fills the columns one by one into a scratch batch and then evaluates the conjuncts on the scratch batch. For more details, see HdfsParquetScanner::AssembleRows():

https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1077-L1088

This way we'll need far fewer virtual function calls, and the memory reads and writes will be much more localized and predictable.
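The column-wise idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not Impala's actual scratch batch or `HdfsParquetScanner::AssembleRows()`: `ColumnVector`, `ScratchBatch`, `FillColumn`, `Conjunct` and this `AssembleRows` are hypothetical names, the scratch batch is assumed to be a row-major buffer of 64-bit integers, and conjuncts are modelled as plain callables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one decoded column of an ORC-style batch.
struct ColumnVector {
  std::vector<int64_t> values;
};

// Hypothetical scratch batch: row-major buffer with num_cols slots per row.
struct ScratchBatch {
  std::size_t num_rows = 0;
  std::size_t num_cols = 0;
  std::vector<int64_t> slots;

  int64_t Get(std::size_t row, std::size_t col) const {
    return slots[row * num_cols + col];
  }
};

// Fills one column for all rows in a tight loop. The source column is read
// sequentially and there is no per-value virtual dispatch.
void FillColumn(const ColumnVector& col, std::size_t col_idx,
                ScratchBatch* batch) {
  assert(col.values.size() == batch->num_rows);
  for (std::size_t r = 0; r < batch->num_rows; ++r) {
    batch->slots[r * batch->num_cols + col_idx] = col.values[r];
  }
}

// A conjunct is modelled as a predicate over one row of the scratch batch.
using Conjunct = std::function<bool(const ScratchBatch&, std::size_t)>;

// Fills the scratch batch column by column, then evaluates the conjuncts on
// it. Returns the indices of the rows that pass every conjunct.
std::vector<std::size_t> AssembleRows(const std::vector<ColumnVector>& cols,
                                      const std::vector<Conjunct>& conjuncts,
                                      ScratchBatch* batch) {
  batch->num_cols = cols.size();
  batch->num_rows = cols.empty() ? 0 : cols[0].values.size();
  batch->slots.assign(batch->num_rows * batch->num_cols, 0);

  for (std::size_t c = 0; c < cols.size(); ++c) {
    FillColumn(cols[c], c, batch);
  }

  std::vector<std::size_t> selected;
  for (std::size_t r = 0; r < batch->num_rows; ++r) {
    bool pass = true;
    for (const auto& conjunct : conjuncts) {
      if (!conjunct(*batch, r)) {
        pass = false;
        break;
      }
    }
    if (pass) selected.push_back(r);
  }
  return selected;
}
```

In this sketch any type dispatch would happen once per column per batch rather than once per value, which is where the reduction in virtual calls would come from.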

Attachments

1-4_col_measurement_int_only.png, 24/Jan/20 15:41, 54 kB, Gabor Kaszab

People

Assignee: Gabor Kaszab

Reporter: Zoltán Borók-Nagy

Votes: 0

Watchers: 4

Dates

Created: 10/Dec/19 15:56

Updated: 05/Feb/24 07:52

Resolved: 04/Mar/20 07:40