Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
ARROW-13572 (https://github.com/apache/arrow/pull/10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step by reading only the columns the scan needs.
The tricky part is converting a field name in the Arrow schema to the corresponding column index in the ORC schema. Currently, this logic lives in the Python bindings (https://github.com/apache/arrow/blob/5ca62b910d2de4e705560bef28259b966c7b0dcf/python/pyarrow/orc.py#L36-L59), so it needs to be moved to C++.
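As a rough illustration of the name-to-index mapping, the sketch below walks a nested schema in pre-order and records a column id for each dotted field path. It assumes ORC's convention that the root struct has id 0 and every nested type node takes the next id in depth-first order. `TypeNode`, `AssignIds`, and `FieldNameToOrcIndex` are hypothetical names used only for illustration; they are not Arrow or ORC APIs, and the sketch does not attempt to reproduce any type-specific handling in the existing Python code.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested schema type; not an Arrow or ORC class.
struct TypeNode {
  std::string name;                // field name; empty for unnamed children (e.g. list items)
  std::vector<TypeNode> children;  // nested fields, in schema order
};

// Visit children in pre-order, giving each type node the next column id.
// Named nodes are recorded under their dotted path; unnamed nodes still
// consume an id but get no entry of their own.
void AssignIds(const TypeNode& node, const std::string& prefix, int* next_id,
               std::map<std::string, int>* out) {
  for (const TypeNode& child : node.children) {
    const int id = (*next_id)++;
    std::string path = prefix;
    if (!child.name.empty()) {
      path = prefix.empty() ? child.name : prefix + "." + child.name;
      (*out)[path] = id;
    }
    AssignIds(child, path, next_id, out);
  }
}

// Build the full "field path -> column id" map for a schema root.
std::map<std::string, int> FieldNameToOrcIndex(const TypeNode& root) {
  std::map<std::string, int> out;
  int next_id = 1;  // assumption: id 0 is reserved for the root struct itself
  AssignIds(root, "", &next_id, &out);
  return out;
}
```

With a map like this, a projection given as field names could be translated into the list of column ids to request from the ORC reader, so that only those columns are read.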