Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 8.0.0
- Environment: Linux, golang 1.18, AMD64
Description
I have submitted a pull request with the changes.
https://github.com/apache/arrow/pull/13120#issuecomment-1123982147
In pqarrow, when getting column readers for columns and struct members, the default behavior is a for loop that processes each column serially. "Getting" a reader triggers a read request, so these reads are always issued one after another. The code that retrieves the next batch of records iterates over the columns in the same way, so those reads are also issued serially. The performance impact is especially large on high-latency storage such as cloud object stores.
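For concreteness, a minimal sketch of the serial pattern described above; ColumnReader and getColumnReader are hypothetical stand-ins for the pqarrow internals, not the library's actual identifiers:

```go
package pqarrowsketch

import "context"

// ColumnReader and getColumnReader are hypothetical stand-ins for the
// pqarrow internals; they are not the library's actual identifiers.
type ColumnReader interface{}

func getColumnReader(ctx context.Context, idx int) (ColumnReader, error) {
	// Constructing a reader triggers a read request against the
	// underlying file (local disk, GCS, S3, ...).
	return nil, nil
}

// getColumnReadersSerially mirrors the current behavior: one blocking
// read per column, issued back to back.
func getColumnReadersSerially(ctx context.Context, numCols int) ([]ColumnReader, error) {
	readers := make([]ColumnReader, numCols)
	for i := 0; i < numCols; i++ {
		r, err := getColumnReader(ctx, i)
		if err != nil {
			return nil, err
		}
		readers[i] = r
	}
	return readers, nil
}
```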
I'm working with complex Parquet files that have 500+ "root" columns, where some fields are lists of structs and some of those structs have hundreds of columns. In my tests, 800+ read operations were issued to GCS serially, which makes pqarrow in its current state too slow to be usable.
The proposed change is to process the columns concurrently when retrieving child readers and column readers, and to issue the batch read requests concurrently as well.
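A minimal sketch of the concurrent approach, reusing the hypothetical ColumnReader/getColumnReader declarations from the sketch above and using golang.org/x/sync/errgroup for the fan-out; the actual PR's structure may differ:

```go
package pqarrowsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// getColumnReadersConcurrently fans the per-column reader creation out to
// goroutines so the underlying read requests overlap instead of being
// issued one after another.
func getColumnReadersConcurrently(ctx context.Context, numCols, maxParallel int) ([]ColumnReader, error) {
	readers := make([]ColumnReader, numCols)
	g, gctx := errgroup.WithContext(ctx)
	if maxParallel > 0 {
		// Bound the fan-out so we don't open too many requests at once.
		g.SetLimit(maxParallel)
	}
	for i := 0; i < numCols; i++ {
		i := i // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			r, err := getColumnReader(gctx, i)
			if err != nil {
				return err
			}
			// Each goroutine writes a distinct index, so no extra locking is needed.
			readers[i] = r
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return readers, nil
}
```

Bounding the fan-out keeps the number of simultaneous requests to the object store manageable while still overlapping their latency.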