Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 8.0.0
- Environment: Linux, golang 1.18, AMD64
Description
I have submitted a pull request with the changes.
https://github.com/apache/arrow/pull/13120#issuecomment-1123982147
In pqarrow, when getting column readers for columns and struct members, the default behavior is a for loop that processes each column serially. "Getting" a reader triggers a read request, so these reads are always issued one after another. The code that retrieves the next batch of records iterates over the columns in the same way, so those reads are also issued serially. The performance impact is especially large on high-latency storage such as cloud object stores.
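For concreteness, a minimal sketch of the serial pattern described above; ColumnReader and getColumnReader are hypothetical stand-ins for the pqarrow internals, not the library's actual identifiers:

```go
package pqarrowsketch

import "context"

// ColumnReader and getColumnReader are hypothetical stand-ins for the
// pqarrow internals; they are not the library's actual identifiers.
type ColumnReader interface{}

func getColumnReader(ctx context.Context, idx int) (ColumnReader, error) {
	// Constructing a reader triggers a read request against the
	// underlying file (local disk, GCS, S3, ...).
	return nil, nil
}

// getColumnReadersSerially mirrors the current behavior: one blocking
// read per column, issued back to back.
func getColumnReadersSerially(ctx context.Context, numCols int) ([]ColumnReader, error) {
	readers := make([]ColumnReader, numCols)
	for i := 0; i < numCols; i++ {
		r, err := getColumnReader(ctx, i)
		if err != nil {
			return nil, err
		}
		readers[i] = r
	}
	return readers, nil
}
```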
I'm working with complex Parquet files that have 500+ "root" columns, where some fields are lists of structs and some of those structs have hundreds of columns. In my tests, 800+ read operations were issued to GCS serially, which makes pqarrow in its current state too slow to be usable.
The proposed change is to process the columns concurrently when retrieving child readers and column readers, and to issue the batch read requests concurrently as well.
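A minimal sketch of the concurrent approach, reusing the hypothetical ColumnReader/getColumnReader declarations from the sketch above and using golang.org/x/sync/errgroup for the fan-out; the actual PR's structure may differ:

```go
package pqarrowsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// getColumnReadersConcurrently fans the per-column reader creation out to
// goroutines so the underlying read requests overlap instead of being
// issued one after another.
func getColumnReadersConcurrently(ctx context.Context, numCols, maxParallel int) ([]ColumnReader, error) {
	readers := make([]ColumnReader, numCols)
	g, gctx := errgroup.WithContext(ctx)
	if maxParallel > 0 {
		// Bound the fan-out so we don't open too many requests at once.
		g.SetLimit(maxParallel)
	}
	for i := 0; i < numCols; i++ {
		i := i // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			r, err := getColumnReader(gctx, i)
			if err != nil {
				return err
			}
			// Each goroutine writes a distinct index, so no extra locking is needed.
			readers[i] = r
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return readers, nil
}
```

Bounding the fan-out keeps the number of simultaneous requests to the object store manageable while still overlapping their latency.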