[ARROW-12683] [C++] Enable fine-grained I/O (coalescing) in IPC reader - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 7.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/28430

Description

ARROW-11772 enables I/O coalescing in the IPC reader, but the reader still operates at the granularity of an entire record batch: even if you load only a few columns, the whole record batch is read. On a high-latency file system (e.g. S3), we may be able to improve performance further by traversing the schema and reading only the buffers we actually need. This can be combined with coalescing to reduce the number of I/O calls.
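To make the idea concrete, here is a minimal sketch in plain Python (not the Arrow implementation) of selecting only the buffers of the requested columns and then coalescing nearby byte ranges into fewer reads. The names `coalesce_ranges`, `ranges_for_columns`, `hole_size_limit` and the `batch_layout` offsets are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], hole_size_limit: int) -> List[Range]:
    """Merge ranges whose gap is at most hole_size_limit bytes."""
    merged: List[Range] = []
    # Sort by offset and drop empty ranges before merging.
    for offset, length in sorted(r for r in ranges if r[1] > 0):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= hole_size_limit:
                # Close enough: extend the previous read instead of adding one.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


def ranges_for_columns(layout: Dict[str, List[Range]], columns: List[str]) -> List[Range]:
    """Collect the buffer ranges of the selected columns only."""
    return [r for name in columns for r in layout[name]]


# Hypothetical layout of one record batch: column name -> buffer ranges.
batch_layout = {
    "a": [(0, 64), (64, 4096)],
    "b": [(4160, 64), (4224, 1_000_000)],
    "c": [(1_004_224, 64), (1_004_288, 4096)],
}

selected = ranges_for_columns(batch_layout, ["a", "c"])
reads = coalesce_ranges(selected, hole_size_limit=8192)
print(reads)
```

With this made-up layout, selecting columns `a` and `c` yields two reads totalling 8,320 bytes instead of one read of the full 1,008,384-byte batch; the large buffer of column `b` is never fetched.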

(There may be a further saving here: instead of traversing the schema for every batch to work out the buffer layout, we could do that once up front and reuse the layout afterwards.)
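A rough sketch of the "compute the layout once" idea, again in plain Python and not taken from the Arrow code base: the schema is walked a single time to record which flattened buffer indices belong to each top-level column, and the result is cached. The `Field` class, the `kind` strings and `column_buffer_slices` are hypothetical; the per-type buffer counts follow the columnar format for the few types modelled here (validity + data for primitives, validity + offsets + data for strings, validity + offsets for lists, validity only for structs) and ignore everything else.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Dict, Tuple


@dataclass(frozen=True)
class Field:
    name: str
    kind: str  # "primitive", "string", "list", or "struct"
    children: Tuple["Field", ...] = ()


# Buffers owned directly by a field of each kind (children are counted separately).
OWN_BUFFERS = {"primitive": 2, "string": 3, "list": 2, "struct": 1}


def buffer_count(f: Field) -> int:
    """Total buffers for a field, including all nested children."""
    return OWN_BUFFERS[f.kind] + sum(buffer_count(c) for c in f.children)


@lru_cache(maxsize=None)
def column_buffer_slices(schema: Tuple[Field, ...]) -> Dict[str, range]:
    """Map each top-level column to its range of flattened buffer indices."""
    slices: Dict[str, range] = {}
    start = 0
    for f in schema:
        n = buffer_count(f)
        slices[f.name] = range(start, start + n)
        start += n
    return slices


schema = (
    Field("id", "primitive"),
    Field("name", "string"),
    Field("tags", "list", (Field("item", "string"),)),
    Field("score", "primitive"),
)

layout = column_buffer_slices(schema)  # schema traversal happens here, once
print(layout["tags"])
```

Per batch, only the byte offsets of those buffer indices would need to be looked up in the batch metadata; the index ranges themselves depend on the schema alone, so repeated calls with the same schema return the cached mapping.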

ArrayLoader already appears to perform this optimization, but it is handed a buffer that is already in memory, so no I/O is actually saved.

Issue Links

is duplicated by
- ARROW-14503 [C++][Dataset] Projection pushdown in IPC (feather) format (Closed)

is related to
- ARROW-13126 Read out only the required columns from a Feather file on Disk (Closed)
- ARROW-14577 [C++] Enable fine grained IO for async IPC reader (Resolved)

relates to
- ARROW-11772 [C++] Add asynchronous read to ipc::RecordBatchFileReader (Resolved)

links to
- GitHub Pull Request #11486

People

Assignee: Yue Ni

Reporter: David Li

Votes: 0

Watchers: 6

Dates

Created: 07/May/21 13:44

Updated: 11/Jan/23 08:28

Resolved: 03/Nov/21 14:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 15.5h