[ARROW-14503] [C++][Dataset] Projection pushdown in IPC (feather) format - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/30060

Description

The datasets API uses the RecordBatchFileReader to read Feather files. This reader always "reads" the entire file. If the file is memory mapped, this might not be a true read; however, the datasets API never uses memory-mapped files.

This large read from RAM (or worse, disk) becomes a bottleneck for simple queries that load only a few columns from the dataset.

One fix may be to modify the reader to seek to and read only the needed data. Another may be to modify the datasets API to use memory-mapped files when possible, although the former approach seems more generally applicable.
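To illustrate the first option, here is a minimal sketch of column projection done with seeks. It uses a toy columnar layout (one contiguous block per column plus a JSON footer of offsets), which is an assumption made for illustration and is not the Arrow IPC format. The names `write_columnar` and `read_columns` are hypothetical and do not exist in Arrow.

```python
import io
import json
import struct


def write_columnar(stream, columns):
    """Write each int64 column as one contiguous block, then a footer.

    Toy layout (NOT Arrow IPC): [col blocks...][JSON footer][uint32 footer length]
    """
    index = {}
    for name, values in columns.items():
        payload = struct.pack(f"<{len(values)}q", *values)
        index[name] = (stream.tell(), len(payload))  # (offset, byte length)
        stream.write(payload)
    footer = json.dumps(index).encode("utf-8")
    stream.write(footer)
    stream.write(struct.pack("<I", len(footer)))


def read_columns(stream, wanted):
    """Read only the requested columns by seeking to their offsets.

    Returns the decoded columns and the number of column bytes read
    (the small footer read is not counted).
    """
    # Locate and parse the footer at the end of the stream.
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-(4 + footer_len), io.SEEK_END)
    index = json.loads(stream.read(footer_len))

    result = {}
    column_bytes = 0
    for name in wanted:
        offset, length = index[name]
        stream.seek(offset)  # skip every column that was not requested
        payload = stream.read(length)
        column_bytes += length
        result[name] = list(struct.unpack(f"<{length // 8}q", payload))
    return result, column_bytes


buf = io.BytesIO()
write_columnar(
    buf,
    {
        "a": list(range(1000)),
        "b": list(range(1000, 2000)),
        "c": list(range(2000, 3000)),
    },
)
projected, column_bytes = read_columns(buf, ["b"])
total_bytes = len(buf.getvalue())
print(f"read {column_bytes} column bytes out of a {total_bytes}-byte file")
```

In this sketch, a query for one of three equally sized columns touches roughly a third of the file. A real reader would face the same trade-off discussed in the linked issue: many small reads can be slower than a few larger ones, which is why coalescing nearby ranges matters.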

This is related to ARROW-8250, but that issue seems more focused on row filtering, while this issue is about column filtering.

Issue Links

duplicates: ARROW-12683 [C++] Enable fine-grained I/O (coalescing) in IPC reader (Resolved)

People

Assignee: Unassigned

Reporter: Weston Pace

Votes: 0

Watchers: 5

Dates

Created: 28/Oct/21 11:55

Updated: 11/Jan/23 08:40

Resolved: 28/Oct/21 18:12
