Apache Arrow / ARROW-14648

[C++][Dataset] Change scanner readahead limits to be based on bytes instead of number of batches


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      In the scanner, readahead is controlled by "batch_readahead" and "fragment_readahead" (both specified in the scan options).  This was mainly motivated by my work with CSV, where the defaults of 32 and 8 cause the scanner to buffer up to ~256MB of data (32 × 8 blocks at the default block size of 1MB).
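      For reference, a minimal sketch of the worst-case arithmetic, assuming the current defaults from ScanOptions and the 1MB CSV default block size (a standalone illustration, not Arrow code):

      {code:cpp}
      #include <iostream>

      int main() {
        const int batch_readahead = 32;        // current ScanOptions default
        const int fragment_readahead = 8;      // current ScanOptions default
        const long long block_size = 1 << 20;  // default CSV block size: 1 MB

        // Worst case: every fragment being read ahead holds a full window
        // of decoded blocks in flight at once.
        const long long worst_case =
            static_cast<long long>(batch_readahead) * fragment_readahead * block_size;
        std::cout << (worst_case >> 20) << " MB\n";  // prints "256 MB"
        return 0;
      }
      {code}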

      For Parquet / IPC this would mean we are buffering up to 256 row groups, which is entirely too high.

      Rather than making users figure out complex parameters, we should have a single readahead limit that is specified in bytes.

      This will be "best effort".  I'm not suggesting we support partial reads of row groups / record batches, so if the limit is set very small we might still end up with more in RAM simply because we can only load entire row groups.
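      To make the intended semantics concrete, a hypothetical sketch follows. The option name (readahead_limit_bytes) and the helper type are invented for illustration; nothing here is an existing Arrow API:

      {code:cpp}
      // Hypothetical best-effort limiter; none of these names exist in Arrow.
      struct ByteReadaheadLimiter {
        long long readahead_limit_bytes;  // the single user-facing knob
        long long buffered_bytes = 0;

        // Best effort: a whole batch / row group is always admitted when
        // nothing is buffered, even if it alone exceeds the limit, because
        // partial reads of row groups / record batches are not supported.
        bool CanReadAhead(long long next_batch_bytes) const {
          return buffered_bytes == 0 ||
                 buffered_bytes + next_batch_bytes <= readahead_limit_bytes;
        }

        void OnBatchBuffered(long long batch_bytes) { buffered_bytes += batch_bytes; }
        void OnBatchDelivered(long long batch_bytes) { buffered_bytes -= batch_bytes; }
      };
      {code}

      Under a scheme like this the scanner would stop issuing new reads once the buffered total crosses the limit, rather than counting batches or fragments.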


People

    Assignee: Unassigned
    Reporter: Weston Pace (westonpace)
    Votes: 0
    Watchers: 5
