Details
- Type: Improvement
- Priority: Major
- Status: Open
- Resolution: Unresolved
Description
In the scanner, readahead is controlled by "batch_readahead" and "fragment_readahead" (both specified in the scan options). These defaults were mainly motivated by my work with CSV; with the defaults of 32 and 8 the scanner will buffer ~256 MB of data (given the default block size of 1 MB).
For Parquet / IPC this means we would be buffering 256 row groups, which is entirely too high.
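A minimal sketch of the arithmetic behind the numbers above, assuming the default block size of 1 MiB mentioned in the description; the variable names mirror the "batch_readahead" and "fragment_readahead" scan options but this is illustration, not the Arrow API:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const int64_t batch_readahead = 32;      // default: batches read ahead per fragment
  const int64_t fragment_readahead = 8;    // default: fragments read ahead
  const int64_t csv_block_size = 1 << 20;  // default CSV block size: 1 MiB

  // Worst-case data buffered by the scanner with the defaults (~256 MiB).
  const int64_t buffered = batch_readahead * fragment_readahead * csv_block_size;
  std::cout << "~" << (buffered >> 20) << " MiB buffered\n";
  return 0;
}
```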
Rather than making users figure out complex parameters, we should have a single readahead limit specified in bytes (see the sketch below).
This would be "best effort". I'm not suggesting we support partial reads of row groups / record batches, so if the limit is set very small we might still end up with more in RAM simply because we can only load entire row groups.
Attachments
Issue Links
- is depended upon by:
  - ARROW-14736 [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails (Open)
  - ARROW-15411 [C++][Datasets] Improve memory usage of datasets (Open)
- is duplicated by:
  - ARROW-12030 [C++] Change dataset readahead to be based on available RAM/CPU instead of fixed constants/options (Closed)
- is related to:
  - ARROW-16294 [C++] Improve performance of parquet readahead (Resolved)