Apache Arrow / ARROW-15410

[C++][Datasets] Improve memory usage of datasets API when scanning parquet


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: C++

    Description

      This is a more targeted fix to improve memory usage when scanning parquet files. It is related to broader issues like ARROW-14648, but those will likely take longer to fix. The goal here is to make it possible to scan large parquet datasets with many files, where each file has reasonably sized row groups (e.g. 1 million rows). Currently we run out of memory scanning a configuration as simple as:

      21 parquet files
      Each parquet file has 10 million rows split into row groups of size 1 million
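
      For context, a scan over such a dataset is typically issued through the C++ Datasets API along the lines of the sketch below. This is a minimal illustration only, not code from this issue: the path /data/parquet_dataset is a placeholder for a layout like the one described above, and the incremental RecordBatchReader loop is just one common way to consume the results.

      // Minimal sketch: scan a directory of parquet files with the Datasets API.
      // The dataset path is a placeholder for a layout like the one described
      // above (21 files, 10 million rows each, 1-million-row row groups).
      #include <iostream>
      #include <memory>

      #include <arrow/api.h>
      #include <arrow/dataset/api.h>
      #include <arrow/filesystem/api.h>

      namespace ds = arrow::dataset;
      namespace fs = arrow::fs;

      arrow::Status ScanDataset() {
        // Discover every parquet file under the base directory.
        auto filesystem = std::make_shared<fs::LocalFileSystem>();
        fs::FileSelector selector;
        selector.base_dir = "/data/parquet_dataset";  // placeholder path
        selector.recursive = true;

        auto format = std::make_shared<ds::ParquetFileFormat>();
        ARROW_ASSIGN_OR_RAISE(
            auto factory,
            ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                               ds::FileSystemFactoryOptions{}));
        ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

        // Build and run the scan.
        ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
        ARROW_RETURN_NOT_OK(scanner_builder->UseThreads(true));
        ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

        // Consume the scan incrementally rather than materializing a full table.
        ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
        int64_t total_rows = 0;
        std::shared_ptr<arrow::RecordBatch> batch;
        while (true) {
          ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
          if (!batch) break;
          total_rows += batch->num_rows();
        }
        std::cout << "Scanned " << total_rows << " rows" << std::endl;
        return arrow::Status::OK();
      }

      int main() {
        arrow::Status st = ScanDataset();
        if (!st.ok()) {
          std::cerr << st.ToString() << std::endl;
          return 1;
        }
        return 0;
      }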

            People

              westonpace Weston Pace
              westonpace Weston Pace
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 5h 10m