[ARROW-13611] [C++] Scanning datasets does not enforce back pressure - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0, 4.0.1, 5.0.0
Fix Version/s: 6.0.0
Component/s: C++
Labels:
- pull-request-available
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/29252

Description

I have a simple test case where I scan the batches of a 4GB dataset and print out the currently used memory:

import pyarrow as pa
import pyarrow.dataset as ds

# Open the CSV dataset (about 4GB on disk)
dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    # Bytes currently allocated from Arrow's default memory pool
    print(pa.total_allocated_bytes())
    num_rows += batch.num_rows

print(num_rows)

In pyarrow 3.0.0 this consumes just over 5 MB. In pyarrow 4.0.0 and 5.0.0 it consumes multiple GB of RAM.
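
For readers unfamiliar with the term, here is a minimal sketch of what back pressure means for a scan like the one above. It is not Arrow's implementation and does not describe the fix in the linked pull request. It uses only the Python standard library, and `scan_with_backpressure` and `max_queued` are hypothetical names chosen for illustration. A reader thread pulls batches from a source and hands them to the consumer through a bounded queue, so the reader blocks once it is `max_queued` batches ahead instead of buffering the whole dataset.

```python
import queue
import threading


def scan_with_backpressure(batches, max_queued=4):
    """Yield items from `batches`, reading ahead by a bounded amount.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    q = queue.Queue(maxsize=max_queued)  # bounded: this is the back pressure
    done = object()  # sentinel marking the end of the source

    def reader():
        for batch in batches:
            # put() blocks while the queue is full, which pauses reading
            # until the consumer has taken a batch out.
            q.put(batch)
        q.put(done)

    # Daemon thread so an abandoned scan does not keep the process alive.
    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    while True:
        item = q.get()
        if item is done:
            break
        yield item
    thread.join()


if __name__ == "__main__":
    # Stand-in for a slow consumer iterating over record batches.
    total = 0
    for value in scan_with_backpressure(iter(range(1000)), max_queued=4):
        total += value
    print(total)
```

With an unbounded queue (`maxsize=0`) the reader would never block, and a fast reader paired with a slow consumer could hold every batch in memory at once, which is the kind of growth this issue reports.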

Issue Links

is depended upon by: ARROW-14191 [C++][Dataset] Dataset writes should respect backpressure (Resolved)
is duplicated by: ARROW-13590 [C++] Ensure dataset writing applies back pressure (Closed)
links to: GitHub Pull Request #11285

People

Assignee: Weston Pace
Reporter: Weston Pace
Votes: 0
Watchers: 4

Dates

Created: 12/Aug/21 03:34
Updated: 11/Jan/23 08:34
Resolved: 12/Oct/21 19:12

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 4h 50m