Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Versions: 4.0.0, 4.0.1, 5.0.0
Description
I have a simple test case that scans the batches of a 4GB dataset and prints the currently allocated memory:
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    print(pa.total_allocated_bytes())
    num_rows += batch.num_rows
print(num_rows)
In pyarrow 3.0.0 this consumes just over 5MB. In pyarrow 4.0.0 and 5.0.0 this consumes multiple GB of RAM.
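For reference, a minimal variant of the repro above that reports a single peak figure instead of per-batch snapshots is sketched below. It only adds a query of the default memory pool's high-water mark via pa.default_memory_pool().max_memory(); the dataset path is the same hypothetical one as above.

import pyarrow as pa
import pyarrow.dataset as ds

# Same hypothetical path as the repro; substitute any multi-GB CSV dataset.
dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
num_rows = 0
for batch in dataset.to_batches():
    num_rows += batch.num_rows
# max_memory() reports the peak bytes allocated from this pool
# (it may return -1 for allocators that do not track a peak),
# so one number captures the regression across versions.
print('peak allocated bytes:', pa.default_memory_pool().max_memory())
print('rows scanned:', num_rows)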
Issue Links
- is depended upon by: ARROW-14191 [C++][Dataset] Dataset writes should respect backpressure (Resolved)
- is duplicated by: ARROW-13590 [C++] Ensure dataset writing applies back pressure (Closed)
- links to