Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 0.15.0
- Environment:
  - Operating system: Windows 10
  - pyarrow installed via conda
  - both Python environments were identical except for pyarrow:
    - python: 3.6.7
    - numpy: 1.17.2
    - pandas: 0.25.1
Description
I upgraded from pyarrow 0.14.1 to 0.15.0, and during testing my Python interpreter ran out of memory.
I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears to leak memory in the new version. See the details below to reproduce the issue.
import numpy as np
import pandas as pd
import pyarrow as pa

# create a table with one nested array column
nested_array = pa.array([np.random.rand(1000) for i in range(500)])
nested_array.type  # ListType(list<item: double>)
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# convert it to a pandas DataFrame in a loop to monitor memory consumption
num_iterations = 10000

# pyarrow v0.14.1: memory allocation does not grow during loop execution
# pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)

# When the table column is not nested, no memory leak is observed
array = pa.array(np.random.rand(500 * 1000))
table = pa.Table.from_arrays(arrays=[array], names=['numbers'])

# no memory leak:
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)
Attachments
Issue Links
- duplicates
  - ARROW-6976: Possible memory leak in pyarrow read_parquet (Closed)