Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
While working on SPARK-13534 and making repeated calls to toPandas, I saw memory usage climb continuously and isolated the growth to the Python side. The following code reproduces the issue, which looks like a memory leak. With the RecordBatchFileReader block commented out and the writer left in place, memory usage is stable, so I believe the issue is in the reader.
import pyarrow as pa
import numpy as np
import memory_profiler
import gc
import io


def leak():
    data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
    table = pa.Table.from_arrays(data, ['foo'])
    while True:
        print('calling to_pandas')
        print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
        df = table.to_pandas()

        batch = pa.RecordBatch.from_pandas(df)
        sink = io.BytesIO()
        writer = pa.RecordBatchFileWriter(sink, batch.schema)
        writer.write_batch(batch)
        writer.close()

        reader = pa.open_file(pa.BufferReader(sink.getvalue()))
        reader.read_all()

        gc.collect()


leak()
Some of the output from the code above:
calling to_pandas
memory_usage: [67.0546875]
calling to_pandas
memory_usage: [143.95703125]
calling to_pandas
memory_usage: [151.58984375]
calling to_pandas
memory_usage: [174.453125]
calling to_pandas
memory_usage: [189.84765625]
calling to_pandas
memory_usage: [212.7109375]
calling to_pandas
memory_usage: [228.046875]
calling to_pandas
memory_usage: [243.109375]
calling to_pandas
memory_usage: [258.4375]
calling to_pandas
memory_usage: [273.83203125]
calling to_pandas
memory_usage: [288.90234375]
calling to_pandas
memory_usage: [304.23046875]
calling to_pandas
memory_usage: [319.625]
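To put a number on the growth, the readings above can be reduced to an average increase per loop iteration. The following is only a rough sketch: the helper name per_iteration_growth and the choice to skip the first reading as warm-up are my own assumptions, not part of the original report. memory_profiler reports values in MiB. On these readings the growth works out to roughly 16 MiB per iteration, while one float64 column of the size built in the reproduction is about 7.6 MiB; the comparison is only suggestive and does not by itself identify what is being retained.

```python
# Readings copied from the output above (MiB, as reported by memory_profiler).
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]


def per_iteration_growth(values, warmup=1):
    """Average increase between consecutive readings, skipping warm-up.

    `warmup` is an assumption: the first reading is taken before any
    conversion has run, so it is excluded from the steady-state estimate.
    """
    steady = values[warmup:]
    return (steady[-1] - steady[0]) / (len(steady) - 1)


growth = per_iteration_growth(readings)

# Size of the single float64 column built in the reproduction:
# 100000 values concatenated 10 times, 8 bytes each, converted to MiB.
column_mib = 100000 * 10 * 8 / 2**20

print('average growth per iteration: {0:.2f} MiB'.format(growth))
print('size of one column:           {0:.2f} MiB'.format(column_mib))
```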