Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
Description
While upgrading our application from pyarrow 0.12.1 to 2.0.0, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not appear to be returned after deleting the table and the DataFrame, as it was in pyarrow 0.12.1.
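One way to tell allocator caching apart from a true leak (a diagnostic sketch added here, not part of the original report) is to compare what Arrow's default memory pool says it has allocated against the process-level numbers that memory_profiler reports:

import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

rows = 2000000
df = pd.DataFrame({"string": ["test"] * rows})
stream = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), stream)

stream.seek(0)
table = pq.read_table(stream)
result = table.to_pandas(strings_to_categorical=True)
# Bytes currently handed out by Arrow's default memory pool.
print("pool allocated:", pa.total_allocated_bytes())

del table
del result
# If this drops back to (near) zero while the process RSS stays high,
# the memory is being held by the pool's allocator rather than leaked
# by Arrow itself.
print("pool allocated after del:", pa.total_allocated_bytes())

memory_profiler only sees process memory, so it cannot make this distinction on its own.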
Sample Code
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
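With memory_profiler installed (pip install memory_profiler), the script runs directly under plain python; the @profile decorator prints one line-by-line report per call to read_file, which is where the three reports in each log below come from.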
Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip)

Logs
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    161.7 MiB    161.7 MiB           1   @profile
    10                                         def read_file(f):
    11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB           1       del table
    14    256.3 MiB     -1.9 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    256.3 MiB    256.3 MiB           1   @profile
    10                                         def read_file(f):
    11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB           1       del table
    14    320.3 MiB     -1.9 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    320.3 MiB    320.3 MiB           1   @profile
    10                                         def read_file(f):
    11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB           1       del table
    14    359.8 MiB     -1.9 MiB           1       del df
Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip)

Logs
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    138.4 MiB    138.4 MiB           1   @profile
    10                                         def read_file(f):
    11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.3 MiB    139.3 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB           1       del table
    14    139.1 MiB    -32.4 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.1 MiB    139.1 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.8 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df
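One experiment that could narrow this down (an addition, not from the original report): jemalloc, the default allocator in pyarrow's Linux builds, holds freed pages for a decay period before returning them to the OS, and pyarrow exposes a knob for that. A sketch; note that builds without jemalloc (for example the Windows wheels, which the C:/ paths above suggest are in use here) will raise on this call:

import pyarrow as pa

try:
    # Ask jemalloc to return dirty pages to the OS immediately rather
    # than holding them for its default decay period.
    pa.jemalloc_set_decay_ms(0)
except Exception:
    # The build does not include jemalloc (e.g. Windows wheels).
    print("jemalloc is not available in this pyarrow build")

If the per-iteration growth disappears with the decay set to 0, the retained memory is allocator caching rather than a leak.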
Attachments
Issue Links
- is related to: ARROW-11009 [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc (Resolved)
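Per the linked ARROW-11009, later pyarrow releases read an ARROW_DEFAULT_MEMORY_POOL environment variable to choose the default allocator. A minimal sketch, assuming a release that includes that change; the variable has to be set before pyarrow is first imported:

import os

# Recognized values include "system", "jemalloc", and "mimalloc".
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"

import pyarrow as pa  # imported only after the variable is set

# Confirm which backend the default pool picked up.
print(pa.default_memory_pool().backend_name)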