Apache Arrow / ARROW-11007

[Python] Memory leak in pq.read_table and table.to_pandas


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      While upgrading our application from pyarrow 0.12.1 to 2.0.0, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not appear to be returned to the OS after deleting the table and DataFrame, as it was in pyarrow 0.12.1.

      Sample Code

      import io
      
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      from memory_profiler import profile
      
      
      @profile
      def read_file(f):
          table = pq.read_table(f)
          df = table.to_pandas(strings_to_categorical=True)
          del table
          del df
      
      
      def main():
          rows = 2000000
          df = pd.DataFrame({
              "string": ["test"] * rows,
              "int": [5] * rows,
              "float": [2.0] * rows,
          })
          table = pa.Table.from_pandas(df, preserve_index=False)
          parquet_stream = io.BytesIO()
          pq.write_table(table, parquet_stream)
      
          for i in range(3):
              parquet_stream.seek(0)
              read_file(parquet_stream)
      
      
      if __name__ == '__main__':
          main()
      
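
      A way to narrow down where the memory is held (a diagnostic sketch, not part of the original report): pyarrow's own pool counter, pa.total_allocated_bytes(), reports bytes currently allocated through Arrow's default memory pool. If that counter returns to its baseline after del but the process RSS reported by memory_profiler does not, the memory is being retained by the allocator (jemalloc/mimalloc) rather than leaked by Arrow objects. The snippet below mirrors the repro but skips the pandas conversion so it only needs pyarrow.

      ```python
      import io

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Build a Parquet payload in memory, mirroring the repro above.
      rows = 200_000
      table = pa.Table.from_pydict({
          "string": ["test"] * rows,
          "int": [5] * rows,
          "float": [2.0] * rows,
      })
      stream = io.BytesIO()
      pq.write_table(table, stream)
      del table

      baseline = pa.total_allocated_bytes()  # Arrow pool bytes before the read
      stream.seek(0)
      t = pq.read_table(stream)
      held = pa.total_allocated_bytes() - baseline    # bytes pinned by the live table
      del t
      leaked = pa.total_allocated_bytes() - baseline  # bytes still pinned after del

      print(f"held while table alive: {held} B, still held after del: {leaked} B")
      ```

      If `leaked` is near zero while RSS stays high, the pool has freed the buffers and the growth seen in memory_profiler is allocator page retention.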

      Logs — Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip)

      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    161.7 MiB    161.7 MiB           1   @profile
          10                                         def read_file(f):
          11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
          12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    258.2 MiB      0.0 MiB           1       del table
          14    256.3 MiB     -1.9 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    256.3 MiB    256.3 MiB           1   @profile
          10                                         def read_file(f):
          11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
          12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    322.2 MiB      0.0 MiB           1       del table
          14    320.3 MiB     -1.9 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    320.3 MiB    320.3 MiB           1   @profile
          10                                         def read_file(f):
          11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
          12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    361.7 MiB      0.0 MiB           1       del table
          14    359.8 MiB     -1.9 MiB           1       del df
      

      Logs — Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip)

      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    138.4 MiB    138.4 MiB           1   @profile
          10                                         def read_file(f):
          11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.7 MiB    -47.5 MiB           1       del table
          14    139.3 MiB    -32.4 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    139.3 MiB    139.3 MiB           1   @profile
          10                                         def read_file(f):
          11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.5 MiB    -47.7 MiB           1       del table
          14    139.1 MiB    -32.4 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    139.1 MiB    139.1 MiB           1   @profile
          10                                         def read_file(f):
          11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.8 MiB    -47.5 MiB           1       del table
          14    139.3 MiB    -32.4 MiB           1       del df
      
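
      If the allocator turns out to be the culprit, two standard pyarrow knobs are worth trying (a workaround sketch, not a confirmed resolution of this issue): switching the default memory pool to the system allocator, and, on builds that bundle jemalloc, disabling jemalloc's page-return delay so freed pages go back to the OS immediately.

      ```python
      import pyarrow as pa

      # Route Arrow allocations through plain malloc/free instead of the
      # bundled jemalloc/mimalloc pool. (Equivalently, set the environment
      # variable ARROW_DEFAULT_MEMORY_POOL=system before importing pyarrow.)
      pa.set_memory_pool(pa.system_memory_pool())
      print(pa.default_memory_pool().backend_name)  # "system"

      # Where jemalloc is available, ask it to return freed pages immediately.
      try:
          pa.jemalloc_set_decay_ms(0)
      except (NotImplementedError, AttributeError):
          pass  # build without jemalloc support, e.g. Windows wheels
      ```

      The system pool is usually slower than jemalloc/mimalloc, so this is better suited to confirming the diagnosis than to production use.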

      Attachments

        1. benchmark-pandas-parquet.py
          3 kB
          Peter Gaultney
        2. Screenshot 2022-08-17 at 11.10.05.png
          175 kB
          Jan Skorepa


      People

        Assignee: westonpace Weston Pace
        Reporter: mpeleshenko Michael Peleshenko
