[ARROW-11855] [C++] [Python] Memory leak in to_pandas when converting chunked struct array - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0.0
Component/s: C++, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/27702

Description

Reproduction from shadowdsp

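import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile

# Make jemalloc return freed memory to the OS immediately, so that
# delayed page release is not mistaken for a leak in the profiler output.
pa.jemalloc_set_decay_ms(0)


@profile
def read_file(f):
    # Read the Parquet data and convert it to pandas, then drop both
    # references; memory should return to the baseline after each call.
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    # Despite its name, the "string" column holds dicts, which
    # pa.Table.from_pandas converts to a struct column.
    df = pd.DataFrame({
        "string": [{"test": [1, 2], "test1": [3, 4]}] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    # Read the same in-memory Parquet data three times and compare the
    # per-line memory usage reported by memory_profiler across passes.
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()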

Issue Links

links to

GitHub Pull Request #9626

People

Assignee: Weston Pace
Reporter: Weston Pace
Votes: 0
Watchers: 5

Dates

Created: 04/Mar/21 01:58
Updated: 11/Jan/23 08:22
Resolved: 09/Mar/21 15:38

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 1h 10m