Apache Arrow / ARROW-17399

pyarrow may use a lot of memory to load a dataframe from parquet


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None
    • Environment: linux

    Description

      When a pandas dataframe is loaded from a parquet file using pyarrow.parquet.read_table, the memory usage may grow to much more than should be needed to hold the dataframe, and the memory is not freed until the dataframe is deleted.

      The problem is evident when the dataframe has a column containing lists or numpy arrays, while it seems absent (or not noticeable) when the column contains only integers or floats.

      I'm attaching a simple script to reproduce the issue, and a graph created with memory-profiler showing the memory usage.

      In this example, the dataframe created with pandas needs around 1.2 GB, but the memory usage after loading it from parquet is around 16 GB.
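
      As a rough sanity check on these figures, the expected footprint can be estimated from the per-row cost: one small ndarray object plus its 8-byte pointer slot in the object-dtype column (a back-of-the-envelope sketch, assuming CPython on a 64-bit build):

      import sys
      import numpy as np

      rows = 5_000_000
      # Per row: one ndarray object (header + 10 int64 values) plus the
      # 8-byte pointer stored in the object-dtype column.
      per_row = sys.getsizeof(np.arange(10)) + 8
      print(f"expected: ~{rows * per_row / 2 ** 30:.2f} GiB")

      This lands around 1 GiB, roughly in line with the ~1.2 GB measured for the dataframe built in pandas, and an order of magnitude below the ~16 GB seen after the parquet round trip.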

      The items of the column are created as numpy arrays rather than lists, to be consistent with the types produced when loading from parquet (pyarrow yields numpy arrays, not lists).
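
      For reference, a minimal round trip (the file name tiny.parquet is just for illustration) confirms the element type that to_pandas() produces for a list column:

      import numpy as np
      import pandas as pd
      import pyarrow
      import pyarrow.parquet

      small = pd.DataFrame({"a": [np.arange(10) for _ in range(3)]})
      pyarrow.parquet.write_table(pyarrow.Table.from_pandas(small), "tiny.parquet")
      loaded = pyarrow.parquet.read_table("tiny.parquet").to_pandas()
      # Elements come back as numpy arrays, not python lists.
      print(type(loaded["a"].iloc[0]))  # <class 'numpy.ndarray'>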

       

      import gc
      import time
      import numpy as np
      import pandas as pd
      import pyarrow
      import pyarrow.parquet
      import psutil
      
      def pyarrow_dump(filename, df, compression="snappy"):
          # Write the dataframe to parquet through an intermediate Arrow table.
          table = pyarrow.Table.from_pandas(df)
          pyarrow.parquet.write_table(table, filename, compression=compression)
      
      def pyarrow_load(filename):
          # Read the parquet file back and convert it to a pandas dataframe.
          table = pyarrow.parquet.read_table(filename)
          return table.to_pandas()
      
      def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
          # start_time and process are intentionally bound once, at definition time.
          # gc.collect()
          current_time = time.monotonic() - start_time
          rss = process.memory_info().rss / 2 ** 20
          print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
      
      if __name__ == "__main__":
          print_mem(0)
      
          rows = 5000000
          # One object-dtype column where every row holds a small numpy array.
          df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
          print_mem(1)
          
          pyarrow_dump("example.parquet", df)
          print_mem(2)
          
          del df
          print_mem(3)
          time.sleep(3)
          print_mem(4)
      
          df = pyarrow_load("example.parquet")
          print_mem(5)
          time.sleep(3)
          print_mem(6)
      
          del df
          print_mem(7)
          time.sleep(3)
          print_mem(8)
      

      Run with memory-profiler:

      mprof run --multiprocess python test_pyarrow.py
      

      Output:

      mprof: Sampling memory every 0.1s
      running new process
        0 time:       0.0 rss:     135.4
        1 time:       4.9 rss:    1252.2
        2 time:       7.1 rss:    1265.0
        3 time:       7.5 rss:     760.2
        4 time:      10.7 rss:     758.9
        5 time:      19.6 rss:   16745.4
        6 time:      22.6 rss:   16335.4
        7 time:      22.9 rss:   15833.0
        8 time:      25.9 rss:     955.0
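
      One way to narrow down where the extra memory lives is to compare the process RSS with what pyarrow's default memory pool still holds after the conversion (a hedged diagnostic sketch, not part of the original report, run against the example.parquet produced by the script above; total_allocated_bytes reports the pool's current usage):

      import psutil
      import pyarrow
      import pyarrow.parquet

      table = pyarrow.parquet.read_table("example.parquet")
      df = table.to_pandas()
      del table
      rss = psutil.Process().memory_info().rss / 2 ** 20
      pool = pyarrow.total_allocated_bytes() / 2 ** 20
      # A high pool figure would mean the Arrow buffers are still referenced
      # (e.g. from the object-dtype column); a low one would point at the
      # Python heap or at memory retained by the allocator.
      print(f"rss: {rss:>10.1f} MiB  arrow pool: {pool:>10.1f} MiB")

      If the allocator is suspected, calling pyarrow.jemalloc_set_decay_ms(0) before loading (available when pyarrow is built with jemalloc, the default on linux) asks jemalloc to return unused pages immediately and makes for a further experiment.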
      

      Attachments

        1. memory-profiler.png (168 kB, attached by Gianluca Ficarelli)


          People

            Assignee: Unassigned
            Reporter: Gianluca Ficarelli (gianluca313)
            Votes: 0
            Watchers: 3
