Apache Arrow / ARROW-17399

pyarrow may use a lot of memory to load a dataframe from parquet


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None
    • Environment: linux

    Description

      When a pandas dataframe is loaded from a parquet file using pyarrow.parquet.read_table, the memory usage may grow to much more than should be needed to hold the dataframe, and the memory is not freed until the dataframe is deleted.

      The problem is evident when the dataframe has a column containing lists or numpy arrays, while it seems absent (or not noticeable) when the column contains only integers or floats.

      I'm attaching a simple script to reproduce the issue, and a graph created with memory-profiler showing the memory usage.

      In this example, the dataframe created with pandas needs around 1.2 GB, but the memory usage after loading it from parquet is around 16 GB.
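
      As a rough sanity check on these figures, the expected footprint can be estimated from the per-row cost: one small ndarray object plus its 8-byte pointer slot in the object-dtype column (a back-of-the-envelope sketch, assuming CPython on a 64-bit build):

      import sys
      import numpy as np

      rows = 5_000_000
      # Per row: one ndarray object (header + 10 int64 values) plus the
      # 8-byte pointer stored in the object-dtype column.
      per_row = sys.getsizeof(np.arange(10)) + 8
      print(f"expected: ~{rows * per_row / 2 ** 30:.2f} GiB")

      This lands around 1 GiB, roughly in line with the ~1.2 GB measured for the dataframe built in pandas, and an order of magnitude below the ~16 GB seen after the parquet round trip.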

      The items of the column are created as numpy arrays rather than lists, to be consistent with the types produced when loading from parquet (pyarrow yields numpy arrays, not lists).
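
      For reference, a minimal round trip (the file name tiny.parquet is just for illustration) confirms the element type that to_pandas() produces for a list column:

      import numpy as np
      import pandas as pd
      import pyarrow
      import pyarrow.parquet

      small = pd.DataFrame({"a": [np.arange(10) for _ in range(3)]})
      pyarrow.parquet.write_table(pyarrow.Table.from_pandas(small), "tiny.parquet")
      loaded = pyarrow.parquet.read_table("tiny.parquet").to_pandas()
      # Elements come back as numpy arrays, not python lists.
      print(type(loaded["a"].iloc[0]))  # <class 'numpy.ndarray'>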

       

      import gc
      import time
      import numpy as np
      import pandas as pd
      import pyarrow
      import pyarrow.parquet
      import psutil
      
      def pyarrow_dump(filename, df, compression="snappy"):
          # Write the dataframe to parquet through an intermediate Arrow table.
          table = pyarrow.Table.from_pandas(df)
          pyarrow.parquet.write_table(table, filename, compression=compression)
      
      def pyarrow_load(filename):
          # Read the parquet file back and convert it to a pandas dataframe.
          table = pyarrow.parquet.read_table(filename)
          return table.to_pandas()
      
      def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
          # start_time and process are intentionally bound once, at definition time.
          # gc.collect()
          current_time = time.monotonic() - start_time
          rss = process.memory_info().rss / 2 ** 20
          print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
      
      if __name__ == "__main__":
          print_mem(0)
      
          rows = 5000000
          # One object-dtype column where every row holds a small numpy array.
          df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
          print_mem(1)
          
          pyarrow_dump("example.parquet", df)
          print_mem(2)
          
          del df
          print_mem(3)
          time.sleep(3)
          print_mem(4)
      
          df = pyarrow_load("example.parquet")
          print_mem(5)
          time.sleep(3)
          print_mem(6)
      
          del df
          print_mem(7)
          time.sleep(3)
          print_mem(8)
      

      Run with memory-profiler:

      mprof run --multiprocess python test_pyarrow.py
      

      Output:

      mprof: Sampling memory every 0.1s
      running new process
        0 time:       0.0 rss:     135.4
        1 time:       4.9 rss:    1252.2
        2 time:       7.1 rss:    1265.0
        3 time:       7.5 rss:     760.2
        4 time:      10.7 rss:     758.9
        5 time:      19.6 rss:   16745.4
        6 time:      22.6 rss:   16335.4
        7 time:      22.9 rss:   15833.0
        8 time:      25.9 rss:     955.0
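
      One way to narrow down where the extra memory lives is to compare the process RSS with what pyarrow's default memory pool still holds after the conversion (a hedged diagnostic sketch, not part of the original report, run against the example.parquet produced by the script above; total_allocated_bytes reports the pool's current usage):

      import psutil
      import pyarrow
      import pyarrow.parquet

      table = pyarrow.parquet.read_table("example.parquet")
      df = table.to_pandas()
      del table
      rss = psutil.Process().memory_info().rss / 2 ** 20
      pool = pyarrow.total_allocated_bytes() / 2 ** 20
      # A high pool figure would mean the Arrow buffers are still referenced
      # (e.g. from the object-dtype column); a low one would point at the
      # Python heap or at memory retained by the allocator.
      print(f"rss: {rss:>10.1f} MiB  arrow pool: {pool:>10.1f} MiB")

      If the allocator is suspected, calling pyarrow.jemalloc_set_decay_ms(0) before loading (available when pyarrow is built with jemalloc, the default on linux) asks jemalloc to return unused pages immediately and makes for a further experiment.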
      

      Attachments

        1. memory-profiler.png (168 kB, attached by Gianluca Ficarelli)


          People

            Assignee: Unassigned
            Reporter: Gianluca Ficarelli (gianluca313)
            Votes: 0
            Watchers: 3
