Apache Arrow / ARROW-11007

[Python] Memory leak in pq.read_table and table.to_pandas


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      While upgrading our application from pyarrow 0.12.1 to 2.0.0, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not appear to be returned to the OS after deleting the table and DataFrame, as it was in pyarrow 0.12.1.

      Sample Code

      import io
      
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      from memory_profiler import profile
      
      
      @profile
      def read_file(f):
          table = pq.read_table(f)
          df = table.to_pandas(strings_to_categorical=True)
          del table
          del df
      
      
      def main():
          rows = 2000000
          df = pd.DataFrame({
              "string": ["test"] * rows,
              "int": [5] * rows,
              "float": [2.0] * rows,
          })
          table = pa.Table.from_pandas(df, preserve_index=False)
          parquet_stream = io.BytesIO()
          pq.write_table(table, parquet_stream)
      
          for i in range(3):
              parquet_stream.seek(0)
              read_file(parquet_stream)
      
      
      if __name__ == '__main__':
          main()
      
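
      A way to narrow down where the memory is held (a diagnostic sketch, not part of the original report): pyarrow's own pool counter, pa.total_allocated_bytes(), reports bytes currently allocated through Arrow's default memory pool. If that counter returns to its baseline after del but the process RSS reported by memory_profiler does not, the memory is being retained by the allocator (jemalloc/mimalloc) rather than leaked by Arrow objects. The snippet below mirrors the repro but skips the pandas conversion so it only needs pyarrow.

      ```python
      import io

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Build a Parquet payload in memory, mirroring the repro above.
      rows = 200_000
      table = pa.Table.from_pydict({
          "string": ["test"] * rows,
          "int": [5] * rows,
          "float": [2.0] * rows,
      })
      stream = io.BytesIO()
      pq.write_table(table, stream)
      del table

      baseline = pa.total_allocated_bytes()  # Arrow pool bytes before the read
      stream.seek(0)
      t = pq.read_table(stream)
      held = pa.total_allocated_bytes() - baseline    # bytes pinned by the live table
      del t
      leaked = pa.total_allocated_bytes() - baseline  # bytes still pinned after del

      print(f"held while table alive: {held} B, still held after del: {leaked} B")
      ```

      If `leaked` is near zero while RSS stays high, the pool has freed the buffers and the growth seen in memory_profiler is allocator page retention.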

      Logs — Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip)

      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    161.7 MiB    161.7 MiB           1   @profile
          10                                         def read_file(f):
          11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
          12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    258.2 MiB      0.0 MiB           1       del table
          14    256.3 MiB     -1.9 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    256.3 MiB    256.3 MiB           1   @profile
          10                                         def read_file(f):
          11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
          12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    322.2 MiB      0.0 MiB           1       del table
          14    320.3 MiB     -1.9 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    320.3 MiB    320.3 MiB           1   @profile
          10                                         def read_file(f):
          11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
          12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    361.7 MiB      0.0 MiB           1       del table
          14    359.8 MiB     -1.9 MiB           1       del df
      

      Logs — Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip)

      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    138.4 MiB    138.4 MiB           1   @profile
          10                                         def read_file(f):
          11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.7 MiB    -47.5 MiB           1       del table
          14    139.3 MiB    -32.4 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    139.3 MiB    139.3 MiB           1   @profile
          10                                         def read_file(f):
          11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.5 MiB    -47.7 MiB           1       del table
          14    139.1 MiB    -32.4 MiB           1       del df
      
      
      Filename: C:/run_pyarrow_memoy_leak_sample.py
      
      Line #    Mem usage    Increment  Occurences   Line Contents
      ============================================================
           9    139.1 MiB    139.1 MiB           1   @profile
          10                                         def read_file(f):
          11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
          12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
          13    171.8 MiB    -47.5 MiB           1       del table
          14    139.3 MiB    -32.4 MiB           1       del df
      
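
      If the allocator turns out to be the culprit, two standard pyarrow knobs are worth trying (a workaround sketch, not a confirmed resolution of this issue): switching the default memory pool to the system allocator, and, on builds that bundle jemalloc, disabling jemalloc's page-return delay so freed pages go back to the OS immediately.

      ```python
      import pyarrow as pa

      # Route Arrow allocations through plain malloc/free instead of the
      # bundled jemalloc/mimalloc pool. (Equivalently, set the environment
      # variable ARROW_DEFAULT_MEMORY_POOL=system before importing pyarrow.)
      pa.set_memory_pool(pa.system_memory_pool())
      print(pa.default_memory_pool().backend_name)  # "system"

      # Where jemalloc is available, ask it to return freed pages immediately.
      try:
          pa.jemalloc_set_decay_ms(0)
      except (NotImplementedError, AttributeError):
          pass  # build without jemalloc support, e.g. Windows wheels
      ```

      The system pool is usually slower than jemalloc/mimalloc, so this is better suited to confirming the diagnosis than to production use.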

      Attachments

        1. benchmark-pandas-parquet.py
          3 kB
          Peter Gaultney
        2. Screenshot 2022-08-17 at 11.10.05.png
          175 kB
          Jan Skorepa


      People

        Assignee: westonpace Weston Pace
        Reporter: mpeleshenko Michael Peleshenko
