Apache Arrow / ARROW-1053

[Python] Memory leak with RecordBatchFileReader

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: Python
    • Labels: None

    Description

      While working on SPARK-13534 and running repeated calls to toPandas, I saw memory usage climb continuously and isolated the problem to the Python side. The following code reproduces the issue, which looks like a memory leak. If I comment out the block that uses RecordBatchFileReader and leave the writer in place, memory usage stays stable, so I believe the issue is in the reader.

      import pyarrow as pa
      import numpy as np
      import memory_profiler
      import gc
      import io
      
      
      def leak():
          data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
          table = pa.Table.from_arrays(data, ['foo'])
          while True:
              print('calling to_pandas')
              print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
              df = table.to_pandas()
      
              batch = pa.RecordBatch.from_pandas(df)
      
              sink = io.BytesIO()
              writer = pa.RecordBatchFileWriter(sink, batch.schema)
              writer.write_batch(batch)
              writer.close()
      
              reader = pa.open_file(pa.BufferReader(sink.getvalue()))
              reader.read_all()
      
              gc.collect()
      
      leak()
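
      The reproduction above loops forever, which is fine for watching the numbers by hand but awkward for a regression check. Below is a minimal sketch of a bounded variant. The helper name `run_bounded` and its parameters are hypothetical and not part of pyarrow or the original report; `step` stands for one write/read round trip and `probe` for any memory reading, such as `memory_profiler.memory_usage()[0]`.

```python
import gc


def run_bounded(step, probe, iterations=20, warmup=2):
    """Call step() a fixed number of times and record growth in probe().

    step   -- zero-argument callable doing one unit of work
              (hypothetical: one writer/reader round trip from the report)
    probe  -- zero-argument callable returning a number describing memory use
    Returns a list with the change in probe() relative to the post-warmup
    baseline after each iteration.
    """
    # Warm-up iterations absorb one-time allocations (imports, caches).
    for _ in range(warmup):
        step()
    gc.collect()
    baseline = probe()

    readings = []
    for _ in range(iterations):
        step()
        gc.collect()  # rule out uncollected Python garbage, as in the report
        readings.append(probe() - baseline)
    return readings


if __name__ == "__main__":
    # Stand-in "leak": a list that grows by one element per step.
    leaked = []
    growth = run_bounded(lambda: leaked.append(0), lambda: len(leaked),
                         iterations=5)
    print(growth)
```

      With a real probe, readings that keep increasing after warm-up would point to a leak, while readings that level off would suggest the allocator is only holding on to freed memory.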
      

      Some of the output from the code above:

      calling to_pandas
      memory_usage: [67.0546875]
      calling to_pandas
      memory_usage: [143.95703125]
      calling to_pandas
      memory_usage: [151.58984375]
      calling to_pandas
      memory_usage: [174.453125]
      calling to_pandas
      memory_usage: [189.84765625]
      calling to_pandas
      memory_usage: [212.7109375]
      calling to_pandas
      memory_usage: [228.046875]
      calling to_pandas
      memory_usage: [243.109375]
      calling to_pandas
      memory_usage: [258.4375]
      calling to_pandas
      memory_usage: [273.83203125]
      calling to_pandas
      memory_usage: [288.90234375]
      calling to_pandas
      memory_usage: [304.23046875]
      calling to_pandas
      memory_usage: [319.625]
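
      To put a number on the growth, the readings above can be differenced. This is a rough sketch that assumes memory_profiler reports values in MiB and that the column holds 1,000,000 float64 values (100000 values concatenated 10 times, 8 bytes each), as in the reproduction code. The variable names are illustrative only.

```python
# Readings copied from the output above (assumed to be MiB).
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]

# Growth between consecutive readings.
deltas = [after - before for before, after in zip(readings, readings[1:])]

# Skip the first delta: it includes one-time allocations from the first pass.
steady = deltas[1:]
mean_growth = sum(steady) / len(steady)

# Size of one column of 1,000,000 float64 values, in MiB.
column_mib = 1_000_000 * 8 / 2**20

print("per-iteration growth (MiB):", [round(d, 2) for d in deltas])
print("mean growth after first pass (MiB):", round(mean_growth, 2))
print("one column (MiB):", round(column_mib, 2))
print("growth / column size:", round(mean_growth / column_mib, 2))
```

      After the first pass, memory grows on every iteration by roughly twice the size of the column. That is consistent with buffers of about that size being retained on each pass, although these numbers alone do not show which buffers they are.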
      

          People

            Assignee: Wes McKinney (wesm)
            Reporter: Bryan Cutler (bryanc)