Apache Arrow / ARROW-1053

[Python] Memory leak with RecordBatchFileReader

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: Python
    • Labels: None

      Description

      While working on SPARK-13534 and running repeated calls to toPandas, I saw memory usage climb continuously and isolated the problem to the Python side. The following code reproduces the issue, which looks like a memory leak. When I comment out the block that uses the RecordBatchFileReader and leave the writer in place, memory usage is stable, so I believe the issue is with the reader.

      import pyarrow as pa
      import numpy as np
      import memory_profiler
      import gc
      import io
      
      
      def leak():
          data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
          table = pa.Table.from_arrays(data, ['foo'])
          while True:
              print('calling to_pandas')
              print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
              df = table.to_pandas()
      
              batch = pa.RecordBatch.from_pandas(df)
      
              sink = io.BytesIO()
              writer = pa.RecordBatchFileWriter(sink, batch.schema)
              writer.write_batch(batch)
              writer.close()
      
              reader = pa.open_file(pa.BufferReader(sink.getvalue()))
              reader.read_all()
      
              gc.collect()
      
      leak()
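
Because the loop above never exits, it may help to bound the run and record growth per iteration. The sketch below is a hypothetical helper: `measure_growth`, `step`, and `sample` are illustrative names, not part of pyarrow or memory_profiler. It assumes the caller supplies a zero-argument callable that returns current memory usage as a number, for example `lambda: memory_profiler.memory_usage()[0]`.

```python
import gc


def measure_growth(step, sample, iterations=10):
    """Run step() repeatedly and return the memory delta seen after each call.

    step   -- zero-argument callable wrapping the code under suspicion
              (hypothetical; e.g. the write/read round trip from leak())
    sample -- zero-argument callable returning current memory usage
    """
    deltas = []
    previous = sample()
    for _ in range(iterations):
        step()
        gc.collect()  # drop unreachable Python objects before sampling
        current = sample()
        deltas.append(current - previous)
        previous = current
    return deltas
```

Deltas that stay positive well past the first few iterations suggest a leak rather than a one-time setup cost.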
      

      Some of the output from the code above:

      calling to_pandas
      memory_usage: [67.0546875]
      calling to_pandas
      memory_usage: [143.95703125]
      calling to_pandas
      memory_usage: [151.58984375]
      calling to_pandas
      memory_usage: [174.453125]
      calling to_pandas
      memory_usage: [189.84765625]
      calling to_pandas
      memory_usage: [212.7109375]
      calling to_pandas
      memory_usage: [228.046875]
      calling to_pandas
      memory_usage: [243.109375]
      calling to_pandas
      memory_usage: [258.4375]
      calling to_pandas
      memory_usage: [273.83203125]
      calling to_pandas
      memory_usage: [288.90234375]
      calling to_pandas
      memory_usage: [304.23046875]
      calling to_pandas
      memory_usage: [319.625]
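
The readings above can be turned into per-iteration growth to make the trend easier to see. This is a small sketch that uses only the numbers printed above (memory_profiler reports MiB). The comparison against the column size is an addition for illustration, and any conclusion about how many copies are retained per pass is an assumption, not something established in this report.

```python
# Readings copied from the output above, in MiB.
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]

# Growth between consecutive readings.
deltas = [after - before for before, after in zip(readings, readings[1:])]

# Size of the single float64 column built in leak():
# 10 * 100000 values at 8 bytes each, converted to MiB.
column_mib = 10 * 100000 * 8 / 2**20

for i, delta in enumerate(deltas, start=1):
    print('iteration {0}: +{1:.2f} MiB ({2:.1f}x column size)'.format(
        i, delta, delta / column_mib))
```

After the first few iterations the growth settles at about 15 MiB per pass, roughly twice the size of the column.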
      

            People

            • Assignee: Wes McKinney (wesmckinn)
            • Reporter: Bryan Cutler (bryanc)
