Apache Arrow / ARROW-6874

[Python] Memory leak in Table.to_pandas() when converting to object dtype

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.1, 0.16.0
    • Component/s: Python
    • Environment:
      Operating system: Windows 10
      pyarrow installed via conda
      Both Python environments were identical except for the pyarrow version:
      python: 3.6.7
      numpy: 1.17.2
      pandas: 0.25.1

    Description

      I upgraded from pyarrow 0.14.1 to 0.15.0, and during testing my Python interpreter ran out of memory.

      I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears to leak memory in 0.15.0 when the table contains a nested (list) column. The script below reproduces the issue.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      
      # create a table with one nested array column
      nested_array = pa.array([np.random.rand(1000) for i in range(500)])
      nested_array.type  # ListType(list<item: double>)
      table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
      
      # convert it to a pandas DataFrame in a loop to monitor memory consumption
      num_iterations = 10000
      # pyarrow v0.14.1: memory usage does not grow during loop execution
      # pyarrow v0.15.0: ~550 MB is added to RAM and never released
      for i in range(num_iterations):
          df = pa.Table.to_pandas(table)
      
      
      # When the table column is not nested, no memory leak is observed
      array = pa.array(np.random.rand(500 * 1000))
      table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
      # no memory leak:
      for i in range(num_iterations):
          df = pa.Table.to_pandas(table)

      Attachments

        1. Screenshot_2020-08-05_10-11-45.png
          157 kB
          jesse ventura

            People

              Assignee: apitrou Antoine Pitrou
              Reporter: mosalx Sergey Mozharov
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h