Apache Arrow / ARROW-6874

[Python] Memory leak in Table.to_pandas() when conversion to object dtype


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 1.0.0, 0.15.1
    • Component/s: Python
    • Environment:
      Operating system: Windows 10
      pyarrow installed via conda
      both python environments were identical except pyarrow:
      python: 3.6.7
      numpy: 1.17.2
      pandas: 0.25.1

      Description

      After upgrading from pyarrow 0.14.1 to 0.15.0, my Python interpreter ran out of memory during testing.

      I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears to leak memory in 0.15.0. The script below reproduces the issue.

       

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      
      # create a table with one nested array column
      nested_array = pa.array([np.random.rand(1000) for i in range(500)])
      nested_array.type  # ListType(list<item: double>)
      table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
      
      # convert it to a pandas DataFrame in a loop to monitor memory consumption
      num_iterations = 10000
      # pyarrow v0.14.1: memory usage does not grow during loop execution
      # pyarrow v0.15.0: ~550 MB is added to RAM and never garbage collected
      for _ in range(num_iterations):
          df = table.to_pandas()
      
      
      # When the table column is not nested, no memory leak is observed
      array = pa.array(np.random.rand(500 * 1000))
      table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
      # no memory leak:
      for _ in range(num_iterations):
          df = table.to_pandas()
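
One way to put a number on the growth described above, instead of watching RAM by eye, is to compare traced memory before and after a batch of calls. The following is only a sketch using the standard-library tracemalloc module. The helper name measure_growth and the leaky/clean stand-in functions are hypothetical and not part of the original report. tracemalloc sees only allocations made through Python's allocators (NumPy also reports its array buffers to it), so memory held by Arrow's own C++ memory pool would not show up here and process-level RSS should be checked as well.

```python
import gc
import tracemalloc


def measure_growth(fn, iterations=100):
    """Return the net growth, in bytes, of traced memory after calling fn repeatedly."""
    tracemalloc.start()
    try:
        fn()  # warm-up call so one-time caches are not counted as growth
        gc.collect()
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            fn()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical stand-ins for a leaking and a non-leaking conversion
leaked = []


def leaky():
    leaked.append(bytearray(10_000))  # the reference is kept forever


def clean():
    bytearray(10_000)  # freed as soon as the call returns


print("leaky growth:", measure_growth(leaky))
print("clean growth:", measure_growth(clean))
```

With pyarrow installed, the same helper could be pointed at the conversion from the report, for example measure_growth(lambda: table.to_pandas(), iterations=1000), once for the nested table and once for the flat one. Growth that scales with the iteration count suggests objects are being retained between calls.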

              People

              • Assignee: Antoine Pitrou (apitrou)
              • Reporter: Sergey Mozharov (mosalx)

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1.5h