Apache Arrow / ARROW-6874

[Python] Memory leak in Table.to_pandas() when conversion to object dtype


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.15.0
    • Fix Version/s: 0.15.1, 0.16.0
    • Component/s: Python
    • Operating system: Windows 10
      pyarrow installed via conda
      both Python environments (pyarrow 0.14.1 and 0.15.0) were identical except for pyarrow:
      python: 3.6.7
      numpy: 1.17.2
      pandas: 0.25.1

    Description

      I upgraded from pyarrow 0.14.1 to 0.15.0, and during some testing my Python interpreter ran out of memory.

      I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears to leak memory in the latest version. The script below reproduces the issue.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      
      # create a table with one nested array column
      nested_array = pa.array([np.random.rand(1000) for i in range(500)])
      nested_array.type  # ListType(list<item: double>)
      table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
      
      # convert it to a pandas DataFrame in a loop to monitor memory consumption
      num_iterations = 10000
      # pyarrow v0.14.1: Memory allocation does not grow during loop execution
      # pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
      for i in range(num_iterations):
          df = pa.Table.to_pandas(table)
      
      
      # When the table column is not nested, no memory leak is observed
      array = pa.array(np.random.rand(500 * 1000))
      table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
      # no memory leak:
      for i in range(num_iterations):
          df = pa.Table.to_pandas(table)
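
The comparison above relies on watching process memory by hand. One way to get a number instead is a small helper built on the standard-library tracemalloc module. This is only a sketch: `traced_growth`, `leaky`, and `clean` are hypothetical names that do not appear in the report, and the two stand-in functions merely imitate a leaking and a non-leaking call. tracemalloc only sees allocations reported through Python's allocators, so memory held by Arrow's own C++ memory pool would not show up here (`pa.total_allocated_bytes()` covers that side).

```python
import gc
import tracemalloc


def traced_growth(func, iterations=100):
    """Return the net number of traced bytes still allocated after
    calling func() `iterations` times (hypothetical helper)."""
    gc.collect()
    tracemalloc.start()
    try:
        func()  # warm-up call so one-time caches are not counted as growth
        gc.collect()
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        gc.collect()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical stand-ins for the two loops in the report.
_kept = []


def leaky():
    # Keeps a reference forever, so the memory can never be reclaimed.
    _kept.append(bytearray(100_000))


def clean():
    # The buffer is released as soon as the call returns.
    bytearray(100_000)
```

To apply this to the report, one would pass something like `lambda: table.to_pandas()` as `func` for each of the two tables. Under the behaviour described above, the nested-column table would be expected to show growth proportional to the iteration count on 0.15.0, and the flat table would not.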


          People

            Assignee: Antoine Pitrou (apitrou)
            Reporter: Sergey Mozharov (mosalx)
            Votes: 0
            Watchers: 9


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1.5h
