Apache Arrow / ARROW-1017

Python: Table.to_pandas leaks memory

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0
    • Fix Version/s: 0.4.0
    • Component/s: Python
    • Labels: None

    Description

      Running the following code results in ever-increasing memory usage, even though I would expect the DataFrame to be garbage collected when it goes out of scope. With my Parquet file, usage grows by about 3 GB per loop iteration:

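      from pyarrow import HdfsClient

      def read_parquet_file(client, parquet_file):
          # Read the Parquet file into an Arrow table, then convert it to pandas.
          parquet = client.read_parquet(parquet_file)
          # df is local, so it should be freed when the function returns.
          df = parquet.to_pandas()

      client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
      parquet_file = '/my/parquet/file'
      # Memory usage grows on every iteration.
      while True:
          read_parquet_file(client, parquet_file)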

      Is there a reference count issue similar to ARROW-362?
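
      To put a number on the growth per iteration, something like the following could be used. It is only a minimal sketch: the helper names `peak_rss` and `measure_growth` are hypothetical and not part of pyarrow, and it relies on the Unix-only standard-library `resource` module, whose `ru_maxrss` value is in kilobytes on Linux and bytes on macOS. Because it reads the process's peak resident set size, it can show growth but never a decrease.

```python
import gc
import resource


def peak_rss():
    """Return the process's peak resident set size as reported by getrusage.

    Units are platform-dependent: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(fn, iterations=5):
    """Call fn() repeatedly and record how much peak RSS grew on each call."""
    deltas = []
    for _ in range(iterations):
        before = peak_rss()
        fn()
        gc.collect()  # rule out Python garbage that is merely awaiting collection
        deltas.append(peak_rss() - before)
    return deltas
```

      Calling `measure_growth(lambda: read_parquet_file(client, parquet_file))` would then return one delta per iteration. If memory were being released, I would expect the deltas to drop to roughly zero after the first call or two; a steady positive delta on every call would be consistent with the leak described here.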

            People

              Assignee: wesm (Wes McKinney)
              Reporter: jporritt (James Porritt)
              Votes: 0
              Watchers: 4
