[ARROW-1017] Python: Table.to_pandas leaks memory - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3.0
Fix Version/s: 0.4.0
Component/s: Python
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/16611

Description

Running the following code results in ever-increasing memory usage, even though I would expect the dataframe to be garbage collected when it goes out of scope. For the size of my Parquet file, I see the usage increase by about 3 GB per loop iteration:

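from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    # Read the Parquet file into an Arrow table, then convert it to pandas.
    parquet = client.read_parquet(parquet_file)
    # df is local, so it should be collected once the function returns.
    df = parquet.to_pandas()

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'
# Memory usage grows on every iteration of this loop.
while True:
    read_parquet_file(client, parquet_file)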

Is there a reference count issue similar to ARROW-362?
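
One way to put a number on the per-loop growth described above is to record the process's peak resident set size after each call. The following is only a minimal sketch, not part of the original report: `peak_rss` and `rss_growth` are hypothetical helper names, and it assumes a Unix-like system where the standard-library `resource` module is available. `ru_maxrss` is a high-water mark, reported in kilobytes on Linux and in bytes on macOS, so the deltas show growth in the peak rather than the current usage.

```python
import gc
import resource


def peak_rss():
    """Return the peak resident set size of this process (platform units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth(fn, iterations=5):
    """Call fn repeatedly and return the change in peak RSS between calls.

    Hypothetical helper: a steadily positive delta on every iteration
    suggests memory that is not being released between calls.
    """
    readings = []
    for _ in range(iterations):
        fn()
        gc.collect()  # rule out garbage that is merely awaiting collection
        readings.append(peak_rss())
    # Difference between consecutive readings.
    return [later - earlier for earlier, later in zip(readings, readings[1:])]
```

Under these assumptions, something like `rss_growth(lambda: read_parquet_file(client, parquet_file))` would replace the infinite loop and return one delta per pair of consecutive iterations.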

Issue Links

blocks: ARROW-1014 0.4.0 release (Resolved)

People

Assignee: Wes McKinney

Reporter: James Porritt

Votes: 0

Watchers: 4

Dates

Created: 12/May/17 17:12

Updated: 11/Jan/23 07:12

Resolved: 14/May/17 18:35