  Apache Arrow / ARROW-16028

Memory leak in `fragment.to_table`


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      This "pseudo" code ends with OOM.

       

      import fsspec
      import pyarrow
      import pyarrow.parquet as pq
      
      # fsspec S3 filesystem with read caching disabled
      fs = fsspec.filesystem(
          "s3",
          default_cache_type="none",
          default_fill_cache=False,
          **our_storage_options,
      )
      
      # partitioned Parquet dataset in the bucket (new, non-legacy API)
      dataset = pq.ParquetDataset(
          "path in bucket",
          filesystem=fs,
          filters=some_filters,
          use_legacy_dataset=False,
      )
      
      # this ends with OOM
      table = dataset.read(columns=columns_to_read)
      
      # and this too: reading fragment by fragment and concatenating
      tables = []
      for fragment in dataset.fragments:
          tables.append(fragment.to_table(columns=columns_to_read))
      all_data = pyarrow.concat_tables(tables)
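
      A batch-streaming variant keeps the peak memory lower in principle, although we have not verified whether it avoids the leak. This is only a sketch, not our real code, and the per-batch work is a placeholder:

      # Stream record batches from each fragment instead of materializing
      # whole fragments as tables, so only one batch is resident at a time.
      total_rows = 0
      for fragment in dataset.fragments:
          for batch in fragment.to_batches(columns=columns_to_read):
              total_rows += batch.num_rows  # placeholder for real per-batch work
      print("rows processed:", total_rows)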

      What is really strange is that if we put a breakpoint in the loop and load just one fragment, the fragment loads fine, but something keeps eating memory after the load until there is none left.
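
      For reference, this is roughly how we observe it at the breakpoint. The sketch below is illustrative rather than taken from our code, and psutil is used here only to monitor resident memory:

      import gc
      import time
      import psutil
      
      proc = psutil.Process()
      
      # Load a single fragment, as we do at the debug point.
      fragment = dataset.fragments[0]
      table = fragment.to_table(columns=columns_to_read)
      print("loaded", table.num_rows, "rows")
      
      # Even with no further reads, resident memory keeps climbing.
      for _ in range(10):
          gc.collect()
          print("rss MB:", proc.memory_info().rss // (1024 * 1024))
          time.sleep(5)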

      We are trying to read a Parquet table that has several files under the desired partitions. Each fragment has tens of columns and tens of millions of rows.
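
      To help isolate whether the fsspec layer is involved, an equivalent read through pyarrow's native S3 filesystem and dataset API would look roughly like this (the bucket path, region, credentials and filter handling are placeholders, not our real values):

      import pyarrow.dataset as pads
      from pyarrow import fs
      
      # Native S3 filesystem instead of fsspec; region/credentials are placeholders.
      s3 = fs.S3FileSystem(region="us-east-1")
      
      native_dataset = pads.dataset(
          "bucket/path",        # placeholder path in the bucket
          filesystem=s3,
          format="parquet",
          partitioning="hive",
      )
      
      # Note: to_table(filter=...) takes a dataset expression,
      # e.g. pads.field("partition_col") == "value", not the DNF tuple filters.
      table = native_dataset.to_table(columns=columns_to_read)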

       


          People

            Assignee: Unassigned
            Reporter: ondrej metelka (ometelka)
            Votes: 0
            Watchers: 3
