Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.0.1
Fix Version/s: None
Component/s: None
Description
The following (pseudo) code ends with an OOM:
import fsspec
import pyarrow
import pyarrow.parquet as pq

fs = fsspec.filesystem(
    "s3",
    default_cache_type="none",
    default_fill_cache=False,
    **our_storage_options,
)

dataset = pq.ParquetDataset(
    "path in bucket",
    filesystem=fs,
    filters=some_filters,
    use_legacy_dataset=False,
)

# this ends with OOM
dataset.read(columns=columns_to_read)

# and this too
tables = []
for fragment in dataset.fragments:
    tables.append(fragment.to_table(columns=columns_to_read))
all_data = pyarrow.lib.concat_tables(tables)
What is really strange: if we set a breakpoint in the loop and load just a single fragment, the fragment loads, but something keeps consuming memory after the load until none is left.
We are trying to read a Parquet table that has several files under the desired partitions. Each fragment has tens of columns and tens of millions of rows.