[ARROW-5993] [Python] Reading a dictionary column from Parquet results in disproportionate memory usage - ASF JIRA

I'm using pyarrow to read a 40MB parquet file.

When reading all of the columns except the "body" column, the process's memory usage peaks at 170 MB.

Reading only the "body" column uses over 6 GB of memory.

I made the file publicly accessible: s3://dhavivresearch/pyarrow/demofile.parquet
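
For anyone trying to reproduce this, here is a rough sketch of how the two reads could be compared. It assumes the file has been downloaded to a local path first, and the helper names (peak_rss_mb, read_and_report) are made up for illustration and are not part of pyarrow. It is not necessarily how the numbers above were measured.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def read_and_report(path, columns):
    """Read the given columns and print the peak memory seen so far."""
    # Imported here so peak_rss_mb() stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    table = pq.read_table(path, columns=columns)
    print(f"columns={columns} rows={table.num_rows} peak_rss={peak_rss_mb():.0f} MB")
    return table
```

For example, read_and_report("demofile.parquet", ["body"]) in a fresh interpreter, where "demofile.parquet" is an assumed local copy of the file. Peak RSS never goes down within a process, so each case (only "body", everything except "body") should be run in its own process to get comparable numbers.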

duplicates

ARROW-6060 [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

is caused by

ARROW-6060 [Python] too large memory cost using pyarrow.parquet.read_table with use_threads=True

relates to

ARROW-3772 [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray