ARROW-16982: Slow reading of partitioned parquet files from S3


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 8.0.0
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      When reading partitioned files from S3 and using filters to select partitions, the reader sends S3 LIST requests every time read_table() is called.

      # partitioning: s3://bucket/year=xxxx/month=y/day=z
      
      from pyarrow import parquet
      parquet.read_table('s3://bucket', filters=[('day', '=', 1)]) # lists s3 bucket
      parquet.read_table('s3://bucket', filters=[('day', '=', 2)]) # lists again

      This is not a problem if done once, but repeated calls to select different partitions lead to a large number of (slow and potentially expensive) S3 LIST requests.

      The current workaround is to list and filter the partition structure manually (see the sketch below); however, this is not nearly as convenient as using filters.
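
      A minimal sketch of that manual workaround, assuming hive-style year=/month=/day= prefixes; the bucket name 'bucket' and the day=1 value are placeholders. The recursive LIST is paid once up front, and the reads that follow use only GET requests:

      from pyarrow import fs, parquet

      s3 = fs.S3FileSystem()

      # One recursive listing of the whole bucket (the slow, expensive part).
      infos = s3.get_file_info(fs.FileSelector('bucket', recursive=True))
      files = [info.path for info in infos if info.type == fs.FileType.File]

      # Select a partition by matching path components; no further LIST requests.
      day1_files = [path for path in files if '/day=1/' in path]
      table = parquet.ParquetDataset(day1_files, filesystem=s3).read()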

      If we know that the S3 prefixes have not changed, it should be possible to do the recursive listing only once and load different data multiple times (using only S3 GET requests). I suppose this should be possible with ParquetDataset; however, the current implementation only accepts filters in the constructor and not in the read() method.
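
      For comparison, the pyarrow.dataset API already separates discovery from scanning: the recursive LIST happens once when the dataset is constructed, and each filtered to_table() call should only issue GETs for the matching fragments. A sketch, again with a placeholder bucket:

      import pyarrow.dataset as ds

      # Discovery (the recursive LIST) happens once, at construction time.
      dataset = ds.dataset('s3://bucket', format='parquet', partitioning='hive')

      # Each scan reuses the discovered fragments; only GET requests follow.
      day1 = dataset.to_table(filter=ds.field('day') == 1)
      day2 = dataset.to_table(filter=ds.field('day') == 2)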

      Attachments

      Activity

      People

          Assignee: Unassigned
          Reporter: Blaž Zupančič (bzupancic)
          Votes: 0
          Watchers: 2
