Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 4.0.0
- Fix Version/s: None
- Component/s: None
Description
Reading a single partition of a partitioned Parquet dataset via the `filters` argument is significantly slower than reading the partition's directory directly.

```python
import pandas as pd
import pyarrow.parquet

size = 100_000
df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
df.to_parquet('test.parquet', partition_cols=['a'])

%timeit pyarrow.parquet.read_table('test.parquet/a=1')
%timeit pyarrow.parquet.read_table('test.parquet', filters=[('a', '=', 1)])
```
gives the timings

```
2.57 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.18 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Likewise, changing `size` to `1_000_000` in the code above gives

```
16.3 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
The documentation for `read_table` states, in part:
> Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows.
From this, I expected the performance to be roughly the same.