Apache Arrow / ARROW-13369

[C++][Python] performance of read_table using filters on a partitioned parquet file


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels: None

    Description

      Reading a single partition of a partitioned parquet dataset via the filters argument of read_table is significantly slower than reading that partition's directory directly.

      import pandas as pd
      import pyarrow.parquet

      size = 100_000
      df = pd.DataFrame({'a': [1, 2, 3] * size, 'b': [4, 5, 6] * size})
      # write a hive-partitioned dataset: test.parquet/a=1/, a=2/, a=3/
      df.to_parquet('test.parquet', partition_cols=['a'])

      # read one partition directory directly vs. selecting it with a filter
      %timeit pyarrow.parquet.read_table('test.parquet/a=1')
      %timeit pyarrow.parquet.read_table('test.parquet', filters=[('a', '=', 1)])
      

      gives the timings

      2.57 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      5.18 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 

      Likewise, changing size to 1_000_000 in the above code gives

      16.3 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
      32.7 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
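
      To check whether the extra time comes from files outside the selected partition being read at all, the fragments chosen for the filter can be listed through the pyarrow.dataset API. A minimal sketch, not part of the original report, assuming the hive-style directory layout that to_parquet writes:

      import pyarrow.dataset as ds

      dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
      # Fragments whose partition expression cannot match the filter are pruned,
      # so this should list only the file(s) under test.parquet/a=1/
      fragments = list(dataset.get_fragments(filter=ds.field('a') == 1))
      print([frag.path for frag in fragments])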

      Part of the docs for read_table states:

      > Partition keys embedded in a nested directory structure will be exploited to avoid loading files at all if they contain no matching rows.

      From this, I expected the performance to be roughly the same. 
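
      For reference, filtered reads on the non-legacy path go through the pyarrow.dataset layer, so a roughly equivalent filtered read written directly against that API (again assuming hive partitioning; this snippet is illustrative, not from the original report) is:

      import pyarrow.dataset as ds

      dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
      # roughly equivalent to read_table('test.parquet', filters=[('a', '=', 1)])
      # on the dataset-based code path
      table = dataset.to_table(filter=ds.field('a') == 1)

      Timing this against the direct partition read may help narrow down whether the overhead sits in dataset discovery or in the filtered scan itself.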

    Attachments

    Activity

    People

      Assignee: Unassigned
      Reporter: Richard Shadrach
      Votes: 0
      Watchers: 2
