Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.0.0, 8.0.0, 9.0.0
None
None
Environment: python 3.9, ubuntu 20.04
Description
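Reading a Feather file with an explicit columns list via feather.read_table is drastically slower on Arrow 7.0.0-9.0.0 than it was on 6.0.0. In the benchmarks below, the call takes about 33.8 ms on 9.0.0 versus 224 µs on 6.0.0 (roughly 150x slower), while reading the same file without the columns argument takes about the same time on both versions.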
Profiling the code below shows that the bottleneck is somewhere in the read_names function of pyarrow._feather.FeatherReader.
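The report does not say which profiler was used. One way to get a comparable picture is the standard-library cProfile module; the sketch below is only an assumption about how such a profile could be collected, and the helper name profile_call is hypothetical. Whether read_names shows up as its own entry depends on how the pyarrow extension module was built.

```python
import cProfile
import io
import pstats


def profile_call(fn, top=10):
    """Run fn() under cProfile and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)

    # Render the statistics into a string instead of printing to stdout.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


# Hypothetical usage against the file written by the setup code in this report:
#
#   from pyarrow import feather
#   cols = [f"c{c}" for c in range(10)]
#   print(profile_call(
#       lambda: feather.read_table("test.feather", columns=cols, memory_map=True)
#   ))
```

The helper takes any zero-argument callable, so the same function can profile the read with and without the columns argument for a side-by-side comparison.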
Example
Setup code:
import pandas as pd
from pyarrow import feather

rows, cols = (1_000_000, 10)
data = {f'c{c}': range(rows) for c in range(cols)}
df = pd.DataFrame(data=data)
feather.write_feather(df, 'test.feather', compression="uncompressed")
Benchmarks Arrow 9.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Benchmarks Arrow 6.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
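The %timeit lines above only run inside IPython. For anyone who wants to repeat the comparison from a plain script, for example across several installed Arrow versions, a rough equivalent using the standard-library timeit module might look like the sketch below. The helper names best_time and compare_reads are hypothetical, and the sketch reports the best run rather than the mean ± std. dev. that %timeit prints, so the numbers will not match the ones above exactly.

```python
import timeit


def best_time(fn, repeat=7, number=10):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number


def compare_reads(path, columns):
    """Time feather.read_table with and without an explicit column list."""
    # Imported lazily so the timing helper above has no pyarrow dependency.
    from pyarrow import feather

    full = best_time(lambda: feather.read_table(path, memory_map=True))
    selected = best_time(
        lambda: feather.read_table(path, columns=columns, memory_map=True)
    )
    # A ratio far above 1 means selecting columns is slower than a full read.
    return {"full": full, "selected": selected, "ratio": selected / full}


# Hypothetical usage, assuming test.feather was written by the setup code:
#
#   print(compare_reads("test.feather", [f"c{c}" for c in range(10)]))
```

Going by the timings reported above, the ratio would be expected to come out near 1.3 on Arrow 6.0.0 (224 µs / 173 µs) and near 190 on 9.0.0 (33.8 ms / 178 µs).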
Issue Links
- is related to ARROW-18113 [C++] Implement a read range process without caching (Resolved)