[ARROW-17913] feather.read_table 150x slower when reading columns in newer versions - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.0.0, 8.0.0, 9.0.0
Fix Version/s: None
Component/s: None
Labels:
- feather
- performance
Environment:
python 3.9, ubuntu 20.04

External issue URL:
https://github.com/apache/arrow/issues/33123
Language:
- Python

Description

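Passing an explicit columns list to feather.read_table is drastically slower on Arrow 7.0.0 through 9.0.0 than it was on 6.0.0: roughly 150x in the benchmarks below (33.8 ms vs. 224 µs). Reading the same file without the columns argument is equally fast on both versions.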

Profiling the code below shows that the bottleneck is somewhere in the read_names method of pyarrow._feather.FeatherReader.
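For anyone who wants to repeat the profiling step, here is a minimal sketch using the standard-library cProfile module. The helper names profile_call and profile_feather_read are hypothetical (they are not part of pyarrow), and the sketch assumes test.feather has already been written by the setup code in this report. Cython methods such as read_names usually appear in the output as built-in method entries rather than as Python functions.

```python
import cProfile
import io
import pstats


def profile_call(fn, top=10):
    """Run fn() under cProfile and return the top entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


def profile_feather_read(path="test.feather", columns=None):
    """Profile one feather.read_table call (hypothetical helper).

    Assumes `path` exists; pass columns=[...] to hit the slow path
    described in this report, or leave it as None for the fast path.
    """
    # Imported here so profile_call stays usable without pyarrow installed.
    from pyarrow import feather

    return profile_call(
        lambda: feather.read_table(path, columns=columns, memory_map=True)
    )
```

Calling print(profile_feather_read(columns=["c0", "c1"])) after running the setup code should list where the time goes; comparing that with print(profile_feather_read()) shows the difference between the two code paths.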

Example

Setup code:

import pandas as pd
from pyarrow import feather

rows, cols = (1_000_000, 10)
data = {f'c{c}': range(rows) for c in range(cols)}
df = pd.DataFrame(data=data)

feather.write_feather(df, 'test.feather', compression="uncompressed")

Benchmarks Arrow 9.0.0:

%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)

> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Benchmarks Arrow 6.0.0:

%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)

> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
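The %timeit lines above only work inside IPython. A rough equivalent for a plain Python script, using the standard-library timeit module, might look like the sketch below. The helper names best_time and compare_reads are hypothetical, the file test.feather is assumed to exist from the setup code, and the helper reports the best per-call time rather than the mean and standard deviation that %timeit prints, so the numbers will not match exactly.

```python
import timeit


def best_time(fn, number=10, repeat=5):
    """Return the best per-call wall time in seconds.

    fn is called `number` times per run, for `repeat` runs;
    the fastest run is divided by `number`.
    """
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number


def compare_reads(path="test.feather", columns=None):
    """Time read_table with and without an explicit columns list (hypothetical helper)."""
    # Imported here so best_time stays usable without pyarrow installed.
    from pyarrow import feather

    if columns is None:
        # Default to every column in the file, mirroring list(df.columns) above.
        columns = feather.read_table(path, memory_map=True).column_names

    full = best_time(lambda: feather.read_table(path, memory_map=True))
    selected = best_time(
        lambda: feather.read_table(path, columns=columns, memory_map=True)
    )
    return {
        "all_columns_s": full,
        "explicit_columns_s": selected,
        "ratio": selected / full,
    }
```

Running print(compare_reads()) under each installed pyarrow version gives a single ratio to compare across versions.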

Issue Links

is related to

ARROW-18113 [C++] Implement a read range process without caching (Resolved)

People

Assignee: Unassigned

Reporter: Håkon Magne Holmen

Votes: 0

Watchers: 7

Dates

Created: 02/Oct/22 22:04

Updated: 11/Jan/23 11:57