[ARROW-9924] [Python] Performance regression reading individual Parquet files using Dataset interface - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/25955

Description

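I haven't investigated very deeply, but this seems symptomatic of a problem. Reading a single file through the default Dataset-based path takes about 20% longer than with use_legacy_dataset=True: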

In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})

In [28]: pq.write_table(pa.table(df), 'test.parquet')

In [29]: timeit pq.read_table('test.parquet')
79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Issue Links

fixes

ARROW-9983 [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API

Resolved

is related to

ARROW-9983 [C++][Dataset][Python] Use larger default batch size than 32K for Datasets API

Resolved

links to

GitHub Pull Request #8188

People

Assignee: Ben Kietzman
Reporter: Wes McKinney
Votes: 0
Watchers: 9

Dates

Created: 06/Sep/20 22:05
Updated: 11/Jan/23 08:09
Resolved: 29/Sep/20 19:29

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 4h 10m