Apache Arrow / ARROW-11469

[Python] Performance degradation when reading wide dataframes from Parquet

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
    • Fix Versions: None
    • Components: Python
    • Labels: None

    Description

I noticed a significant performance regression in version 1.0.0 and later when loading wide dataframes.

For example, it should be reproducible as follows:

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      df = pd.DataFrame(np.random.rand(100, 10000))
      table = pa.Table.from_pandas(df)
      pq.write_table(table, "temp.parquet")
      
      %timeit pd.read_parquet("temp.parquet")

In version 0.17.0 this takes about 300–400 ms; in 1.0.0 and every later release it suddenly takes around 2 seconds.

      Thanks for looking into this.
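      The `%timeit` line above only works inside IPython. A minimal standalone version of the same benchmark is sketched below; the `bench` helper and the temporary-file path are my own additions, and pandas, numpy, and pyarrow are assumed to be installed:

      ```python
      import tempfile
      import time
      from pathlib import Path

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      def bench(fn, repeats=3):
          """Return the best wall-clock time in seconds over `repeats` runs."""
          best = float("inf")
          for _ in range(repeats):
              start = time.perf_counter()
              fn()
              best = min(best, time.perf_counter() - start)
          return best

      # 100 rows x 10,000 columns, as in the report above.
      df = pd.DataFrame(np.random.rand(100, 10000))
      path = Path(tempfile.mkdtemp()) / "temp.parquet"
      pq.write_table(pa.Table.from_pandas(df), path)

      t_read = bench(lambda: pd.read_parquet(path))
      print(f"pd.read_parquet: {t_read:.3f} s")
      ```

      Taking the minimum over a few runs filters out cold-cache noise, which matters when comparing a sub-second 0.17.0 timing against the multi-second 1.0.0+ timing.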

      Attachments

        1. image-2021-05-03-14-40-09-520.png
          130 kB
          Elena Henderson
        2. image-2021-05-03-14-39-59-485.png
          327 kB
          Elena Henderson
        3. image-2021-05-03-14-31-41-260.png
          298 kB
          Elena Henderson
        4. profile_wide300.svg
          57 kB
          Joris Van den Bossche


People

              Assignee: Unassigned
              Reporter: Axel G (Axelg1)
              Votes: 0
              Watchers: 6
