[ARROW-6876] [Python] Reading parquet file with many columns becomes slow for 0.15.0 - ASF JIRA

Hi,

I just noticed that reading a Parquet file with pandas became really slow after I upgraded pyarrow to 0.15.0.

Example:

With 0.14.1
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With 0.15.0
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15 MB in size. Both tests ran on the same machine with the same versions of Python and pandas.
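For anyone who wants to repeat the comparison outside IPython, here is a minimal sketch of a timing helper that roughly mirrors the "7 runs, 1 loop each" setup above. The helper name `time_call` and its `runs` parameter are made up for illustration, and `path` stands for the Parquet file in question.

```python
import statistics
import time


def time_call(fn, runs=7):
    """Call fn() `runs` times; return (mean, std dev) of wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    # Population standard deviation, so a single run does not raise an error.
    return statistics.mean(durations), statistics.pstdev(durations)


# Usage (assumes pandas and pyarrow are installed; `path` is a placeholder):
#   import pandas as pd
#   mean, std = time_call(lambda: pd.read_parquet(path))
#   print(f"{mean:.2f} s ± {std * 1000:.0f} ms per run")
```

Running the same script in two environments that differ only in the pyarrow version should show whether the slowdown reproduces.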

Have you received similar complaints? What could be the issue here?

Thanks a lot.

Edit1:

Some profiling I did:

0.14.1:

0.15.0:
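One way to collect a comparable profile under each version is the standard-library `cProfile` module. This is only a sketch and not necessarily the tool used for the profiles above; the helper name `profile_call` is hypothetical, and `path` again stands for the Parquet file.

```python
import cProfile
import io
import pstats


def profile_call(fn, top=15):
    """Profile a single call to fn() and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


# Usage (assumes pandas and pyarrow are installed; `path` is a placeholder):
#   import pandas as pd
#   print(profile_call(lambda: pd.read_parquet(path)))
```

Comparing the two reports side by side should show which call accounts for the extra time in 0.15.0. Note that `cProfile` only sees Python-level calls, so time spent inside native code is attributed to the Python function that called it.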

is related to

ARROW-7059 [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x

links to

GitHub Pull Request #5653

Estimated: Not Specified

Remaining:

Logged: 2h 40m