Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.15.1
-
None
Description
Hi,
I am currently running a small test with pyarrow to load 2 "partitioned" parquet tables.
The performance seems to be 2 times less with pyarrow-0.15.1 than with pyarrow-0.12.1.
In my test:
- the 'parquet' tables I am loading are called 'reports' & 'memory'
- they have been generated through pandas.to_parquet by specifying the partitions columns
- they are both partitioned by 2 columns 'p_type' and 'p_start'
- it is small tables:
- reports
- 90 partitions (1 parquet file / partition)
- total size: 6.2MB
- memory
- 105 partitions (1 parquet file / partition)
- total size: 9.1MB
- reports
Here is the code of my simple test that tries to read them (I'm using a filter on the p_start partition):
// code placeholder import os import sys import time import pyarrow from pyarrow.parquet import ParquetDataset def load_dataframe(data_dir, table, start_date, end_date): return ParquetDataset(os.path.join(data_dir, table), filters=[('p_start', '>=', start_date), ('p_start', '<=', end_date) ]).read().to_pandas() print(f'pyarrow version;{pyarrow.__version__}') data_dir = sys.argv[1] for i in range(1, 10): start = time.time() start_date = '201912230000' end_date = '202001080000' load_dataframe(sys.argv[1], 'reports', start_date, end_date) load_dataframe(sys.argv[1], 'memory', start_date, end_date) print(f'loaded;in;{time.time()-start}')
Here are the results:
- with pyarrow-0.12.1
$ python -m cProfile -o load_pyarrow_0.12.1.cprof load_df_from_pyarrow.py parquet/
pyarrow version;0.12.1
loaded;in;0.5566098690032959
loaded;in;0.32605648040771484
loaded;in;0.28951501846313477
loaded;in;0.29279112815856934
loaded;in;0.3474299907684326
loaded;in;0.4075736999511719
loaded;in;0.425199031829834
loaded;in;0.34653329849243164
loaded;in;0.300839900970459
(~350ms to load the 2 tables)
- with pyarrow-0.15.1
$ python -m cProfile -o load_pyarrow_0.15.1.cprof load_df_from_pyarrow.py parquet/
pyarrow version;0.15.1
loaded;in;1.1126022338867188
loaded;in;0.8931224346160889
loaded;in;1.3298325538635254
loaded;in;0.8584625720977783
loaded;in;0.9232609272003174
loaded;in;1.0619215965270996
loaded;in;0.8619768619537354
loaded;in;0.8686420917510986
loaded;in;1.1183602809906006
(>800ms to load the 2 tables)
Is there a performance regression here ?
Am I missing something ?
In attachment, you can find the 2 .cprof files.