Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.15.1
Linux OS with RHEL 7.7 distribution
blkcqas037:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Description
Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test as in ARROW-6876, except that I set use_threads=False to make an apples-to-apples comparison with respect to the number of CPUs.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")

print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)
In 0.14.1 with use_threads=False:
0.14.1
CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms
Wall time: 525 ms
In 0.15.1 with use_threads=False:
0.15.1
CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s
Wall time: 9.93 s
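The timings above come from the IPython `%time` magic, so they reflect a single run inside an interactive session. For repeating the comparison from a plain script, the following is a minimal sketch and not part of the original report. The helper names `time_call` and `bench_wide_read` are hypothetical, and taking the best of several runs is an assumption made to reduce noise. The table shape and the `use_threads=False` setting follow the test above.

```python
import os
import tempfile
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def bench_wide_read(num_columns=10000, num_rows=10, repeats=3):
    """Write a wide Parquet file and time a single-threaded read of it."""
    # Imported lazily so that time_call stays usable without pyarrow installed.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Same shape as the table in the report: many columns, few rows.
    table = pa.table(
        {"c" + str(i): np.random.randn(num_rows) for i in range(num_columns)}
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_wide.parquet")
        pq.write_table(table, path)
        # use_threads=False keeps the comparison independent of CPU count.
        elapsed = time_call(
            lambda: pq.read_table(path, use_threads=False), repeats=repeats
        )
    return pa.__version__, elapsed
```

Calling `bench_wide_read()` in an environment with each pyarrow version installed returns the version string and the best read time, which can then be compared directly between 0.14.1 and 0.15.1.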
Attachments
Issue Links
- relates to: ARROW-6876 [Python] Reading parquet file with many columns becomes slow for 0.15.0 (Resolved)
- links to