[ARROW-7059] [Python] Reading parquet file with many columns is much slower in 0.15.x versus 0.14.x - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.15.1
Fix Version/s: 0.16.0
Component/s: Python
Labels:
Environment:

Linux OS with RHEL 7.7 distribution

blkcqas037:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz


External issue URL:
https://github.com/apache/arrow/issues/23368

Description

Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test used in ARROW-6876, except that I set use_threads=False to make an apples-to-apples comparison with respect to the number of CPUs.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'c' + str(i): np.random.randn(10) for i in range(10000)})
pq.write_table(table, "test_wide.parquet")
res = pq.read_table("test_wide.parquet")
print(pa.__version__)
%time res = pq.read_table("test_wide.parquet", use_threads=False)

In 0.14.1 with use_threads=False:

0.14.1
CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms
Wall time: 525 ms

In 0.15.1 with use_threads=False:

0.15.1
CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s
Wall time: 9.93 s
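
To put the two measurements side by side, the short calculation below derives the slowdown factor and the approximate per-column read cost from the wall times reported above. The variable names are illustrative only. The per-column figure is a simple average over the 10000 columns in the test file, not a profiled cost.

```python
# Wall times reported in this issue, in seconds, both with use_threads=False.
wall_0_14_1 = 0.525
wall_0_15_1 = 9.93
num_columns = 10000  # columns in test_wide.parquet

# Ratio of the two wall times.
slowdown = wall_0_15_1 / wall_0_14_1

# Average time spent per column, in milliseconds.
per_column_ms_0_14_1 = wall_0_14_1 / num_columns * 1000
per_column_ms_0_15_1 = wall_0_15_1 / num_columns * 1000

print(f"slowdown: {slowdown:.1f}x")
print(f"0.14.1: {per_column_ms_0_14_1:.4f} ms per column")
print(f"0.15.1: {per_column_ms_0_15_1:.4f} ms per column")
```

That works out to a slowdown of roughly 19x, or about 1 ms per column in 0.15.1 versus about 0.05 ms in 0.14.1.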

Attachments

image-2019-11-06-08-18-42-783.png, 06/Nov/19 16:18, 4 kB, Eric Kisslinger
image-2019-11-06-08-19-11-662.png, 06/Nov/19 16:19, 105 kB, Eric Kisslinger
image-2019-11-06-08-23-18-897.png, 06/Nov/19 16:23, 96 kB, Eric Kisslinger
image-2019-11-06-08-25-05-885.png, 06/Nov/19 16:25, 109 kB, Eric Kisslinger
image-2019-11-06-09-23-54-372.png, 06/Nov/19 17:23, 32 kB, Eric Kisslinger
image-2019-11-06-13-16-05-102.png, 06/Nov/19 21:16, 26 kB, Eric Kisslinger

Issue Links

relates to: ARROW-6876 [Python] Reading parquet file with many columns becomes slow for 0.15.0 (Resolved)

links to: GitHub Pull Request #6181

Activity

People

Assignee: Wes McKinney

Reporter: Eric Kisslinger

Votes: 0

Watchers: 10

Dates

Created: 04/Nov/19 22:07

Updated: 11/Jan/23 07:51

Resolved: 14/Jan/20 20:26

Time Tracking

Estimated: Not Specified

Remaining:

Logged: