ARROW-10308: [Python] read_csv from Python is slow on some workloads

    Description

      Hi!

      I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" here means many string, float, and all-null columns, and large files (5-10 GiB), though the file size didn't matter too much.

      Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent in shared-pointer locking (though I'm not sure this is to be trusted). I've attached the dumps in SVG format.

      I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.

      This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
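      For reference, a minimal sketch of the kind of timing I'm doing (this is not the attached benchmark-csv.py; the file path is a placeholder, and the attached script generates its own data with the column mix described above):

      ```python
      import os
      import time
      import pyarrow.csv

      # Placeholder path: any large CSV with many string, float, and
      # all-null columns will do for a rough throughput number.
      path = "large.csv"

      start = time.perf_counter()
      table = pyarrow.csv.read_csv(path)
      elapsed = time.perf_counter() - start

      # Report end-to-end read throughput in GiB/s.
      size_gib = os.path.getsize(path) / 2**30
      print(f"{size_gib:.2f} GiB in {elapsed:.1f} s -> {size_gib / elapsed:.2f} GiB/s")
      ```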

      Attachments

        1. profile4.svg (25 kB, Dror Speiser)
        2. profile3.svg (25 kB, Dror Speiser)
        3. profile2.svg (25 kB, Dror Speiser)
        4. profile1.svg (25 kB, Dror Speiser)
        5. Dockerfile (0.2 kB, Dror Speiser)
        6. benchmark-csv.py (4 kB, Dror Speiser)
        7. arrow-csv-benchmark-times.csv (8 kB, Dror Speiser)
        8. arrow-csv-benchmark-plot.png (12 kB, Dror Speiser)

            People

                Assignee: Unassigned
                Reporter: Dror Speiser