Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 1.0.1
- Fix Version/s: None
- Environment:
  - Machine: Azure, 48 vCPUs, 384 GiB RAM
  - OS: Ubuntu 18.04
  - Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
Description
Hi!
I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" here means many string, float, and all-null columns, and large files (5-10 GiB), though the file size didn't matter too much.
Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent on shared pointer lock mechanisms (though I'm not sure if this is to be trusted). I've attached the dumps in SVG format.
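To be concrete, the profiled workload is essentially just a single `read_csv` call; this is a minimal sketch with a placeholder file name, not the exact benchmark script (that one is in the repo linked below). The SVGs come from a py-spy invocation along the lines of `py-spy record -o profile.svg -- python read_bench.py`.

```python
# read_bench.py -- minimal profiling target (placeholder name and path;
# the full benchmark script is in the linked repo)
import pyarrow.csv

# A large CSV with many string, float, and all-null columns.
table = pyarrow.csv.read_csv("data.csv")
print(table.num_rows, table.num_columns)
```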
I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
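For reference, the throughput measurement amounts to something like the sketch below (placeholder path; the actual benchmark script and data generation are in the repo above):

```python
import os
import time

import pyarrow.csv

path = "data.csv"  # placeholder; the real benchmark uses ~5-10 GiB files
size_gib = os.path.getsize(path) / 2**30

start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start

print(f"read {size_gib:.2f} GiB in {elapsed:.1f} s "
      f"-> {size_gib / elapsed:.2f} GiB/s")
```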
Attachments
Issue Links
- is related to:
  - ARROW-10328 [C++] Consider using fast-double-parser (Resolved)
  - ARROW-10313 [C++] Improve UTF8 validation speed and CSV string conversion (Resolved)