Apache Arrow / ARROW-10308

[Python] read_csv from Python is slow on some workloads



      Description

      Hi!

      I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. By "real workloads" I mean many string, float, and all-null columns, and large files (5-10 GiB), though the file size didn't matter too much.
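
      For reference, a minimal sketch of this kind of measurement (this is not the attached benchmark-csv.py; the row count and column mix below are made up for illustration):

      ```python
      import os
      import time

      import numpy as np
      import pyarrow.csv

      # Illustrative workload: one string column, one float column, one all-null column.
      # The real files are 5-10 GiB with many such columns.
      n_rows = 1_000_000
      rng = np.random.default_rng(0)
      strings = [f"item-{i % 1000}" for i in range(n_rows)]
      floats = rng.random(n_rows)

      path = "bench.csv"
      with open(path, "w") as f:
          f.write("s1,f1,n1\n")
          for s, x in zip(strings, floats):
              f.write(f"{s},{x},\n")  # trailing empty field -> all-null column

      size_gib = os.path.getsize(path) / 2**30
      start = time.perf_counter()
      table = pyarrow.csv.read_csv(path)
      elapsed = time.perf_counter() - start
      print(f"{table.num_rows} rows, {size_gib:.3f} GiB in {elapsed:.2f} s "
            f"-> {size_gib / elapsed:.2f} GiB/s")
      ```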

      Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent on shared-pointer locking (though I'm not sure how much to trust this). I've attached the dumps in SVG format.
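
      The exact py-spy command used for the attached dumps isn't given here, but something along these lines should produce a comparable flame graph (`--native` lets py-spy sample the native Arrow C++ frames, not just the Python ones):

      ```
      py-spy record --native -o profile.svg -- python benchmark-csv.py
      ```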

      I've also attached a script and a Dockerfile that run a benchmark reproducing the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.

      This is all also available here: https://github.com/drorspei/arrow-csv-benchmark

        Attachments

        1. profile1.svg (25 kB, Dror Speiser)
        2. profile3.svg (25 kB, Dror Speiser)
        3. profile4.svg (25 kB, Dror Speiser)
        4. profile2.svg (25 kB, Dror Speiser)
        5. Dockerfile (0.2 kB, Dror Speiser)
        6. benchmark-csv.py (4 kB, Dror Speiser)
        7. arrow-csv-benchmark-times.csv (8 kB, Dror Speiser)
        8. arrow-csv-benchmark-plot.png (12 kB, Dror Speiser)

              People

              • Assignee: Unassigned
              • Reporter: Dror Speiser (drorspei)
              • Votes: 0
              • Watchers: 4
