Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-17481

[C++] [Python] Major performance improvements to CSV reading from S3




      The current dataset reader for CSV is pretty slow on EC2 reading from S3.

      EC2 instances have more than 3Gbps network bandwidth which make them on par with SSD. However reading batches from disk is more than 3x faster than reading from network. This should not happen.

      The reason why the dataset reader is not fully leveraging the network bandwidth is because reads are currently serial. We should change the reads to be parallel. Then even if the rest of the pipeline is not parallel we should get same read speed as disk.

      Note one might think that if you have many fragments fragment-level parallelism will take care of this. This is true to some extent however to_batches() is ordered. This means that if your fragments are big the fragment readahead will stop being effective after a while as the reader tries to deplete the fragments in order. The batch readahead for the CSV reader current is a serial readahead, which really should be a parallel readahead.

      After changing the network IO to be parallel, we should also change the parse and decode to be parallel. It's easy to change the parse to be parallel, a bit harder for the decode because of how the decoder operator works, so I will just tackle the parse first.

      On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these changes (parallel reading and parallel parsing) made reading 60 batches (~10GB) 4x faster. Note these changes will also make disk reading faster due to parallel parse.




            marsupialtail Ziheng Wang
            marsupialtail Ziheng Wang
            0 Vote for this issue
            3 Start watching this issue



              Time Tracking

                Original Estimate - Not Specified
                Not Specified
                Remaining Estimate - 0h
                Time Spent - 7h 50m
                7h 50m