Apache Arrow / ARROW-14429

[C++] RecordBatchFileReader performance really bad in S3


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0.0
    • Fix Version/s: 7.0.0
    • Component/s: C++

    Description

      We use RecordBatchFileWriter to write Arrow data directly to S3 through S3FileSystem, then use RecordBatchFileReader to read it back from S3. Writing is efficient: a 50 MB file finishes within 0.2 s. Reading that same file takes 30 s, which is far too long. I then ran several tests:

      1. I used S3FileSystem alone to read the file into bytes, and that takes only 1 s, which makes me believe the problem is in RecordBatchFileReader.
      2. At half the size (around 25 MB), reading took 17 s with RecordBatchFileReader and 0.28 s without it.
      3. At double the size (around 100 MB), reading took 61 s with RecordBatchFileReader and 2.3 s without it.
      4. I fetched all the bytes using S3FileSystem first, then created a reader from those bytes and read all the content from the reader; that takes only 0.1 s.
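
      As a rough sanity check on the numbers above, the sketch below computes the slowdown and the effective throughput of the reader path for each reported file size. It uses only the standard library and does not call Arrow. The struct and function names are hypothetical, introduced here for illustration, and the sizes are the approximate values quoted in the tests.

```cpp
#include <vector>

// One measurement from the tests above. All names here are hypothetical
// helpers for this sketch, not part of the Arrow API.
struct Measurement {
  double size_mb;         // approximate file size in MB
  double reader_seconds;  // RecordBatchFileReader reading directly from S3
  double raw_seconds;     // plain byte read through S3FileSystem
};

// The three size/timing pairs reported in the description.
inline std::vector<Measurement> ReportedMeasurements() {
  return {{25.0, 17.0, 0.28}, {50.0, 30.0, 1.0}, {100.0, 61.0, 2.3}};
}

// How many times slower the reader path is than the raw byte read.
inline double Slowdown(const Measurement& m) {
  return m.reader_seconds / m.raw_seconds;
}

// Effective throughput of the reader path in MB/s.
inline double ReaderThroughputMBps(const Measurement& m) {
  return m.size_mb / m.reader_seconds;
}
```

      With these inputs the reader path comes out roughly 26x to 61x slower than the raw read, and its throughput stays between about 1.5 and 1.7 MB/s at every size. That suggests the extra time grows with file size rather than being a fixed startup cost, although three data points are not enough to say more.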

            People

              Assignee: Weston Pace (westonpace)
              Reporter: Lingkai Kong (lingkai2)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h 50m