Apache Arrow / ARROW-8875

[C++] use AWS SDK SetResponseStreamFactory to avoid a copy of bytes


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • None
    • 1.0.0
    • Component: C++

    Description

      Currently, in `GetObjectRange` of s3fs, the `GetObjectRequest` has no `ResponseStreamFactory` assigned. This means that the bytes returned by the S3 API are first written to a `std::basic_stringbuf`. As I understand it, this has two performance impacts:

      • `std::basic_stringbuf` buffers the response in a growing array, so it reallocates repeatedly as the response arrives
      • on top of that, the data is copied out of the `std::basic_stringbuf` when it is read into the Arrow buffer.

      This seems to be a bit costly.

      With `ResponseStreamFactory`, we might manage to get the data directly into the Arrow buffer.
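
      To make the idea concrete, here is a minimal sketch using only the standard library, not tested against the AWS SDK. It assumes that `Aws::IOStreamFactory` is a `std::function` returning a newly allocated iostream that the SDK then owns, so a factory of the shape below could be passed to `SetResponseStreamFactory`. The names `FixedBufferStreamBuf`, `FixedBufferStream` and `MakeFixedBufferFactory` are hypothetical, and the destination pointer would be the mutable data of a pre-sized Arrow buffer. I believe the SDK also ships a similar `Aws::Utils::Stream::PreallocatedStreamBuf`, which may make the custom streambuf unnecessary.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <memory>
#include <streambuf>
#include <string>
#include <vector>

// Stream buffer whose put area is caller-owned memory, so bytes written to the
// stream land directly in the destination with no intermediate growing array.
// The default overflow() reports failure, so writes past the end are rejected.
class FixedBufferStreamBuf : public std::streambuf {
 public:
  FixedBufferStreamBuf(char* data, std::size_t size) { setp(data, data + size); }

  std::size_t bytes_written() const {
    return static_cast<std::size_t>(pptr() - pbase());
  }
};

// iostream wrapper that owns the stream buffer above (hypothetical name).
class FixedBufferStream : public std::iostream {
 public:
  FixedBufferStream(char* data, std::size_t size)
      : std::iostream(nullptr), buf_(data, size) {
    rdbuf(&buf_);  // attach the buffer once it is constructed; clears badbit
  }

  std::size_t bytes_written() const { return buf_.bytes_written(); }

 private:
  FixedBufferStreamBuf buf_;
};

// Factory with the shape assumed for Aws::IOStreamFactory: a callable that
// returns a heap-allocated stream. The caller of the factory takes ownership.
std::function<std::iostream*()> MakeFixedBufferFactory(char* data,
                                                       std::size_t size) {
  return [data, size]() -> std::iostream* {
    return new FixedBufferStream(data, size);
  };
}
```

      The destination must be allocated up front with the size of the requested range, which is known in `GetObjectRange`. A response longer than the destination puts the stream into a failed state instead of growing the buffer, so the caller would need to check the stream state or the number of bytes written.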

      I can give it a try, but I would need some advice. Is there an existing utility to stream data into an Arrow buffer (if there is, it is well hidden!)? Or should I stream the data into a plain array and then transfer ownership to Arrow?


            People

              Assignee: Unassigned
              Reporter: Rémi Dettai (rdettai)
              Votes: 0
              Watchers: 4
