Apache Arrow / ARROW-8950

[C++] Make head optional in s3fs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: C++

    Description

      When you open an input file with s3fs, it issues a HEAD request to S3 to check that the file is present and authorized and to get its size (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).

      This call comes with a non-negligible cost:

      • adds latency
      • priced the same as a GET request by AWS

      I fail to see use cases where this call is really crucial:

      • if the file is not present/authorized, failing at the first read seems to have mostly the same effect as failing on opening. I agree that it is kind of "usual" for an open call to fail eagerly, so to avoid surprises we could add a flag indicating whether OpenInputFile is allowed to succeed on an inaccessible file.
      • getting the size can be done on the first read, and could mostly be avoided on the caller side if the filesystem API provided read-from-end capabilities (compatible with local filesystem reads using ios::end and, on HTTP filesystems, with bytes=-xxx ranges). In the worst case, the HEAD call could be done lazily when calling GetSize().
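
      To make the two ideas above concrete, here is a minimal sketch of a lazy size lookup combined with a read-from-end call. It is not the s3fs implementation: FakeS3Client, LazyInputFile, ReadFromEnd and SuffixRangeHeader are hypothetical names invented for illustration, and the "client" only counts simulated requests instead of talking to S3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an S3 client: it holds one object in memory
// and counts the requests that a real client would send.
struct FakeS3Client {
  std::string object;
  int head_requests = 0;
  int get_requests = 0;

  // Simulates a HEAD request: returns the object size.
  int64_t Head() {
    ++head_requests;
    return static_cast<int64_t>(object.size());
  }

  // Simulates a GET with "Range: bytes=-n": returns the last n bytes,
  // or the whole object if n exceeds its size.
  std::string GetSuffix(int64_t n) {
    assert(n >= 0);
    ++get_requests;
    const int64_t size = static_cast<int64_t>(object.size());
    const int64_t len = n < size ? n : size;
    return object.substr(static_cast<std::size_t>(size - len));
  }
};

// Formats an HTTP suffix byte-range header value, e.g. "bytes=-8".
inline std::string SuffixRangeHeader(int64_t n) {
  return "bytes=-" + std::to_string(n);
}

// Hypothetical input file that sends no HEAD request when it is opened.
class LazyInputFile {
 public:
  explicit LazyInputFile(FakeS3Client* client) : client_(client) {}

  // Sends HEAD on the first call only; later calls reuse the cached size.
  int64_t GetSize() {
    if (size_ < 0) size_ = client_->Head();
    return size_;
  }

  // Reads the last n bytes without knowing the object size beforehand.
  std::string ReadFromEnd(int64_t n) { return client_->GetSuffix(n); }

 private:
  FakeS3Client* client_;
  int64_t size_ = -1;  // -1 means "size not fetched yet"
};
```

      With this shape, a caller that only needs the tail of the file (a footer, for example) calls ReadFromEnd and never pays for HEAD, while a caller that really needs the size pays for exactly one HEAD on the first GetSize() call. The trade-off is the one mentioned in the first bullet: a missing or unauthorized object is only detected at the first read or size query, not at open time.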

      I agree that it makes things a bit more complex, and I understand that you would not want to complicate the generic filesystem API because of blob storage behavior. But there are clearly workloads where this has a significant impact.


          People

            Assignee: Antoine Pitrou (apitrou)
            Reporter: Rémi Dettai (rdettai)
            Votes: 0
            Watchers: 3


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 50m
