HADOOP-16189 (sub-task of HADOOP-17566 Über-jira: S3A Hadoop 3.4 features)

S3A copy/rename of large files to be parallelized as a multipart operation



    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels:


      Per the AWS docs on copying:

      • files < 5 GB can be copied in a single operation
      • files > 5 GB MUST be copied with the multipart API.

      But even for files < 5 GB, a single copy request is a really slow operation. And if HADOOP-16188 is to be believed, there's not enough retrying.
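
      For reference, a minimal sketch of the classic single-request copy, assuming the v1 AWS SDK that S3A builds on; the method name is illustrative:

      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.model.CopyObjectRequest;

      // Classic single-request copy: one call, accepted below 5 GB and
      // rejected above it. It is a single server-side operation whose
      // duration grows with the file size.
      static void classicCopy(AmazonS3 s3, String srcBucket, String srcKey,
          String dstBucket, String dstKey) {
        s3.copyObject(
            new CopyObjectRequest(srcBucket, srcKey, dstBucket, dstKey));
      }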
      Even if the transfer manager does switch to multipart copies above some size, we can copy in blocks ourselves, just as we do our writes in 32-64 MB blocks. Something like:

      l = len(src)
      if l < fs.s3a.block.size:
        single copy
      else:
        split the file into blocks, initiate a multipart upload, then
        execute each block copy as an operation in the S3A thread pool;
        once all blocks are copied, complete the operation.

      + do retries on individual block copies, so a failure of one block doesn't force a retry of the whole upload (see the sketch below).
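
      A minimal sketch of that flow, assuming the v1 AWS SDK that S3A currently uses; the class name, helper names, and retry count are illustrative, not the S3A implementation:

      import com.amazonaws.AmazonClientException;
      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.model.*;

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Future;

      public class ParallelCopySketch {

        static void multipartCopy(AmazonS3 s3, ExecutorService pool,
            String srcBucket, String srcKey,
            String dstBucket, String dstKey,
            long length, long blockSize) throws Exception {
          // initiate the upload; every part copy shares this upload id
          String uploadId = s3.initiateMultipartUpload(
              new InitiateMultipartUploadRequest(dstBucket, dstKey))
              .getUploadId();
          try {
            List<Future<PartETag>> futures = new ArrayList<>();
            int partNumber = 1;
            for (long offset = 0; offset < length;
                 offset += blockSize, partNumber++) {
              final int part = partNumber;
              final long first = offset;
              final long last = Math.min(offset + blockSize, length) - 1;
              // each block copy is an independent task in the thread pool
              futures.add(pool.submit(() -> copyWithRetries(s3,
                  new CopyPartRequest()
                      .withSourceBucketName(srcBucket).withSourceKey(srcKey)
                      .withDestinationBucketName(dstBucket)
                      .withDestinationKey(dstKey)
                      .withUploadId(uploadId)
                      .withPartNumber(part)
                      .withFirstByte(first).withLastByte(last),
                  3)));
            }
            List<PartETag> etags = new ArrayList<>();
            for (Future<PartETag> f : futures) {
              etags.add(f.get());       // propagates the first failure
            }
            s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                dstBucket, dstKey, uploadId, etags));
          } catch (Exception e) {
            // abort on failure so the incomplete upload doesn't linger
            s3.abortMultipartUpload(
                new AbortMultipartUploadRequest(dstBucket, dstKey, uploadId));
            throw e;
          }
        }

        // retry one block copy, so a transient failure of a single part
        // doesn't force a retry of the whole upload
        static PartETag copyWithRetries(AmazonS3 s3, CopyPartRequest request,
            int attempts) {
          for (int attempt = 1; ; attempt++) {
            try {
              return s3.copyPart(request).getPartETag();
            } catch (AmazonClientException e) {
              if (attempt >= attempts) {
                throw e;
              }
            }
          }
        }
      }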

      This is potentially more complex than it sounds, as:

      • there's a need to track the state of the ongoing copy operation
      • failures have to be handled (abort the upload, etc.)
      • the copy-source if-match/version constraints should be used to fail fast if the source file changes partway through the copy (see the sketch after this list)
      • if len(file)/fs.s3a.block.size exceeds the maximum part count (10,000 parts for an S3 multipart upload), a bigger block size is needed
      • a fallback to the classic single-request operation may be needed
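
      A hedged sketch of two of those points (fail-fast change detection and the part-count limit); it assumes the constraint setters on the SDK v1 CopyPartRequest and S3's 10,000-part limit, and both helper names are hypothetical:

      import com.amazonaws.services.s3.model.CopyPartRequest;

      static final int MAX_PARTS = 10_000;  // S3 multipart upload part limit

      // Hypothetical helper: grow the block size when the file would
      // otherwise need more than MAX_PARTS parts.
      static long effectiveBlockSize(long fileLength, long configuredBlockSize) {
        long minSize = (fileLength + MAX_PARTS - 1) / MAX_PARTS;  // ceiling
        return Math.max(configuredBlockSize, minSize);
      }

      // Hypothetical helper: make a part copy fail fast if the source
      // object was replaced partway through the rename.
      static CopyPartRequest withChangeDetection(CopyPartRequest request,
          String sourceVersionId, String sourceEtag) {
        if (sourceVersionId != null) {
          request.setSourceVersionId(sourceVersionId);     // pin the version
        } else if (sourceEtag != null) {
          request.withMatchingETagConstraint(sourceEtag);  // reject on change
        }
        return request;
      }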

      Overall, what sounds simple could get complex fast, or at least become a bigger piece of code. It needs a PoC demonstrating the speedup before anyone attempts it.





    • Assignee: Steve Loughran (stevel@apache.org)

