  1. Hadoop Common
  2. HADOOP-15620 Über-jira: S3A phase VI: Hadoop 3.3 features
  3. HADOOP-16189

S3A copy/rename of large files to be parallelized as a multipart operation

    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels:
      None

      Description

      AWS docs on copying

      • file < 5GB: can be copied in a single operation
      • file > 5GB: you MUST use the multipart API.

      But even for files < 5GB, a single copy is a really slow operation. And if HADOOP-16188 is to be believed, there's not enough retrying.
      Even if the transfer manager does switch to multipart copies at some size, we can do for file copy what we already do for writes, which go up in 32-64 MB blocks. Something like

      l = len(src)
      if l < fs.s3a.block.size:
        single copy
      else:
        split file by blocks, initiate the upload, then execute each block copy as an operation in the S3A thread pool; once all done: complete the operation.
      

      + do retries on the individual block copies, so that the failure of one block copy doesn't force a retry of the whole upload.
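
      The outline above could be prototyped roughly as follows. This is only a sketch under stated assumptions, not the S3A implementation: `store` is a hypothetical client wrapper with `single_copy`, `initiate`, `copy_part`, `complete` and `abort` methods, `SourceChangedError` is an invented exception for a failed source precondition, and treating `IOError` as the only retryable failure is an assumption too.

```python
from concurrent.futures import ThreadPoolExecutor


class SourceChangedError(Exception):
    """Hypothetical: raised by the store when the source no longer matches `etag`."""


def plan_parts(length, block_size):
    """Split [0, length) into inclusive (first_byte, last_byte) ranges."""
    return [(start, min(start + block_size, length) - 1)
            for start in range(0, length, block_size)]


def copy_with_retries(copy_part, upload_id, number, byte_range, etag, attempts):
    """Retry one block copy; only this block is retried, never the whole upload."""
    for attempt in range(1, attempts + 1):
        try:
            return copy_part(upload_id, number, byte_range, etag)
        except IOError:
            # Assumed transient. SourceChangedError is not caught here,
            # so a changed source fails fast without retries.
            if attempt == attempts:
                raise


def parallel_copy(store, length, etag, block_size, pool, attempts=3):
    """Copy `length` bytes as one operation or as parallel block copies."""
    if length < block_size:
        return store.single_copy(etag)
    upload_id = store.initiate()
    try:
        futures = [
            pool.submit(copy_with_retries, store.copy_part, upload_id,
                        number, byte_range, etag, attempts)
            for number, byte_range
            in enumerate(plan_parts(length, block_size), start=1)
        ]
        # Collect results in part order; result() re-raises a block's failure.
        parts = [future.result() for future in futures]
    except Exception:
        # Abort so the store does not keep the partial upload around.
        store.abort(upload_id)
        raise
    return store.complete(upload_id, parts)
```

      A caller would pass a shared `ThreadPoolExecutor` as `pool`. A fuller version would probably also cancel outstanding block copies before aborting; that is left out here to keep the sketch short.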

      This is potentially more complex than it sounds, as

      • there's the need to track the state of the ongoing copy operation
      • failures have to be handled (abort, etc.)
      • the if-modified/version headers should be used to fail fast if the source file changes partway through the copy
      • if len(file)/fs.s3a.block.size > max-block-count, a bigger block size is needed
      • we may need to fall back to the classic operation
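
      The block-size adjustment in the list above is simple arithmetic, sketched here. The function names are hypothetical, and the default of 10,000 is taken from the documented S3 limit on parts per multipart upload; the issue itself only says "max-block-count".

```python
MAX_PARTS = 10_000  # assumed default: S3's limit on parts per multipart upload


def part_count(length, block_size):
    """Number of blocks needed to cover `length` bytes (ceiling division)."""
    return -(-length // block_size)


def effective_block_size(length, block_size, max_parts=MAX_PARTS):
    """Return block_size, enlarged if needed so the copy fits in max_parts blocks."""
    if part_count(length, block_size) <= max_parts:
        return block_size
    # Smallest block size that needs at most max_parts blocks.
    return -(-length // max_parts)
```

      With a 64 MB block size the configured value is kept for any file up to 10,000 blocks (625 GB); beyond that the block size grows just enough to stay within the limit.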

      Overall, what sounds simple could get complex fast, or at least turn into a bigger piece of code. This needs a PoC showing a speedup before attempting it.

              People

              • Assignee:
                Unassigned
              • Reporter:
                Steve Loughran (stevel@apache.org)
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                  Updated: