[HADOOP-16189] S3A copy/rename of large files to be parallelized as a multipart operation - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Done
Affects Version/s: 3.2.0
Fix Version/s: 3.3.2
Component/s: fs/s3
Labels:
None

Description

AWS docs on copying

file < 5GB: can be copied in a single operation.
file > 5GB: you MUST use the multipart API.

But even for files < 5GB, that's a really slow operation. And if HADOOP-16188 is to be believed, there's not enough retrying.
Even if the transfer manager does switch to multipart copies at some size, we can do for file copy what we already do for writes, which go up in 32-64 MB blocks. Something like:

l = len(src)
if l < fs.s3a.block.size:
   single copy
else:
   split file by blocks, initiate the upload, then execute each block copy as an operation in the S3A thread pool; once all done: complete the operation.

Plus: retry individual block copies, so that a failure of one block doesn't force a retry of the whole upload.
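The outline above could be sketched roughly as follows. This is only an illustration in Python, not the S3A implementation: `copy_single`, `copy_part`, `complete` and `abort` are hypothetical callbacks standing in for the real S3 calls, `pool_size` and `attempts` are made-up parameters, and the multipart upload is assumed to have been initiated before the callbacks are built.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(length, block_size):
    """Return inclusive (start, end) byte ranges covering the whole file."""
    return [(start, min(start + block_size, length) - 1)
            for start in range(0, length, block_size)]


def copy_with_retries(copy_part, part_number, byte_range, attempts=3):
    """Retry one block copy on its own, so one failure doesn't restart the upload."""
    last_error = None
    for _ in range(attempts):
        try:
            return copy_part(part_number, byte_range)
        except IOError as e:
            last_error = e
    raise last_error


def parallel_copy(length, block_size, copy_single, copy_part,
                  complete, abort, pool_size=4):
    """Single copy for small files, otherwise block copies run in a thread pool."""
    if length < block_size:
        return copy_single()
    ranges = split_ranges(length, block_size)
    try:
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            # Part numbers start at 1, as in the S3 multipart API.
            futures = [pool.submit(copy_with_retries, copy_part, n, r)
                       for n, r in enumerate(ranges, start=1)]
            # Collect results in part order; any exhausted retry raises here.
            results = [f.result() for f in futures]
    except Exception:
        abort()  # give up on the multipart upload rather than leave it pending
        raise
    return complete(results)
```

Results are collected in part order, so `complete` sees them in the sequence a multipart completion request would need.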

This is potentially more complex than it sounds, as we need to:

track the state of the ongoing copy operation
handle failures (abort, etc.)
use the if-modified/version headers to fail fast if the source file changes partway through the copy
use a bigger block size if len(file)/fs.s3a.block.size > max-block-count
maybe fall back to the classic operation
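The block-size adjustment in the list above is simple arithmetic. A minimal sketch, assuming max-block-count is S3's limit of 10,000 parts per multipart upload; `choose_block_size` is a hypothetical helper, not an existing S3A method:

```python
MAX_PARTS = 10_000  # S3 limit on the number of parts in one multipart upload


def choose_block_size(length, block_size, max_block_count=MAX_PARTS):
    """Return block_size, or the smallest larger size that fits the part limit."""
    if length <= block_size * max_block_count:
        return block_size
    # Integer ceiling division, avoiding float rounding on very large files.
    return -(-length // max_block_count)
```

If the resulting size exceeded what a single part copy allows, that would be one case for falling back to the classic operation.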

Overall, what sounds simple could get complex fast, or at least become a bigger piece of code. It needs a PoC showing the speedup before anyone attempts it.

Issue Links

relates to

HADOOP-16190 S3A copyFile operation to include source versionID or etag in the copy request

Resolved

HADOOP-16188 s3a rename failed during copy, "Unable to copy part" + 200 error code

Resolved

People

Assignee: Unassigned

Reporter: Steve Loughran

Votes: 0

Watchers: 4

Dates

Created: 14/Mar/19 14:42

Updated: 02/Aug/21 15:20

Resolved: 02/Aug/21 15:20