Description
The AWS documentation on copying says:
- files < 5 GB can be copied in a single operation
- files > 5 GB MUST use the multipart copy API.
But even for files < 5 GB, that single-request copy is slow. And if HADOOP-16188 is to be believed, there isn't enough retrying either.
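For reference, below the 5 GB limit the copy is one server-side request. A minimal sketch against the v1 AWS Java SDK (the method and parameter names here are illustrative, not existing S3A code):
{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CopyObjectRequest;

public class SingleCopySketch {
  /** Single-request server-side copy: only valid for objects up to 5 GB. */
  static void simpleCopy(AmazonS3 s3,
      String srcBucket, String srcKey,
      String dstBucket, String dstKey) {
    // One request copies the whole object; if it fails part-way,
    // the entire copy has to be retried from the start.
    s3.copyObject(new CopyObjectRequest(srcBucket, srcKey, dstBucket, dstKey));
  }
}
{code}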
Even if the transfer manager does switch to multipart copies at some size, then just as we do our writes in 32-64 MB blocks, we can do the same for file copy. Something like:
l = len(src)
if l < fs.s3a.block.size:
    single copy
else:
    split the file by blocks, initiate the multipart upload,
    execute each block copy as an operation in the S3A thread pool,
    and once all blocks are done, complete the operation.
+ do retries on individual block copies, so the failure of one block doesn't force a retry of the whole upload.
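A minimal sketch of what that could look like against the v1 AWS Java SDK, assuming a shared ExecutorService such as the S3A thread pool; blockCopy, copyOnePart and the retries parameter are invented names for illustration, not existing S3A code:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CopyPartRequest;
import com.amazonaws.services.s3.model.CopyPartResult;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;

public class ParallelCopySketch {

  /** Copy src to dest block by block, with the blocks copied in parallel. */
  static void blockCopy(AmazonS3 s3, ExecutorService executor,
      String srcBucket, String srcKey, String dstBucket, String dstKey,
      long length, long blockSize, int retries) throws Exception {

    // Initiate the multipart upload at the destination.
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(dstBucket, dstKey)).getUploadId();
    try {
      // Submit one copy-part task per block to the shared thread pool.
      List<Future<PartETag>> futures = new ArrayList<>();
      int partNumber = 1;
      for (long offset = 0; offset < length; offset += blockSize, partNumber++) {
        final int part = partNumber;
        final long first = offset;
        final long last = Math.min(offset + blockSize, length) - 1;
        Callable<PartETag> task = () -> copyOnePart(s3,
            srcBucket, srcKey, dstBucket, dstKey,
            uploadId, part, first, last, retries);
        futures.add(executor.submit(task));
      }
      // Wait for every block, then complete the upload.
      List<PartETag> etags = new ArrayList<>();
      for (Future<PartETag> future : futures) {
        etags.add(future.get());
      }
      s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          dstBucket, dstKey, uploadId, etags));
    } catch (Exception e) {
      // On an unrecoverable failure, abort so uncommitted parts are not left
      // around (and billed). Real code would also cancel outstanding tasks.
      s3.abortMultipartUpload(
          new AbortMultipartUploadRequest(dstBucket, dstKey, uploadId));
      throw e;
    }
  }

  /** Copy a single byte range, retrying only this block on failure. */
  static PartETag copyOnePart(AmazonS3 s3,
      String srcBucket, String srcKey, String dstBucket, String dstKey,
      String uploadId, int partNumber, long firstByte, long lastByte,
      int retries) throws Exception {
    for (int attempt = 0; ; attempt++) {
      try {
        CopyPartResult result = s3.copyPart(new CopyPartRequest()
            .withSourceBucketName(srcBucket)
            .withSourceKey(srcKey)
            .withDestinationBucketName(dstBucket)
            .withDestinationKey(dstKey)
            .withUploadId(uploadId)
            .withPartNumber(partNumber)
            .withFirstByte(firstByte)
            .withLastByte(lastByte));
        return result.getPartETag();
      } catch (Exception e) {
        if (attempt >= retries) {
          throw e;
        }
        // Retry just this block; real code would back off and distinguish
        // retriable from unrecoverable failures.
      }
    }
  }
}
{code}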
This is potentially more complex than it sounds, as
- there's the need to track the state of the ongoing copy operation
- handle failures (abort, etc)
- use the if-modified/version headers to fail fast if the source file changes partway through the copy (see the sketch after this list)
- if the len(file)/fs.s3a.block.size > max-block-count, use a bigger block size
- maybe need to fall back to the classic single-request copy operation
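On the block-size and change-detection points, a rough sketch under the same assumptions (MAX_PARTS reflects S3's documented 10,000-part limit; the helper names are invented, and the CopyPartRequest constraint setters are assumed to mirror those of CopyObjectRequest):
{code:java}
import com.amazonaws.services.s3.model.CopyPartRequest;

public class CopyTuningSketch {

  // S3 multipart uploads are limited to 10,000 parts.
  static final int MAX_PARTS = 10000;

  /** Scale the block size up if the file would otherwise need too many parts. */
  static long chooseBlockSize(long fileLength, long configuredBlockSize) {
    long minSize = (fileLength + MAX_PARTS - 1) / MAX_PARTS; // ceiling division
    return Math.max(configuredBlockSize, minSize);
  }

  /** Constrain a block copy so it fails fast if the source object has changed. */
  static CopyPartRequest withChangeDetection(CopyPartRequest request,
      String sourceETag, String sourceVersionId) {
    // The part copy is rejected unless the source still carries this etag;
    // pinning the version id (where versioning is enabled) is the stronger check.
    return request
        .withMatchingETagConstraint(sourceETag)
        .withSourceVersionId(sourceVersionId);
  }
}
{code}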
Overall, what sounds simple could get complex fast, or at least become a bigger piece of code. It needs a proof of concept demonstrating a real speedup before being attempted.
Issue Links
- relates to HADOOP-16190: S3A copyFile operation to include source versionID or etag in the copy request (Resolved)
- relates to HADOOP-16188: s3a rename failed during copy, "Unable to copy part" + 200 error code (Resolved)