Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.4.0
- Labels: None
Description
Currently on S3 the checkpoint file manager (FileContextBasedCheckpointFileManager) is rename-based: when a file is opened for an atomic stream, a temporary file is used instead, and when the stream is committed the temporary file is renamed to the final destination.
But on S3 a rename is implemented as a copy of the whole object, so this has serious performance implications.
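For illustration, a minimal sketch of this rename-based pattern (not the actual Spark implementation; the helper name and temp-file naming are made up for this example):
{code:scala}
import java.util.EnumSet

import org.apache.hadoop.fs.{CreateFlag, FileContext, Options, Path}

// Hypothetical helper: write to a temporary file, then rename it into place.
// On HDFS the rename is a cheap metadata operation; on S3A it turns into a
// server-side COPY of the whole object followed by a DELETE of the temp file.
def writeAtomicViaRename(fc: FileContext, target: Path, bytes: Array[Byte]): Unit = {
  val tmp = new Path(target.getParent, s".${target.getName}.tmp")
  val out = fc.create(tmp, EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE))
  try {
    out.write(bytes)
  } finally {
    out.close()
  }
  fc.rename(tmp, target, Options.Rename.OVERWRITE)
}
{code}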
But on Hadoop 3 there is a new interface called Abortable, and S3AFileSystem has this capability, implemented on top of S3's multipart upload. When the file is committed a CompleteMultipartUpload POST is sent (https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html), and when it is aborted an AbortMultipartUpload DELETE is sent (https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html).
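A rough sketch of the commit-or-abort flow this enables (assuming Hadoop 3.3.1+, where FSDataOutputStream exposes abort() and S3A advertises the "fs.capability.outputstream.abortable" stream capability; names here are illustrative and not the proposed Spark API):
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: write directly to the final path, no temp file, no rename.
// On S3A, close() completes the multipart upload (CompleteMultipartUpload POST)
// and abort() cancels it (AbortMultipartUpload DELETE), so nothing becomes
// visible at the target path unless the stream is successfully committed.
def writeAtomicViaAbortable(fs: FileSystem, target: Path, bytes: Array[Byte]): Unit = {
  val out = fs.create(target, true)
  // Capability string from Hadoop's Abortable support (assumed available here).
  require(out.hasCapability("fs.capability.outputstream.abortable"),
    s"$target: the output stream does not support abort()")
  try {
    out.write(bytes)
    out.close()   // commit the multipart upload
  } catch {
    case e: Throwable =>
      out.abort() // cancel the multipart upload; no partial file is left behind
      throw e
  }
}
{code}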
Issue Links
- is depended upon by
  - SPARK-38445 Are hadoop committers used in Structured Streaming? (Open)
- links to