Spark / SPARK-40039

Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Currently on S3 the checkpoint file manager (called FileContextBasedCheckpointFileManager) is based on rename. When a file is opened for an atomic stream, a temporary file is used instead, and when the stream is committed the temporary file is renamed to the final name.

      But on S3 a rename is a file copy, so this has serious performance implications.

      Hadoop 3 introduces a new interface called Abortable, and S3AFileSystem has this capability, which is implemented on top of S3's multipart upload. When the file is committed a POST is sent (https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html), and when it is aborted a DELETE is sent (https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html).
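
      To make the difference between the two commit strategies concrete, here is a minimal toy model in Python. It does not use Hadoop, Spark, or S3 at all: the class names RenameBasedStream and AbortableStream and the function demo are hypothetical, a local temporary directory stands in for the object store, and an in-memory list stands in for the parts of a multipart upload. It is only a sketch of the semantics described above, not the actual implementation.

      ```python
      import os
      import tempfile
      import uuid


      class RenameBasedStream:
          """Toy model (hypothetical): write to a temp file, rename it on commit."""

          def __init__(self, final_path):
              self.final_path = final_path
              directory, name = os.path.split(final_path)
              self.temp_path = os.path.join(directory, f".{name}.{uuid.uuid4().hex}.tmp")
              self._file = open(self.temp_path, "wb")

          def write(self, data):
              self._file.write(data)

          def commit(self):
              self._file.close()
              # Cheap on a local disk; on an object store this step would be a copy.
              os.replace(self.temp_path, self.final_path)

          def abort(self):
              self._file.close()
              os.remove(self.temp_path)


      class AbortableStream:
          """Toy model (hypothetical): data becomes visible only on commit, no rename."""

          def __init__(self, final_path):
              self.final_path = final_path
              self._parts = []  # stands in for the parts of a multipart upload

          def write(self, data):
              self._parts.append(bytes(data))

          def commit(self):
              # Stands in for completing the multipart upload. The local write
              # used here is only an illustration and is not itself atomic.
              with open(self.final_path, "wb") as f:
                  f.write(b"".join(self._parts))
              self._parts = []

          def abort(self):
              # Stands in for aborting the multipart upload: nothing is published.
              self._parts = []


      def demo():
          """Commit one file and abort another with each strategy."""
          results = {}
          with tempfile.TemporaryDirectory() as d:
              for name, cls in (("rename", RenameBasedStream), ("abortable", AbortableStream)):
                  committed = os.path.join(d, f"{name}-committed")
                  aborted = os.path.join(d, f"{name}-aborted")

                  stream = cls(committed)
                  stream.write(b"offset=42")
                  stream.commit()

                  stream = cls(aborted)
                  stream.write(b"offset=43")
                  stream.abort()

                  with open(committed, "rb") as f:
                      results[name] = (f.read(), os.path.exists(aborted))
              leftover = sorted(os.listdir(d))
          return results, leftover
      ```

      Both strategies give the same visible outcome (a committed file appears, an aborted one never does), but only the rename-based one needs a temporary file and a rename, which is the step that turns into a copy on S3.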

          People

            Assignee: Attila Zsolt Piros (attilapiros)
            Reporter: Attila Zsolt Piros (attilapiros)