SPARK-25331: Structured Streaming File Sink duplicates records in case of driver failure


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Let's assume FileStreamSink.addBatch is called, an appropriate job has been started by FileFormatWriter.write, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating FileStreamSink.addBatch will write the data again.

      In other words, if the driver fails after the executors have finished processing the batch but before the commit is recorded, the batch will be written twice on retry.
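
      For illustration, here is a minimal sketch of the exactly-once guard in FileStreamSink.addBatch and where the gap appears. It is not the actual Spark source; latestCommittedBatchId and writeFiles are simplified stand-ins for the sink's metadata log and for FileFormatWriter.write.

      {code:scala}
      import org.apache.spark.sql.DataFrame

      // Simplified sketch, not the real FileStreamSink implementation.
      class FileStreamSinkSketch {
        // stand-in for the sink's metadata log of committed batch ids
        private var latestCommittedBatchId: Long = -1L

        def addBatch(batchId: Long, data: DataFrame): Unit = {
          if (batchId <= latestCommittedBatchId) {
            // Batch is already recorded in the manifest log: skip it.
          } else {
            writeFiles(data)                  // executors write the output files
            latestCommittedBatchId = batchId  // commitJob records the manifest last
            // If the driver dies after writeFiles but before the manifest is
            // recorded, the retried addBatch finds no committed entry and
            // writes the same files again, duplicating the records.
          }
        }

        private def writeFiles(data: DataFrame): Unit = ()
      }
      {code}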

      Steps needed:

      1. Call FileStreamSink.addBatch.
      2. Make the ManifestFileCommitProtocol fail to finish its commitJob.
      3. Call FileStreamSink.addBatch with the same data.
      4. Make the ManifestFileCommitProtocol finish its commitJob successfully.
      5. Verify the file output - according to the Sink.addBatch documentation the data should be written only once.

      I have created a WIP PR with a unit test:
      https://github.com/apache/spark/pull/22331
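
      A rough outline of such a test might look like the following. The helpers failNextCommitJob and writeBatch are hypothetical placeholders (not real Spark test APIs), and reading the result back with spark.read.parquet assumes a parquet output format:

      {code:scala}
      import scala.util.Try
      import org.apache.spark.sql.{DataFrame, SparkSession}

      object DuplicateWriteRepro {
        // Hypothetical placeholders, not real Spark test APIs:
        // failNextCommitJob() would force ManifestFileCommitProtocol.commitJob to throw once,
        // writeBatch() would drive FileStreamSink.addBatch for the given batch id.
        def failNextCommitJob(): Unit = ???
        def writeBatch(batchId: Long, data: DataFrame, path: String): Unit = ???

        def run(spark: SparkSession, data: DataFrame, path: String): Unit = {
          failNextCommitJob()
          Try(writeBatch(0L, data, path))   // steps 1-2: files are written, manifest commit fails
          writeBatch(0L, data, path)        // steps 3-4: retry of the same batch id succeeds

          // Step 5: per the Sink.addBatch contract the batch should appear exactly once,
          // but the output directory now contains every record twice.
          val written = spark.read.parquet(path).count()
          assert(written == data.count(), s"expected ${data.count()} rows, got $written")
        }
      }
      {code}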

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Mihaly Toth (misutoth)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
