SPARK-25331: Structured Streaming File Sink duplicates records in case of driver failure


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Let's assume FileStreamSink.addBatch is called, an appropriate job has been started by FileFormatWriter.write, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating FileStreamSink.addBatch will write the data again.

      In other words, if the driver fails after the executors have finished processing the batch but before the commit is recorded, the batch will be written twice on retry.
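
      For illustration, here is a minimal sketch of the exactly-once guard in FileStreamSink.addBatch and where the gap appears. It is not the actual Spark source; latestCommittedBatchId and writeFiles are simplified stand-ins for the sink's metadata log and for FileFormatWriter.write.

      {code:scala}
      import org.apache.spark.sql.DataFrame

      // Simplified sketch, not the real FileStreamSink implementation.
      class FileStreamSinkSketch {
        // stand-in for the sink's metadata log of committed batch ids
        private var latestCommittedBatchId: Long = -1L

        def addBatch(batchId: Long, data: DataFrame): Unit = {
          if (batchId <= latestCommittedBatchId) {
            // Batch is already recorded in the manifest log: skip it.
          } else {
            writeFiles(data)                  // executors write the output files
            latestCommittedBatchId = batchId  // commitJob records the manifest last
            // If the driver dies after writeFiles but before the manifest is
            // recorded, the retried addBatch finds no committed entry and
            // writes the same files again, duplicating the records.
          }
        }

        private def writeFiles(data: DataFrame): Unit = ()
      }
      {code}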

      Steps needed:

      1. Call FileStreamSink.addBatch.
      2. Make the ManifestFileCommitProtocol fail to finish its commitJob.
      3. Call FileStreamSink.addBatch with the same data.
      4. Make the ManifestFileCommitProtocol finish its commitJob successfully.
      5. Verify the file output - according to the Sink.addBatch documentation the data should be written only once.

      I have created a WIP PR with a unit test:
      https://github.com/apache/spark/pull/22331
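
      A rough outline of such a test might look like the following. The helpers failNextCommitJob and writeBatch are hypothetical placeholders (not real Spark test APIs), and reading the result back with spark.read.parquet assumes a parquet output format:

      {code:scala}
      import scala.util.Try
      import org.apache.spark.sql.{DataFrame, SparkSession}

      object DuplicateWriteRepro {
        // Hypothetical placeholders, not real Spark test APIs:
        // failNextCommitJob() would force ManifestFileCommitProtocol.commitJob to throw once,
        // writeBatch() would drive FileStreamSink.addBatch for the given batch id.
        def failNextCommitJob(): Unit = ???
        def writeBatch(batchId: Long, data: DataFrame, path: String): Unit = ???

        def run(spark: SparkSession, data: DataFrame, path: String): Unit = {
          failNextCommitJob()
          Try(writeBatch(0L, data, path))   // steps 1-2: files are written, manifest commit fails
          writeBatch(0L, data, path)        // steps 3-4: retry of the same batch id succeeds

          // Step 5: per the Sink.addBatch contract the batch should appear exactly once,
          // but the output directory now contains every record twice.
          val written = spark.read.parquet(path).count()
          assert(written == data.count(), s"expected ${data.count()} rows, got $written")
        }
      }
      {code}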

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Mihaly Toth (misutoth)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:
