Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.3.1
- Fix Version/s: None
- Component/s: None
Description
Let's assume FileStreamSink.addBatch is called, an appropriate job has been started by FileFormatWriter.write, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating FileStreamSink.addBatch will write the data a second time.
In other words, if the driver fails after the executors have started processing the job, the processed batch will be written twice.
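A minimal sketch of the retry scenario (assuming the Sink trait from org.apache.spark.sql.execution.streaming; the sink instance, batch id and DataFrame are placeholders, and the simulated commit failure stands in for the driver crash):
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Illustrates the failure mode described above: the same batch is offered twice.
def runBatchTwice(sink: Sink, batchId: Long, df: DataFrame): Unit = {
  // first attempt: the tasks write their data files, but committing the batch
  // fails (standing in for a driver crash), so the batch is never recorded
  try sink.addBatch(batchId, df) catch { case _: Exception => }

  // the recovered driver re-offers the same batch; per the Sink contract the data
  // must not be written again, but FileStreamSink writes a second set of files
  sink.addBatch(batchId, df)
}
{code}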
Steps needed:
- Call FileStreamSink.addBatch.
- Make the ManifestFileCommitProtocol fail to finish its commitJob (a rough sketch of such a protocol follows this list).
- Call FileStreamSink.addBatch again with the same data.
- Make the ManifestFileCommitProtocol finish its commitJob successfully.
- Verify the file output: according to the Sink.addBatch documentation, the RDD should be written only once.
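The following sketch shows one way step 2 could be simulated. The subclass and its wiring are hypothetical and only illustrate the reproduction (the actual unit test is in the PR below); it assumes the Spark 2.3.x internals ManifestFileCommitProtocol and FileCommitProtocol.TaskCommitMessage:
{code:scala}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol

// Hypothetical helper: behaves like ManifestFileCommitProtocol, except that the very
// first commitJob throws, simulating a driver that dies after the tasks have written
// their files but before the batch's manifest is committed.
class FailOnceManifestCommitProtocol(jobId: String, path: String)
    extends ManifestFileCommitProtocol(jobId, path) {

  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    if (FailOnceManifestCommitProtocol.shouldFail.compareAndSet(true, false)) {
      throw new RuntimeException("simulated driver failure before the manifest was written")
    }
    super.commitJob(jobContext, taskCommits)
  }
}

object FailOnceManifestCommitProtocol {
  // flips to false after the first (failed) commit, so the retried batch can succeed
  val shouldFail = new java.util.concurrent.atomic.AtomicBoolean(true)
}
{code}
A test along these lines would point the streaming file commit protocol class setting (assumed to be spark.sql.streaming.commitProtocolClass, which defaults to ManifestFileCommitProtocol) at this class, call addBatch twice with the same data as listed above, and assert that each row appears in the output exactly once.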
I have created a WIP PR with a unit test:
https://github.com/apache/spark/pull/22331
Issue Links
- duplicates: SPARK-19633 FileSource read from FileSink (Resolved)