Right now ForeachWriter has the following guarantee: when a batch is re-run, the same (partitionId, epochId) pair will see the same data, so a sink can use (partitionId, epochId) to deduplicate writes.
However, we can easily break this guarantee when a query is restarted and an old batch is re-run (e.g., after upgrading Spark):
- The source may return a DataFrame with a different number of partitions (e.g., if Kafka Source V2 stops creating empty partitions).
- A newly added optimization rule may change the number of partitions in the new run.
- The file split size may change in the new run, producing a different number of partitions.
Since we cannot guarantee that the same (partitionId, epochId) always sees the same data, we should update the documentation of ForeachWriter.
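To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark dependency; `SkippingSink` and `write_epoch` are hypothetical names, mimicking the open/process/close contract of ForeachWriter). A sink that skips any (partitionId, epochId) it has already committed both loses and duplicates rows when the re-run partitions the same data differently:

```python
class SkippingSink:
    """Sink that skips a partition whose (partitionId, epochId) was already committed."""
    def __init__(self):
        self.committed = set()  # (partition_id, epoch_id) pairs fully written
        self.stored = []        # rows actually persisted

    def open(self, partition_id, epoch_id):
        # Returning False tells the writer to skip this partition entirely,
        # which is the documented way to deduplicate on (partitionId, epochId).
        return (partition_id, epoch_id) not in self.committed

    def process(self, row):
        self.stored.append(row)

    def close(self, partition_id, epoch_id):
        self.committed.add((partition_id, epoch_id))


def write_epoch(sink, epoch_id, partitions):
    """partitions: list of row lists, where the list index is the partition id."""
    for pid, rows in enumerate(partitions):
        if sink.open(pid, epoch_id):
            for row in rows:
                sink.process(row)
            sink.close(pid, epoch_id)


sink = SkippingSink()
# First attempt of epoch 0: partition 0 = ["a", "b"] commits,
# then the query crashes before partition 1 = ["c", "d"] ever runs.
write_epoch(sink, 0, [["a", "b"]])
# After a restart, epoch 0 is re-run but the data is partitioned differently.
write_epoch(sink, 0, [["a", "c"], ["b", "d"]])
print(sink.stored)  # ['a', 'b', 'b', 'd']: "c" is lost, "b" is duplicated
```

Because "c" now lands in a partition id that looks committed, it is silently dropped, while "b" in the new partition 1 is written twice. This is exactly why the guarantee only holds if the partitioning of a batch is stable across re-runs.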