Right now ForeachWriter has the following guarantee: when a batch is re-run, the same (partitionId, epochId) pair will see the same data, so a sink can use (partitionId, epochId) to deduplicate writes.
However, we can easily break this guarantee when a query is restarted and an old batch is re-run (e.g., after upgrading Spark):
- The source may return a DataFrame with a different number of partitions (e.g., if Kafka Source V2 stops creating empty partitions).
- A newly added optimization rule may change the number of partitions in the new run.
- The file split size may change in the new run, producing a different number of partitions.
Since we cannot guarantee that the same (partitionId, epochId) always sees the same data, we should update the documentation of ForeachWriter.
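To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark dependency; `SkippingSink` and `write_epoch` are hypothetical names, mimicking the open/process/close contract of ForeachWriter). A sink that skips any (partitionId, epochId) it has already committed both loses and duplicates rows when the re-run partitions the same data differently:

```python
class SkippingSink:
    """Sink that skips a partition whose (partitionId, epochId) was already committed."""
    def __init__(self):
        self.committed = set()  # (partition_id, epoch_id) pairs fully written
        self.stored = []        # rows actually persisted

    def open(self, partition_id, epoch_id):
        # Returning False tells the writer to skip this partition entirely,
        # which is the documented way to deduplicate on (partitionId, epochId).
        return (partition_id, epoch_id) not in self.committed

    def process(self, row):
        self.stored.append(row)

    def close(self, partition_id, epoch_id):
        self.committed.add((partition_id, epoch_id))


def write_epoch(sink, epoch_id, partitions):
    """partitions: list of row lists, where the list index is the partition id."""
    for pid, rows in enumerate(partitions):
        if sink.open(pid, epoch_id):
            for row in rows:
                sink.process(row)
            sink.close(pid, epoch_id)


sink = SkippingSink()
# First attempt of epoch 0: partition 0 = ["a", "b"] commits,
# then the query crashes before partition 1 = ["c", "d"] ever runs.
write_epoch(sink, 0, [["a", "b"]])
# After a restart, epoch 0 is re-run but the data is partitioned differently.
write_epoch(sink, 0, [["a", "c"], ["b", "d"]])
print(sink.stored)  # ['a', 'b', 'b', 'd']: "c" is lost, "b" is duplicated
```

Because "c" now lands in a partition id that looks committed, it is silently dropped, while "b" in the new partition 1 is written twice. This is exactly why the guarantee only holds if the partitioning of a batch is stable across re-runs.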