Spark / SPARK-28650

Fix the guarantee of ForeachWriter


    Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 2.4.5
    • Component/s: Structured Streaming
    • Labels:
      None

      Description

      Right now ForeachWriter has the following guarantee:

      
      If the streaming query is being executed in the micro-batch mode, then every partition
      represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
      Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
      and achieve exactly-once guarantees.

      However, this guarantee can easily be broken when a query is restarted and a batch is re-run (e.g., after upgrading Spark):

      • The source may return a different DataFrame with a different number of partitions (e.g., Kafka Source V2 may start skipping the creation of empty partitions).
      • A newly added optimization rule may change the number of partitions in the new run.
      • The file split size may change in the new run.

      Since we cannot guarantee that the same (partitionId, epochId) always carries the same data, we should update the documentation for "ForeachWriter".
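
      To make the discussion concrete, here is a minimal sketch (plain Python, not Spark code; all names are hypothetical) of the exactly-once pattern the current ForeachWriter documentation suggests: treat (partitionId, epochId) as an idempotency key and skip any partition that was already committed.

      ```python
      class DedupSink:
          """Toy sink illustrating commit deduplication keyed on (partitionId, epochId)."""

          def __init__(self):
              # (partition_id, epoch_id) -> rows committed for that key
              self.committed = {}

          def write_partition(self, partition_id, epoch_id, rows):
              key = (partition_id, epoch_id)
              if key in self.committed:
                  # On a re-run, the documented guarantee assumes the same key
                  # carries the same data, so the duplicate attempt is skipped.
                  return False
              self.committed[key] = list(rows)
              return True

      sink = DedupSink()
      first = sink.write_partition(0, 7, ["a", "b"])    # first attempt commits
      retry = sink.write_partition(0, 7, ["a", "b"])    # re-run is deduplicated
      ```

      The failure modes listed above break exactly this scheme: if a restarted query re-plans the same epoch with a different number of partitions, a row may land under a different partitionId, so the key no longer identifies the same data and deduplication silently drops or duplicates rows.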


              People

              • Assignee: Jungtaek Lim (kabhwan)
              • Reporter: Shixiong Zhu (zsxwing)
              • Votes: 0
              • Watchers: 2
