Spark / SPARK-28650

Fix the guarantee of ForeachWriter


    Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 2.4.5
    • Component/s: Structured Streaming
    • Labels:
      None

      Description

      Right now ForeachWriter has the following guarantee:

      
      If the streaming query is being executed in the micro-batch mode, then every partition
      represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
      Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
      and achieve exactly-once guarantees.

      However, this guarantee can easily be broken when a query is restarted and a batch is re-run (e.g., after upgrading Spark):

      • The source may return a different DataFrame with a different number of partitions (e.g., Kafka Source V2 may start skipping the creation of empty partitions).
      • A newly added optimization rule may change the number of partitions in the new run.
      • The file split size may change in the new run.

      Since we cannot guarantee that the same (partitionId, epochId) always carries the same data, we should update the documentation for "ForeachWriter".
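
      To make the discussion concrete, here is a minimal sketch (plain Python, not Spark code; all names are hypothetical) of the exactly-once pattern the current ForeachWriter documentation suggests: treat (partitionId, epochId) as an idempotency key and skip any partition that was already committed.

      ```python
      class DedupSink:
          """Toy sink illustrating commit deduplication keyed on (partitionId, epochId)."""

          def __init__(self):
              # (partition_id, epoch_id) -> rows committed for that key
              self.committed = {}

          def write_partition(self, partition_id, epoch_id, rows):
              key = (partition_id, epoch_id)
              if key in self.committed:
                  # On a re-run, the documented guarantee assumes the same key
                  # carries the same data, so the duplicate attempt is skipped.
                  return False
              self.committed[key] = list(rows)
              return True

      sink = DedupSink()
      first = sink.write_partition(0, 7, ["a", "b"])    # first attempt commits
      retry = sink.write_partition(0, 7, ["a", "b"])    # re-run is deduplicated
      ```

      The failure modes listed above break exactly this scheme: if a restarted query re-plans the same epoch with a different number of partitions, a row may land under a different partitionId, so the key no longer identifies the same data and deduplication silently drops or duplicates rows.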


              People

              • Assignee: Jungtaek Lim (kabhwan)
              • Reporter: Shixiong Zhu (zsxwing)
              • Votes: 0
              • Watchers: 2
