Spark / SPARK-28650

Fix the guarantee of ForeachWriter


Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.4.3
    • Fix Version: 2.4.5
    • Component: Structured Streaming
    • Labels: None

    Description

      Right now ForeachWriter has the following guarantee:

      
      If the streaming query is being executed in the micro-batch mode, then every partition
      represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
      Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
      and achieve exactly-once guarantees.
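
      The documented pattern can be illustrated with a minimal plain-Python sketch (this is not the Spark `ForeachWriter` API; `DedupSink` and its methods are hypothetical names). It shows a sink that uses (partitionId, epochId) as an idempotency key, so a re-run of an already committed partition is skipped:

```python
# Hypothetical sink illustrating dedup by (partitionId, epochId).
# Not the Spark API; a plain-Python sketch of the documented pattern.
class DedupSink:
    def __init__(self):
        self.committed = set()  # (partitionId, epochId) keys already committed
        self.storage = []       # stands in for the external system

    def write_partition(self, partition_id, epoch_id, rows):
        key = (partition_id, epoch_id)
        if key in self.committed:
            # Batch re-run: skip, assuming the same key carries the same data.
            return False
        self.storage.extend(rows)
        self.committed.add(key)
        return True

sink = DedupSink()
sink.write_partition(0, 1, ["a", "b"])  # first attempt: written
sink.write_partition(0, 1, ["a", "b"])  # re-run of the same key: skipped
```

      The skip-on-seen step is exactly where the pattern relies on the guarantee that a re-run of (partitionId, epochId) carries identical data.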
      
      

       

      However, this guarantee can easily be broken when a query is restarted and a batch is re-run (e.g., after a Spark upgrade):

      • The source may return a DataFrame with a different number of partitions (e.g., if we stop creating empty partitions in Kafka Source V2).
      • A newly added optimization rule may change the number of partitions in the new run.
      • The file split size may change in the new run.

      Since we cannot guarantee that the same (partitionId, epochId) always corresponds to the same data, we should update the documentation for "ForeachWriter".


      People

        Assignee: kabhwan Jungtaek Lim
        Reporter: zsxwing Shixiong Zhu
