SPARK-27086

DataSourceV2 MicroBatchExecution commits last batch only if new batch is constructed

    Details

      Description

      I wanted to use the new DataSourceV2 API to build an AWS SQS streaming data source, using the new commit method of the MicroBatchReader to finally acknowledge (i.e. delete) a message in SQS after it has been processed. If processing fails and the messages are therefore not committed, they automatically reappear in SQS after the visibility timeout, which is exactly the intended behaviour and requires no special state store or checkpointing.
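      For illustration, a minimal sketch of the commit-side bookkeeping such a source would need is shown below. The class and method names are hypothetical; only AmazonSQS.deleteMessage is the real AWS SDK call:
{code:scala}
import scala.collection.mutable
import com.amazonaws.services.sqs.AmazonSQS

// Hypothetical helper an SQS MicroBatchReader could delegate to: it tracks
// the receipt handles delivered in each batch so that commit(end) can
// acknowledge (delete) the corresponding messages in SQS.
class SqsBatchTracker(sqs: AmazonSQS, queueUrl: String) {
  // receipt handles per batch, keyed by the batch's end offset
  private val pending = mutable.TreeMap.empty[Long, Seq[String]]

  def recordBatch(endOffset: Long, receiptHandles: Seq[String]): Unit =
    pending(endOffset) = receiptHandles

  // Called from MicroBatchReader.commit(end): delete every message belonging
  // to a batch at or before endOffset, so SQS will not redeliver it after
  // the visibility timeout expires.
  def commitUpTo(endOffset: Long): Unit = {
    val done = pending.takeWhile { case (off, _) => off <= endOffset }
    done.foreach { case (off, handles) =>
      handles.foreach(h => sqs.deleteMessage(queueUrl, h))
      pending -= off
    }
  }
}
{code}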
      Sadly, I noticed that an offset in the MicroBatchReader only gets committed when a new batch is constructed (see line 400 in MicroBatchExecution), which is quite strange. In my SQS example in particular, there could be a long break after a first batch of messages before new messages are sent to SQS. The visibility timeout would then expire and the SQS messages from the previous batch would reappear, because they were processed but never committed. Therefore, I would strongly recommend committing an offset as soon as its batch has been processed. Committing should be independent of the next batch!
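      To make the recommendation concrete: the micro-batch loop should commit an offset right after its batch has been processed, roughly like this (a pseudocode sketch, not the actual MicroBatchExecution source):
{code:scala}
// Desired ordering of the streaming loop (pseudocode):
while (isActive) {
  val end = constructNextBatch() // may block for a long time when no new data arrives
  runBatch(end)                  // process the batch
  reader.commit(end)             // commit immediately after processing,
                                 // not at the start of the *next* batch
}
{code}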

        Attachments

          Activity

            People

            • Assignee: Unassigned
            • Reporter: Sebastian Herold (sebastianherold)
            • Votes: 0
            • Watchers: 3
