Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27086

DataSourceV2 MicroBatchExecution commits last batch only if new batch is constructed

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      I wanted to use the new DataSourceV2 API to build a AWS SQS streaming data source which offers the new commit method of the MicroBatchReader to finally commit the message at SQS after it has been processed. If the processing of messages would fail and they got not committed, after a timeout the message would automatically reappear in SQS which is the intended behaviour without using special state storing or checkpointing.
      Sadly, I noticed that an offset in the MicroBatchReader got only committed if a new batch is constructed (see line 400 in MicroBatchExecution) which is quite strange. Especially, in my SQS example it could happen that after a first batch of messages this there is a long break before new messages are send to SQS. This would lead to a timeout and reappearance of the SQS messages from the previous batch, because they got processed, but not committed. Therefore, I would strongly recommend to commit an offset, once the batch has been processed! The committing should be independent from the next batch!

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            sebastianherold Sebastian Herold
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment