Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-28474

ChannelStateWriteResult may not fail after checkpoint abort

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      After Checkpoint abort, ChannelStateWriteResult should fail.

      But if channelStateWriter.start(id, checkpointOptions); is executed after Checkpoint abort, ChannelStateWriteResult will not fail.

       

      Cause Analysis:

      When abort checkpoint, channelStateWriter.start(id, checkpointOptions); may not be executed yet. These checkpointIds will be stored in the abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it will check if the checkpointId should be aborted.

      ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here.

      The unit test can reproduce this bug.

       

      Note: channelStateWriter.abort is only called in notifyCheckpointAborted, it doesn't account for channelStateWriter.start after notifyCheckpointAborted.

      JIRA: FLINK-17869

      commit: https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e

       

      The bug will affect the new feature FLINK-26803, because the channel state file can be closed only after the Checkpoints of all tasks of the shared file are complete or abort. So when the checkpoint of some tasks fails, if abort is not called, the file cannot be closed and all tasks sharing the file cannot execute inputChannelStateHandles.completeExceptionally(e); and resultSubpartitionStateHandles.completeExceptionally(e); , AsyncCheckpointRunnable will wait forever.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            fanrui Rui Fan
            fanrui Rui Fan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment