  1. Flink
  2. FLINK-28474

ChannelStateWriteResult may not fail after checkpoint abort

Details

    Description

      After a checkpoint is aborted, its ChannelStateWriteResult should fail.

      However, if channelStateWriter.start(id, checkpointOptions); is executed after the checkpoint has been aborted, the ChannelStateWriteResult never fails.

       

      Cause Analysis:

      When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); may not have been executed yet. The IDs of such checkpoints are stored in abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it checks whether the checkpointId should be aborted.

      ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here.
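The ordering problem described above can be illustrated with a minimal, self-contained sketch. It does not use the real Flink classes: WriterSketch, CoordinatorSketch, startChannelStateWrite and the abortWriterInCheckpointState flag are hypothetical stand-ins, and the bookkeeping is reduced to a map of futures. The flag only exists to compare the behavior with and without the extra abort call proposed in this issue.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for ChannelStateWriter; not the real Flink class.
class WriterSketch {
    private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

    // Registers a pending write result for the checkpoint.
    CompletableFuture<Void> start(long checkpointId) {
        return results.computeIfAbsent(checkpointId, id -> new CompletableFuture<>());
    }

    // Fails the write result if start() already ran; otherwise does nothing.
    void abort(long checkpointId, Throwable cause, boolean cleanup) {
        CompletableFuture<Void> result =
                cleanup ? results.remove(checkpointId) : results.get(checkpointId);
        if (result != null) {
            result.completeExceptionally(cause);
        }
    }
}

// Hypothetical stand-in for SubtaskCheckpointCoordinatorImpl.
class CoordinatorSketch {
    private final WriterSketch channelStateWriter = new WriterSketch();
    private final Set<Long> abortedCheckpointIds = new HashSet<>();
    private final boolean abortWriterInCheckpointState; // assumption: toggles the proposed fix

    CoordinatorSketch(boolean abortWriterInCheckpointState) {
        this.abortWriterInCheckpointState = abortWriterInCheckpointState;
    }

    // The abort notification may arrive before start() was ever called.
    void notifyCheckpointAborted(long checkpointId) {
        abortedCheckpointIds.add(checkpointId);
        channelStateWriter.abort(checkpointId, new RuntimeException("checkpoint aborted"), true);
    }

    // Models the late channelStateWriter.start(id, checkpointOptions) call.
    CompletableFuture<Void> startChannelStateWrite(long checkpointId) {
        return channelStateWriter.start(checkpointId);
    }

    // Checks the aborted set; with the fix it also aborts the writer.
    void checkpointState(long checkpointId) {
        if (abortedCheckpointIds.remove(checkpointId)) {
            if (abortWriterInCheckpointState) {
                channelStateWriter.abort(
                        checkpointId, new RuntimeException("checkpoint aborted"), true);
            }
            return;
        }
        // The normal snapshot path is omitted in this sketch.
    }
}

class LateStartAbortDemo {
    // Replays the problematic order: abort notification, then start, then checkpointState.
    static CompletableFuture<Void> run(boolean withFix) {
        CoordinatorSketch coordinator = new CoordinatorSketch(withFix);
        coordinator.notifyCheckpointAborted(1L);
        CompletableFuture<Void> result = coordinator.startChannelStateWrite(1L);
        coordinator.checkpointState(1L);
        return result;
    }

    public static void main(String[] args) {
        System.out.println("without fix, result done: " + run(false).isDone());
        System.out.println("with fix, result failed: " + run(true).isCompletedExceptionally());
    }
}
```

In this model the first abort is a no-op because nothing has been started yet, so without the second abort in checkpointState the result stays pending indefinitely.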

      The unit test can reproduce this bug.

       

      Note: channelStateWriter.abort is only called in notifyCheckpointAborted, which does not account for channelStateWriter.start being executed after notifyCheckpointAborted.

      JIRA: FLINK-17869

      commit: https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e

       

      This bug affects the new feature FLINK-26803, because a shared channel state file can be closed only after the checkpoints of all tasks sharing the file have completed or been aborted. If the checkpoint of some task fails and abort is not called, the file cannot be closed, and none of the tasks sharing the file can execute inputChannelStateHandles.completeExceptionally(e); and resultSubpartitionStateHandles.completeExceptionally(e);. As a result, AsyncCheckpointRunnable waits forever.
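The effect on a shared file can be sketched in the same spirit. This is only a rough model under stated assumptions, not the FLINK-26803 design: SharedChannelStateFileSketch, register, complete and abort are hypothetical names, each subtask is assumed to report exactly once, and a single future per subtask stands in for inputChannelStateHandles and resultSubpartitionStateHandles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical model of one channel state file shared by several subtasks.
class SharedChannelStateFileSketch {
    private final Map<String, CompletableFuture<String>> handles = new HashMap<>();
    private int pending;       // subtasks that have neither completed nor aborted
    private Throwable failure; // first abort cause, if any

    // Each subtask sharing the file registers and receives a future for its handle.
    CompletableFuture<String> register(String subtask) {
        pending++;
        CompletableFuture<String> handle = new CompletableFuture<>();
        handles.put(subtask, handle);
        return handle;
    }

    // The subtask finished writing its channel state.
    void complete(String subtask) {
        finishOne();
    }

    // The subtask's checkpoint was aborted.
    void abort(String subtask, Throwable cause) {
        if (failure == null) {
            failure = cause;
        }
        finishOne();
    }

    // The file is closed only when every subtask has reported; then all futures resolve.
    private void finishOne() {
        if (--pending > 0) {
            return; // file still in use by another subtask
        }
        for (Map.Entry<String, CompletableFuture<String>> entry : handles.entrySet()) {
            if (failure != null) {
                entry.getValue().completeExceptionally(failure);
            } else {
                entry.getValue().complete("handle-for-" + entry.getKey());
            }
        }
    }
}

class SharedFileDemo {
    public static void main(String[] args) {
        SharedChannelStateFileSketch file = new SharedChannelStateFileSketch();
        CompletableFuture<String> handleA = file.register("subtask-A");
        file.register("subtask-B");

        file.complete("subtask-A");
        // subtask-B's checkpoint failed, but abort has not been called yet.
        System.out.println("before abort, A done: " + handleA.isDone());

        file.abort("subtask-B", new RuntimeException("checkpoint aborted"));
        System.out.println("after abort, A failed: " + handleA.isCompletedExceptionally());
    }
}
```

Until abort is reported for subtask-B, the future of subtask-A stays pending, which corresponds to the AsyncCheckpointRunnable waiting forever.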

            People

              fanrui Rui Fan