Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.14.5, 1.15.1
Description
After a checkpoint is aborted, its ChannelStateWriteResult should fail.
But if channelStateWriter.start(id, checkpointOptions); is executed after the checkpoint has been aborted, the ChannelStateWriteResult will not fail.
Cause Analysis:
When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); may not have been executed yet. The IDs of such checkpoints are stored in the abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it checks whether the checkpointId should be aborted.
ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here.
The unit test can reproduce this bug.
Note: channelStateWriter.abort is only called in notifyCheckpointAborted; it does not account for channelStateWriter.start being called after notifyCheckpointAborted.
JIRA: FLINK-17869
commit: https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e
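The race described in the cause analysis can be illustrated with a small stand-alone model. This is only a sketch, not Flink code: SketchWriter, SketchCoordinator, and the abortWriterOnLateStart flag are hypothetical names, and the model assumes that abort is a no-op for a checkpoint whose result has not been registered by start yet.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class AbortBeforeStartSketch {

    /** Hypothetical stand-in for ChannelStateWriter: one result future per checkpoint. */
    static class SketchWriter {
        final Map<Long, CompletableFuture<Void>> results = new ConcurrentHashMap<>();

        void start(long checkpointId) {
            results.putIfAbsent(checkpointId, new CompletableFuture<>());
        }

        /** Assumption: abort can only fail a result that start() already registered. */
        void abort(long checkpointId, Throwable cause) {
            CompletableFuture<Void> result = results.get(checkpointId);
            if (result != null) {
                result.completeExceptionally(cause);
            }
        }
    }

    /** Hypothetical stand-in for SubtaskCheckpointCoordinatorImpl. */
    static class SketchCoordinator {
        final SketchWriter writer = new SketchWriter();
        final Set<Long> abortedCheckpointIds = new HashSet<>();
        final boolean abortWriterOnLateStart;

        SketchCoordinator(boolean abortWriterOnLateStart) {
            this.abortWriterOnLateStart = abortWriterOnLateStart;
        }

        void notifyCheckpointAborted(long checkpointId) {
            abortedCheckpointIds.add(checkpointId);
            // Has no effect if start() has not run yet for this checkpoint.
            writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
        }

        void checkpointState(long checkpointId) {
            // In this model, start() runs only now, after the abort notification.
            writer.start(checkpointId);
            if (abortedCheckpointIds.remove(checkpointId)) {
                if (abortWriterOnLateStart) {
                    // The additional abort proposed in the description.
                    writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
                }
                return;
            }
            // Normal checkpoint path omitted.
        }
    }

    /** Returns whether the write result failed when start() came after the abort. */
    static boolean resultFailedAfterLateStart(boolean abortWriterOnLateStart) {
        SketchCoordinator coordinator = new SketchCoordinator(abortWriterOnLateStart);
        coordinator.notifyCheckpointAborted(1L);
        coordinator.checkpointState(1L);
        return coordinator.writer.results.get(1L).isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("without extra abort: " + resultFailedAfterLateStart(false)); // false
        System.out.println("with extra abort: " + resultFailedAfterLateStart(true)); // true
    }
}
```

In this model the result future stays open forever unless checkpointState also aborts the writer for a checkpoint it finds in abortedCheckpointIds.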
This bug affects the new feature FLINK-26803, because a channel state file can be closed only after the checkpoints of all tasks sharing the file have completed or been aborted. So when the checkpoint of some tasks fails and abort is not called, the file cannot be closed, none of the tasks sharing the file can execute inputChannelStateHandles.completeExceptionally(e); and resultSubpartitionStateHandles.completeExceptionally(e);, and AsyncCheckpointRunnable waits forever.
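The dependency between tasks that share one file can be sketched in the same way. This is a hypothetical model rather than the FLINK-26803 implementation: SharedFile, its methods, and the "closed" placeholder value are assumptions made for illustration. It only shows that the handles future stays unresolved until every sharing subtask has reported either completion or abort.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class SharedChannelStateFileSketch {

    /** Hypothetical shared file that closes once all subtasks completed or aborted. */
    static class SharedFile {
        private final Set<Integer> pendingSubtasks = new HashSet<>();
        private Throwable failure;
        // Stands in for the state handle futures the async checkpoint phase waits on.
        final CompletableFuture<String> handles = new CompletableFuture<>();

        SharedFile(int subtaskCount) {
            for (int i = 0; i < subtaskCount; i++) {
                pendingSubtasks.add(i);
            }
        }

        void complete(int subtask) {
            finish(subtask, null);
        }

        void abort(int subtask, Throwable cause) {
            finish(subtask, cause);
        }

        private void finish(int subtask, Throwable cause) {
            if (!pendingSubtasks.remove(subtask)) {
                return; // unknown or already finished subtask
            }
            if (cause != null && failure == null) {
                failure = cause;
            }
            if (pendingSubtasks.isEmpty()) {
                // Only now can the file be closed and the waiting futures resolved.
                if (failure != null) {
                    handles.completeExceptionally(failure);
                } else {
                    handles.complete("closed");
                }
            }
        }
    }

    /** Subtask 0 completes; subtask 1 fails and may or may not report the abort. */
    static boolean handlesResolved(boolean abortReported) {
        SharedFile file = new SharedFile(2);
        file.complete(0);
        if (abortReported) {
            file.abort(1, new RuntimeException("checkpoint aborted"));
        }
        return file.handles.isDone();
    }

    public static void main(String[] args) {
        System.out.println("abort reported: " + handlesResolved(true)); // true
        System.out.println("abort missing: " + handlesResolved(false)); // false
    }
}
```

When the abort is never reported, anything that blocks on handles in this model would wait indefinitely, which mirrors the AsyncCheckpointRunnable behaviour described above.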
Issue Links
- is related to FLINK-26803 Merge small ChannelState file for Unaligned Checkpoint (Closed)