Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 1.9.3, 1.10.2, 1.11.3, 1.12.0
Description
After FLINK-12364, no matter how many times the asynchronous part of a checkpoint fails on a task, the job itself does not fail by default:
Default behavior | Flink-1.5 -> Flink-1.8 | Flink-1.9 -> Flink-1.12 |
---|---|---|
Synchronous part of checkpoint at task failed | Job failed | Job failed |
Asynchronous part of checkpoint at task failed | Job failed | Job would not fail |
The cause is that StreamTask uses a plain Exception instead of a CheckpointException in the decline message when the asynchronous part fails. As a result, the checkpoint coordinator calls failPendingCheckpointDueToTaskFailure(pendingCheckpoint, CheckpointFailureReason.JOB_FAILURE, cause, executionAttemptID) to process the declined checkpoint:
    if (cause == null) {
        failPendingCheckpointDueToTaskFailure(pendingCheckpoint, CheckpointFailureReason.CHECKPOINT_DECLINED, executionAttemptID);
    } else if (cause instanceof CheckpointException) {
        CheckpointException exception = (CheckpointException) cause;
        failPendingCheckpointDueToTaskFailure(pendingCheckpoint, exception.getCheckpointFailureReason(), cause, executionAttemptID);
    } else {
        failPendingCheckpointDueToTaskFailure(pendingCheckpoint, CheckpointFailureReason.JOB_FAILURE, cause, executionAttemptID);
    }
However, CheckpointFailureManager ignores the JOB_FAILURE reason and does not count this failed checkpoint, so an asynchronous checkpoint failure no longer fails the job.
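To make the interaction easier to follow, here is a minimal, self-contained sketch of the behavior described above. It does not use the real Flink classes: Reason, SketchCheckpointException, SketchFailureManager, classify and jobFailsAfter are hypothetical stand-ins invented for illustration. The only facts taken from this report are the three-way branch on the decline cause and that JOB_FAILURE is ignored when counting; the tolerable-failure threshold of 0 and the choice to count CHECKPOINT_DECLINED are assumptions.

```java
import java.util.EnumSet;
import java.util.Set;

class AsyncCheckpointFailureSketch {

    // Hypothetical, reduced stand-in for CheckpointFailureReason.
    enum Reason { CHECKPOINT_DECLINED, JOB_FAILURE }

    // Hypothetical stand-in for CheckpointException: carries a failure reason.
    static class SketchCheckpointException extends Exception {
        final Reason reason;

        SketchCheckpointException(Reason reason, Throwable cause) {
            super(cause);
            this.reason = reason;
        }
    }

    // Mirrors the three branches quoted in the report.
    static Reason classify(Throwable cause) {
        if (cause == null) {
            return Reason.CHECKPOINT_DECLINED;
        } else if (cause instanceof SketchCheckpointException) {
            return ((SketchCheckpointException) cause).reason;
        } else {
            // Any other exception type is reported as JOB_FAILURE.
            return Reason.JOB_FAILURE;
        }
    }

    // Hypothetical stand-in for CheckpointFailureManager.
    static class SketchFailureManager {
        // From the report: JOB_FAILURE is ignored and not counted.
        private static final Set<Reason> IGNORED = EnumSet.of(Reason.JOB_FAILURE);
        private final int tolerableFailures; // assumption: 0 tolerated by default
        private int counted;
        private boolean jobFailed;

        SketchFailureManager(int tolerableFailures) {
            this.tolerableFailures = tolerableFailures;
        }

        void onDeclined(Throwable cause) {
            if (IGNORED.contains(classify(cause))) {
                return; // never counted, so the job can never fail here
            }
            counted++;
            if (counted > tolerableFailures) {
                jobFailed = true;
            }
        }

        boolean jobFailed() {
            return jobFailed;
        }
    }

    // Feeds one declined checkpoint to a fresh manager and reports the outcome.
    static boolean jobFailsAfter(Throwable cause) {
        SketchFailureManager manager = new SketchFailureManager(0);
        manager.onDeclined(cause);
        return manager.jobFailed();
    }

    public static void main(String[] args) {
        RuntimeException asyncError = new RuntimeException("async snapshot failed");
        // Plain Exception: classified as JOB_FAILURE and ignored.
        System.out.println(jobFailsAfter(new Exception(asyncError))); // false
        // Reason-carrying exception: counted, so the job fails.
        System.out.println(jobFailsAfter(
                new SketchCheckpointException(Reason.CHECKPOINT_DECLINED, asyncError))); // true
    }
}
```

In this model the only difference between the two calls is the exception type of the decline cause, which matches the report's diagnosis: a plain Exception falls into the last branch and is never counted.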
FLINK-16753 corrected the misleading JOB_FAILURE message, but an asynchronous checkpoint failure still cannot fail the job.
As this bug has existed for so long, I decided to set its priority to critical rather than blocker.
Issue Links
- causes
  - FLINK-21215 Checkpoint was declined because one input stream is finished (Closed)
  - FLINK-21244 UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnLocalAndRemoteChannel Fail (Closed)
- is caused by
  - FLINK-12364 Introduce a CheckpointFailureManager to centralized manage checkpoint failure (Closed)
- links to