Since FLINK-12364, no matter how many times the asynchronous part of a checkpoint fails on a task, the job itself does not fail by default:
||Default behavior||Flink-1.5 -> Flink-1.8||Flink-1.9 -> Flink-1.12||
|Synchronous part of checkpoint fails on task|Job fails|Job fails|
|Asynchronous part of checkpoint fails on task|Job fails|Job does not fail|
The cause is that StreamTask reports a failure of the async part with a plain Exception instead of a CheckpointException in its decline message. As a result, the checkpoint coordinator calls failPendingCheckpointDueToTaskFailure(pendingCheckpoint, CheckpointFailureReason.JOB_FAILURE, cause, executionAttemptID) to process the declined checkpoint.
However, CheckpointFailureManager ignores the JOB_FAILURE reason and does not count this failed checkpoint, so a failure in the asynchronous part of a checkpoint no longer fails the job.
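To make the chain of events easier to follow, here is a minimal, self-contained sketch of the behavior described above. It is a toy model and not Flink's actual code: the class `FailureCounterSketch`, the stand-in `SketchCheckpointException`, the two-value `Reason` enum, and the method names are all hypothetical, and the mapping of a non-checkpoint exception to JOB_FAILURE is modeled only as this description states it.

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model of the described behavior; NOT Flink's real implementation.
class FailureCounterSketch {

    // Hypothetical subset of failure reasons, for illustration only.
    enum Reason { CHECKPOINT_DECLINED, JOB_FAILURE }

    // Hypothetical stand-in for a checkpoint-specific exception type.
    static class SketchCheckpointException extends Exception {
        SketchCheckpointException(String message) {
            super(message);
        }
    }

    // Assumption: reasons in this set are skipped by the failure counter.
    private static final Set<Reason> IGNORED = EnumSet.of(Reason.JOB_FAILURE);

    private final int tolerableFailures;
    private int countedFailures = 0;

    FailureCounterSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // Models the coordinator side: a decline carrying a plain exception
    // is classified as JOB_FAILURE instead of a checkpoint-level reason.
    static Reason reasonFor(Throwable declineCause) {
        return (declineCause instanceof SketchCheckpointException)
                ? Reason.CHECKPOINT_DECLINED
                : Reason.JOB_FAILURE;
    }

    // Returns true if the job should be failed after this checkpoint failure.
    boolean onCheckpointFailure(Reason reason) {
        if (IGNORED.contains(reason)) {
            return false; // ignored reasons are never counted
        }
        countedFailures++;
        return countedFailures > tolerableFailures;
    }

    public static void main(String[] args) {
        FailureCounterSketch manager = new FailureCounterSketch(0);
        // The async part fails three times, each reported as a plain exception.
        for (int i = 0; i < 3; i++) {
            Reason reason = reasonFor(new RuntimeException("async part failed"));
            System.out.println(reason + " -> fail job: " + manager.onCheckpointFailure(reason));
        }
    }
}
```

In this model, even with zero tolerable failures, a decline that carries a plain exception never reaches the counter, whereas the same failure wrapped in the checkpoint-specific exception type would fail the job on the first occurrence.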
FLINK-16753 corrected the misleading JOB_FAILURE message, but an asynchronous checkpoint failure still cannot fail the job.
As this bug has existed for a long time, I have decided to set its priority to critical instead of blocker.