Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.15.3, 1.16.2, 1.17.1
Description
Currently, if an error occurs while saving a completed checkpoint in the CompletedCheckpointStore, CheckpointCoordinator does not call CheckpointFailureManager to handle the error. As a result, errors from CompletedCheckpointStore do not increase the failed checkpoint count, and the 'execution.checkpointing.tolerable-failed-checkpoints' option does not limit the number of such errors in any way.
A possible solution is to notify CheckpointFailureManager about a successful checkpoint only after the completed checkpoint has been stored in the CompletedCheckpointStore, and to pass the exception to CheckpointFailureManager in the CheckpointCoordinator#addCompletedCheckpointToStoreAndSubsumeOldest() method.
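The proposed ordering can be illustrated with a minimal, self-contained sketch. It does not use the real Flink classes: FailureManager, CompletedStore, and Coordinator below are hypothetical stand-ins, and the sketch assumes that the tolerable limit applies to consecutive failures. The point it shows is only the control flow: store first, report the exception on failure, and report success only after the store call has returned.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public final class CheckpointFailureSketch {

    /** Hypothetical stand-in for a failure manager; counts consecutive failures. */
    static final class FailureManager {
        private final int tolerableFailures;
        private int continuousFailures;

        FailureManager(int tolerableFailures) {
            this.tolerableFailures = tolerableFailures;
        }

        void handleCheckpointSuccess() {
            continuousFailures = 0; // a success resets the consecutive count
        }

        void handleCheckpointException(Exception cause) {
            continuousFailures++;
            if (continuousFailures > tolerableFailures) {
                throw new IllegalStateException("Exceeded tolerable failed checkpoints", cause);
            }
        }

        int failureCount() {
            return continuousFailures;
        }
    }

    /** Hypothetical stand-in for a completed checkpoint store that can be made to fail. */
    static final class CompletedStore {
        private final Deque<Long> retained = new ArrayDeque<>();
        private boolean failing;

        void setFailing(boolean failing) {
            this.failing = failing;
        }

        void add(long checkpointId) throws Exception {
            if (failing) {
                throw new Exception("Could not persist checkpoint " + checkpointId);
            }
            retained.addLast(checkpointId);
        }

        int size() {
            return retained.size();
        }
    }

    /** Hypothetical coordinator showing the proposed call order. */
    static final class Coordinator {
        private final CompletedStore store;
        private final FailureManager failureManager;

        Coordinator(CompletedStore store, FailureManager failureManager) {
            this.store = store;
            this.failureManager = failureManager;
        }

        boolean completeCheckpoint(long checkpointId) {
            try {
                store.add(checkpointId);
            } catch (Exception e) {
                // Store errors now count towards the tolerable limit.
                failureManager.handleCheckpointException(e);
                return false;
            }
            // Success is reported only after the checkpoint is safely stored.
            failureManager.handleCheckpointSuccess();
            return true;
        }
    }

    public static void main(String[] args) {
        FailureManager manager = new FailureManager(2);
        CompletedStore store = new CompletedStore();
        Coordinator coordinator = new Coordinator(store, manager);

        coordinator.completeCheckpoint(1L);
        store.setFailing(true);
        coordinator.completeCheckpoint(2L);
        System.out.println("Failures counted: " + manager.failureCount());
    }
}
```

With this ordering, a store that keeps failing eventually exceeds the limit in the sketch's FailureManager, whereas reporting success before the store call would reset the count on every checkpoint and hide the errors.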