Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.11.3, 1.12.2, 1.13.0
Description
Currently, the JobMaster removes all checkpoints after a job reaches a globally terminal state. It then notifies the Dispatcher about the termination of the job, and the Dispatcher removes the job from the SubmittedJobGraphStore. If the Dispatcher process fails before doing so, it might get restarted. In this case, the restarted Dispatcher would still find the job in the SubmittedJobGraphStore and recover it. Since the CompletedCheckpointStore is empty, it would start executing the job from the beginning.
I think we must not remove job state before the job has been marked as done or made inaccessible to any restarted processes. Concretely, we should first remove the job from the SubmittedJobGraphStore and only then delete the checkpoints. Ideally, all job-related cleanup operations happen atomically.
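To illustrate the ordering argument, here is a minimal sketch using a hypothetical in-memory model. The names `CleanupOrderSketch`, `Stores`, `cleanup`, and `restartsFromScratch` are made up for this example and are not Flink's actual classes or APIs; the two collections only stand in for the SubmittedJobGraphStore and the CompletedCheckpointStore. A simulated process failure between the two cleanup steps shows which order leaves the job recoverable without checkpoints.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the cleanup order; NOT Flink's real classes.
class CleanupOrderSketch {

    /** Stand-ins for the SubmittedJobGraphStore and the CompletedCheckpointStore. */
    static final class Stores {
        final Set<String> jobGraphs = new HashSet<>();
        final Map<String, List<Long>> checkpoints = new HashMap<>();

        void submit(String jobId, long... checkpointIds) {
            jobGraphs.add(jobId);
            List<Long> ids = new ArrayList<>();
            for (long id : checkpointIds) {
                ids.add(id);
            }
            checkpoints.put(jobId, ids);
        }
    }

    /**
     * Runs the two cleanup steps in the chosen order. If crashAfterFirstStep is
     * true, the method returns after the first step to simulate a process failure.
     */
    static void cleanup(Stores stores, String jobId,
                        boolean jobGraphFirst, boolean crashAfterFirstStep) {
        Runnable removeJobGraph = () -> stores.jobGraphs.remove(jobId);
        Runnable removeCheckpoints = () -> stores.checkpoints.remove(jobId);

        Runnable first = jobGraphFirst ? removeJobGraph : removeCheckpoints;
        Runnable second = jobGraphFirst ? removeCheckpoints : removeJobGraph;

        first.run();
        if (crashAfterFirstStep) {
            return; // simulated failure between the two steps
        }
        second.run();
    }

    /** True if a restarted process would recover the job but find no checkpoint. */
    static boolean restartsFromScratch(Stores stores, String jobId) {
        boolean recoverable = stores.jobGraphs.contains(jobId);
        List<Long> ids = stores.checkpoints.get(jobId);
        boolean hasCheckpoints = ids != null && !ids.isEmpty();
        return recoverable && !hasCheckpoints;
    }

    public static void main(String[] args) {
        Stores current = new Stores();
        current.submit("job-1", 1L, 2L);
        cleanup(current, "job-1", false, true); // checkpoints first, then crash
        System.out.println("checkpoints first: " + restartsFromScratch(current, "job-1")); // true

        Stores proposed = new Stores();
        proposed.submit("job-1", 1L, 2L);
        cleanup(proposed, "job-1", true, true); // job graph first, then crash
        System.out.println("job graph first:   " + restartsFromScratch(proposed, "job-1")); // false
    }
}
```

In this model, the proposed order can leave orphaned checkpoints behind after a failure, but it never leaves a recoverable job without its checkpoints. Truly atomic cleanup would avoid both outcomes and is not modeled here.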
Issue Links
- is related to
  - FLINK-10333 Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints) (Open)
  - FLINK-21928 DuplicateJobSubmissionException after JobManager failover (Closed)
  - FLINK-11813 Standby per job mode Dispatchers don't know job's JobSchedulingStatus (Closed)
- relates to
  - FLINK-19816 Flink restored from a wrong checkpoint (a very old one and not the last completed one) (Closed)
- links to