Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Description
When a standby job manager recovers a job and recovery of the checkpoint state or of the job itself fails, the job manager may eventually remove the job once all retries are exhausted. This also removes the job and checkpoint state from ZooKeeper and the state backend, making it impossible to ever recover the job again.
We should never exhaust job retries in the HA case.
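A minimal sketch of the intended behavior, not the actual fix: with HA enabled, a failed recovery is retried and state is never removed, while without HA the retry limit still applies. All names here (`HaRecoverySketch`, `recover`, `Outcome`, `simulatedAttempts`) are hypothetical and do not correspond to real job manager classes; `simulatedAttempts` exists only so the demonstration terminates.

```java
import java.util.function.BooleanSupplier;

public class HaRecoverySketch {

    /** Hypothetical outcomes of a recovery loop, for illustration only. */
    enum Outcome { RECOVERED, STATE_REMOVED, STILL_RETRYING }

    /**
     * Runs up to simulatedAttempts recovery attempts.
     *
     * @param attempt           returns true if recovery succeeded (assumed callback)
     * @param haEnabled         whether high availability is configured
     * @param maxRetries        retry limit, applied only when HA is disabled
     * @param simulatedAttempts bound on the loop so the sketch terminates
     */
    static Outcome recover(BooleanSupplier attempt, boolean haEnabled,
                           int maxRetries, int simulatedAttempts) {
        for (int attemptsMade = 1; attemptsMade <= simulatedAttempts; attemptsMade++) {
            if (attempt.getAsBoolean()) {
                return Outcome.RECOVERED;
            }
            // One initial attempt plus maxRetries retries are allowed.
            boolean retriesExhausted = attemptsMade > maxRetries;
            if (!haEnabled && retriesExhausted) {
                // Non-HA: give up; job and checkpoint state would be cleaned up here.
                return Outcome.STATE_REMOVED;
            }
            // HA: never give up, so the persisted state is never removed.
        }
        return Outcome.STILL_RETRYING;
    }
}
```

With a recovery that fails five times before succeeding and `maxRetries = 3`, the non-HA call gives up and removes state, whereas the HA call keeps retrying and eventually recovers.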