Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I have run into the following problem with pulse updates. To reproduce:
1. Create an update with a pulse timeout of 1h.
2. Send a pulse to get the update going.
3. Fail over the scheduler immediately afterwards.
4. Observe that the update is awaiting another pulse right after the failover, even though the 1h pulse timeout has not elapsed.
This happens because JobUpdateControllerImpl keeps pulse history and state only in memory, in PulseHandler. On scheduler startup, the pulse state is reset to "no pulse received", so the pulse sent before the failover is lost.
We can solve this by inferring the timestamp of the last pulse from the job update events on startup.
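The proposed fix could be sketched roughly as follows. This is an illustration only, written in Python for brevity; it is not the scheduler's Java code and not necessarily how the fix was implemented. The `UpdateEvent` type and both function names are hypothetical, the event model is simplified to a (status, timestamp) pair, and the sketch assumes that a transition from an awaiting-pulse status to an active status in the event history means a pulse arrived at that moment.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Status names follow the job update status values; the grouping is an assumption.
AWAITING_PULSE = {"ROLL_FORWARD_AWAITING_PULSE", "ROLL_BACK_AWAITING_PULSE"}
ACTIVE = {"ROLLING_FORWARD", "ROLLING_BACK"}


@dataclass(frozen=True)
class UpdateEvent:
    """Hypothetical, simplified view of a stored job update event."""
    status: str
    timestamp_ms: int


def infer_last_pulse_ms(events: Sequence[UpdateEvent]) -> Optional[int]:
    """Return the timestamp of the latest awaiting-pulse -> active transition.

    Events are assumed to be ordered oldest first. Returns None when the
    history contains no such transition.
    """
    last_pulse = None
    for prev, cur in zip(events, events[1:]):
        if prev.status in AWAITING_PULSE and cur.status in ACTIVE:
            last_pulse = cur.timestamp_ms
    return last_pulse


def is_awaiting_pulse(events: Sequence[UpdateEvent],
                      pulse_timeout_ms: int,
                      now_ms: int) -> bool:
    """Decide on startup whether the update should block on a new pulse."""
    last_pulse = infer_last_pulse_ms(events)
    return last_pulse is None or now_ms - last_pulse >= pulse_timeout_ms


# Scenario from the report: pulse received, then a failover a second later.
history = [
    UpdateEvent("ROLL_FORWARD_AWAITING_PULSE", 1_000),
    UpdateEvent("ROLLING_FORWARD", 5_000),
]
one_hour_ms = 60 * 60 * 1000
print(is_awaiting_pulse(history, one_hour_ms, now_ms=6_000))
```

Pulses that arrive while the update is already active would leave no status transition behind, so a timestamp inferred this way is only a lower bound on the true last pulse time. The update may therefore block somewhat earlier than strictly necessary, but never immediately after a failover as in the reproduction steps.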