Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Later
Description
JobManager failures can be handled in a non-disruptive way by reconciling the new JobManager leader with the TaskManagers.
I suggest using the term "reconcile": other frameworks (such as Mesos) already use it for non-disruptive handling of failures.
The basic approach is the following:
- When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to the JobManager
- On connect, the TaskManager tells the JobManager about its currently running tasks
- The new JobManager waits for TaskManagers to connect and report their task status, and reconstructs the ExecutionGraph state from these reports
- Tasks whose status is not reconstructed within a certain time are assumed to have failed and trigger regular task recovery.
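The steps above can be sketched roughly as follows. This is only an illustration of the proposed protocol, not Flink code: `ReconcilingJobManager`, `TaskState`, `onTaskManagerReport` and `finishReconciliation` are hypothetical names. The sketch also assumes that the new JobManager already knows which task IDs belong to the job, and that reports arriving after the timeout are simply ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of JobManager-side reconciliation; not a Flink API. */
final class ReconcilingJobManager {

    enum TaskState { UNKNOWN, RUNNING, FAILED }

    // Assumption: the expected task IDs are known to the new leader up front.
    private final Map<String, TaskState> tasks = new LinkedHashMap<>();
    private final long deadlineMillis;

    ReconcilingJobManager(List<String> expectedTaskIds, long nowMillis, long timeoutMillis) {
        for (String id : expectedTaskIds) {
            tasks.put(id, TaskState.UNKNOWN);
        }
        this.deadlineMillis = nowMillis + timeoutMillis;
    }

    /** Called when a TaskManager reconnects and reports its running tasks. */
    void onTaskManagerReport(List<String> runningTaskIds, long nowMillis) {
        if (nowMillis > deadlineMillis) {
            return; // assumption: late reports are ignored in this sketch
        }
        for (String id : runningTaskIds) {
            // Only tasks of this job that are still unaccounted for are updated.
            tasks.computeIfPresent(
                id, (key, state) -> state == TaskState.UNKNOWN ? TaskState.RUNNING : state);
        }
    }

    /**
     * Once the timeout has elapsed, marks every unreported task as FAILED and
     * returns those IDs so the caller can trigger regular task recovery.
     * Returns an empty list while the reconciliation window is still open.
     */
    List<String> finishReconciliation(long nowMillis) {
        List<String> toRecover = new ArrayList<>();
        if (nowMillis < deadlineMillis) {
            return toRecover;
        }
        for (Map.Entry<String, TaskState> entry : tasks.entrySet()) {
            if (entry.getValue() == TaskState.UNKNOWN) {
                entry.setValue(TaskState.FAILED);
                toRecover.add(entry.getKey());
            }
        }
        return toRecover;
    }

    /** Returns the reconstructed state, or null for a task not part of this job. */
    TaskState stateOf(String taskId) {
        return tasks.get(taskId);
    }
}
```

Time is passed in explicitly rather than read from the system clock, which keeps the sketch deterministic; a real implementation would more likely use a scheduled timeout.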
To avoid having to re-implement this for the new JobManager / TaskManager approach in flip-6, I suggest implementing it directly in the flip-6 feature branch.