[FLINK-4911] Non-disruptive JobManager Failures via Reconciliation - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Later
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:
None

Description

JobManager failures can be handled in a non-disruptive way - by reconciling the new JobManager leader and the TaskManagers.
I suggest to use this term (reconcile) - it has been uses also by other frameworks (like Mesos) for non-disruptive handling of failures.

The basic approach is the following:

When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to the JobManager
On connect, the TaskManager tells the JobManager about its currently running tasks
A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs the ExecutionGraph state from these reports
Tasks whose status was not reconstructed in a certain time are assumed failed and trigger regular task recovery.

To avoid having to re-implement this for the new JobManager / TaskManager approach in flip-6, I suggest to directly implement this into the flip-6 feature branch.