Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Duplicate
-
None
-
None
-
None
-
None
Description
Currently, the JobManager is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs.
Therefore, it is crucial to make the JobManager fault tolerant so that the Flink cluster can recover from failed JobManager. As a first step towards this goal, I propose to make the JobManager highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job.
In case that the JobManager dies, one of the other running JobManager will be elected as the leader and take over the role of the leader. The Client and the TaskManager will automatically detect the new JobManager by querying the ZooKeeper cluster.
Note that this does not achieve full fault tolerance for the JobManager but it allows the cluster to recover from failed JobManager. The design of high-availability for the JobManager is tracked in the wiki here [1].
Resources:
[1] https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability