[FLINK-2289] Make JobManager highly available - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

Currently, the JobManager is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs.

Therefore, it is crucial to make the JobManager fault tolerant so that the Flink cluster can recover from failed JobManager. As a first step towards this goal, I propose to make the JobManager highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job.

In case that the JobManager dies, one of the other running JobManager will be elected as the leader and take over the role of the leader. The Client and the TaskManager will automatically detect the new JobManager by querying the ZooKeeper cluster.

Note that this does not achieve full fault tolerance for the JobManager but it allows the cluster to recover from failed JobManager. The design of high-availability for the JobManager is tracked in the wiki here [1].

Resources:
[1] https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability

Attachments

Activity

People

Assignee:: Till Rohrmann

Reporter:: Till Rohrmann

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 29/Jun/15 08:57

Updated:: 29/Jun/15 12:20

Resolved:: 29/Jun/15 12:20