Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-2289

Make JobManager highly available

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • None
    • None
    • None
    • None

    Description

      Currently, the JobManager is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs.

      Therefore, it is crucial to make the JobManager fault tolerant so that the Flink cluster can recover from failed JobManager. As a first step towards this goal, I propose to make the JobManager highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job.

      In case that the JobManager dies, one of the other running JobManager will be elected as the leader and take over the role of the leader. The Client and the TaskManager will automatically detect the new JobManager by querying the ZooKeeper cluster.

      Note that this does not achieve full fault tolerance for the JobManager but it allows the cluster to recover from failed JobManager. The design of high-availability for the JobManager is tracked in the wiki here [1].

      Resources:
      [1] https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability

      Attachments

        Activity

          People

            trohrmann Till Rohrmann
            trohrmann Till Rohrmann
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: