[FLINK-2287] Implement JobManager high availability - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10.0
Component/s: Runtime / Coordination
Labels: None

Description

The problem: The JobManager (JM) is a single point of failure. When it crashes, TaskManagers (TMs) fail all running jobs and try to reconnect to the same JM. A failed JM loses all of its state and cannot resume the running jobs, even if it recovers and the TMs reconnect.

Solution: Implement JM fault tolerance/high availability by running multiple JM instances, with one acting as leader and the other(s) on standby. The exact coordination and state update protocol between JM, TMs, and clients is covered in the sub-tasks.
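
To illustrate the leader/standby idea, here is a minimal in-memory sketch. It is not the implementation from this issue: the class `LeaderElectionSketch`, its methods, and the addresses are hypothetical, and a plain `TreeMap` stands in for the coordination service (ZooKeeper in the sub-tasks). It assumes the candidate with the lowest registration sequence number is the leader, and that each new leader gets a fresh session ID so messages addressed to an old leader can be rejected.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.TreeMap;
import java.util.UUID;

/** Toy stand-in for a coordination service; hypothetical, not Flink code. */
class LeaderElectionSketch {
    // Candidates ordered by registration sequence number; the lowest one leads.
    private final TreeMap<Long, String> candidates = new TreeMap<>();
    private long nextSequence = 0;
    private Long leaderSequence = null;   // sequence number of the current leader
    private UUID leaderSessionId = null;  // regenerated on every leader change

    /** Registers a JobManager candidate and returns its sequence number. */
    synchronized long register(String jobManagerAddress) {
        long sequence = nextSequence++;
        candidates.put(sequence, jobManagerAddress);
        electIfNeeded();
        return sequence;
    }

    /** Simulates a crash: the candidate's entry disappears. */
    synchronized void fail(long sequence) {
        candidates.remove(sequence);
        electIfNeeded();
    }

    /** Address of the current leader, if any candidate is alive. */
    synchronized Optional<String> leader() {
        return leaderSequence == null
                ? Optional.empty()
                : Optional.of(candidates.get(leaderSequence));
    }

    synchronized UUID leaderSessionId() {
        return leaderSessionId;
    }

    /** A message is accepted only if it carries the current leader session ID. */
    synchronized boolean accepts(UUID sessionIdOnMessage) {
        return leaderSessionId != null && leaderSessionId.equals(sessionIdOnMessage);
    }

    // Picks the lowest remaining sequence number; issues a new session ID on change.
    private void electIfNeeded() {
        Long newLeader = candidates.isEmpty() ? null : candidates.firstKey();
        if (!Objects.equals(newLeader, leaderSequence)) {
            leaderSequence = newLeader;
            leaderSessionId = newLeader == null ? null : UUID.randomUUID();
        }
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        long a = election.register("jm-a:6123"); // hypothetical addresses
        election.register("jm-b:6123");
        System.out.println(election.leader().get()); // prints "jm-a:6123"
        election.fail(a);                            // leader crashes
        System.out.println(election.leader().get()); // prints "jm-b:6123"
    }
}
```

In this sketch a standby takes over as soon as the leader's entry is removed, and the session ID changes with it, so a TM still holding the old ID would be told to look up the new leader rather than silently talking to a stale one.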

Related Wiki: https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability

Issue Links

is depended upon by

FLINK-2340 Provide standalone mode for web interface of the JobManager

Closed

is superseded by

FLINK-7106 Make SubmittedJobGraphStore implementation configurable

Closed

Sub-Tasks

#	Summary	Status	Assignee
1.	Set up ZooKeeper for distributed coordination	Closed	Ufuk Celebi
2.	Use ZooKeeper to elect JobManager leader and send information to TaskManagers	Closed	Till Rohrmann
3.	Refactor RPCs from within the ExecutionGraph	Closed	Till Rohrmann
4.	Assign session IDs to JobManager and TaskManager messages	Closed	Till Rohrmann
5.	Recover running jobs on JobManager failure	Resolved	Ufuk Celebi
6.	Add high availability support for Yarn	Resolved	Unassigned

People

Assignee: Unassigned

Reporter: Ufuk Celebi

Votes: 0

Watchers: 5

Dates

Created: 29/Jun/15 08:51

Updated: 28/Feb/19 14:01

Resolved: 20/Oct/15 12:56