[FLINK-18625] Maintain redundant taskmanagers to speed up failover - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

When a Flink job fails because taskmanagers were killed, it requests new containers on restart. Requesting new containers can be very slow, sometimes taking dozens of seconds or even longer. The causes vary: for example, YARN or HDFS may be slow, or machine performance may be poor. In some production scenarios the SLA is strict and failover must complete within seconds.

To speed up the recovery process, we can maintain redundant slots in advance. When the job restarts, it can use the redundant slots immediately instead of requesting new taskmanagers.

The implementation can be done in SlotManagerImpl. Below is a brief description:

- In the constructor, initialize redundantTaskmanagerNum from the config.
- In start(), allocate the redundant taskmanagers.
- In start(), replace taskManagerTimeoutCheck() with checkValidTaskManagers().
- In checkValidTaskManagers(), manage both redundant and timed-out taskmanagers. The number of idle taskmanagers must not be less than redundantTaskmanagerNum:
  - If there are fewer, allocate from the resourceManager until the numbers are equal.
  - If there are more, release timed-out taskmanagers but keep at least redundantTaskmanagerNum idle taskmanagers.
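
The rule described for checkValidTaskManagers() can be sketched as a small pure function. This is only an illustration of the proposal, not the code in SlotManagerImpl or in the pull request: the class name RedundantTaskManagerSketch, the Decision type, and the three integer parameters are hypothetical, and the sketch assumes timedOutIdleCount never exceeds idleCount.

```java
// Hypothetical sketch of the decision made on each periodic check.
// None of these names come from Flink; they only illustrate the rule
// "keep at least redundantTaskManagerNum idle taskmanagers".
final class RedundantTaskManagerSketch {

    /** How many taskmanagers to request or release after one check. */
    static final class Decision {
        final int toAllocate;
        final int toRelease;

        Decision(int toAllocate, int toRelease) {
            this.toAllocate = toAllocate;
            this.toRelease = toRelease;
        }
    }

    /**
     * @param idleCount               idle taskmanagers currently registered
     * @param timedOutIdleCount       idle taskmanagers whose idle timeout has expired
     *                                (assumed to be <= idleCount)
     * @param redundantTaskManagerNum configured number of redundant taskmanagers
     */
    static Decision checkValidTaskManagers(
            int idleCount, int timedOutIdleCount, int redundantTaskManagerNum) {
        if (idleCount < redundantTaskManagerNum) {
            // Too few idle taskmanagers: allocate until the numbers are equal.
            return new Decision(redundantTaskManagerNum - idleCount, 0);
        }
        // Enough idle taskmanagers: release timed-out ones, but never drop
        // below the redundant number.
        int surplus = idleCount - redundantTaskManagerNum;
        int toRelease = Math.min(surplus, timedOutIdleCount);
        return new Decision(0, toRelease);
    }
}
```

For example, with 5 idle taskmanagers, 4 of them timed out, and redundantTaskManagerNum set to 2, the sketch releases 3 and keeps 2 idle. With 1 idle taskmanager and the same setting, it allocates 1 more.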

Issue Links

is related to: FLINK-18760 Redundant task managers should be released when there's no job running in session cluster (Closed)

links to: GitHub Pull Request #12958

People

Assignee: Liu
Reporter: Liu
Votes: 0
Watchers: 10

Dates

Created: 17/Jul/20 10:58
Updated: 30/Nov/21 20:38
Resolved: 30/Jul/20 07:57