[FLINK-25277] Introduce explicit shutdown signalling between TaskManager and JobManager - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13.0, 1.14.0
Fix Version/s: 1.15.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available
- reactive

Release Note:
Reduce down-scaling delay in reactive mode through signalling between TaskManager and JobManager

Description

We need to introduce shutdown signalling between TaskManager and JobManager for fast & graceful shutdown in reactive scheduler mode.

In Flink 1.14 and earlier versions, the JobManager tracks the availability of a TaskManager using a heartbeat. By default, this heartbeat is configured with an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of a TaskManager is recognized only after about 50-60 seconds. This works fine for the static scheduling mode, where a TaskManager only disappears as part of a cluster shutdown or a job failure. However, in the reactive scheduler mode (FLINK-10407), TaskManagers are regularly added to and removed from a running job. Here, the heartbeat mechanism incurs additional delays.
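To illustrate why a silent shutdown goes unnoticed for so long, here is a minimal sketch of a timeout-based liveness check. The class `HeartbeatMonitorSketch` and its methods are hypothetical and are not Flink's actual heartbeat implementation; only the 10 second interval and 50 second timeout are taken from the description above.

```java
import java.time.Duration;

/** Hypothetical, simplified liveness check; NOT Flink's real heartbeat code. */
class HeartbeatMonitorSketch {
    // Default values quoted in the issue description.
    static final Duration HEARTBEAT_INTERVAL = Duration.ofSeconds(10);
    static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(50);

    private long lastHeartbeatMillis;

    public HeartbeatMonitorSketch(long startMillis) {
        this.lastHeartbeatMillis = startMillis;
    }

    /** Records that a heartbeat from the TaskManager arrived at the given time. */
    public void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** The TaskManager counts as lost only once the full timeout has elapsed. */
    public boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis >= HEARTBEAT_TIMEOUT.toMillis();
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(0L);
        // Last heartbeat arrives at t = 10 s, then the TaskManager stops silently.
        monitor.onHeartbeat(HEARTBEAT_INTERVAL.toMillis());
        // At t = 30 s the JobManager still has no reason to consider it lost.
        System.out.println("lost at 30 s: " + monitor.isTimedOut(30_000L));
        // Only after the timeout has fully elapsed is the loss detected.
        System.out.println("lost at 60 s: " + monitor.isTimedOut(60_000L));
    }
}
```

In this simplified model, nothing distinguishes a TaskManager that has shut down from one whose next heartbeat is merely late, so the JobManager has to wait out the full timeout.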

To remove these delays, we add an explicit shutdown signal from the TaskManager to the JobManager.
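The idea can be sketched as JobManager-side bookkeeping that reacts to an explicit signal immediately and keeps the heartbeat timeout only as a fallback. Everything in this sketch is an assumption made for illustration: `TaskManagerRegistrySketch`, `onShutdownSignal`, and `expire` are hypothetical names, not the API introduced by the linked pull requests.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical JobManager-side bookkeeping; names are illustrative, not Flink APIs. */
class TaskManagerRegistrySketch {
    private final long timeoutMillis;
    // TaskManager id -> time of the last heartbeat, in milliseconds.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public TaskManagerRegistrySketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void register(String taskManagerId, long nowMillis) {
        lastHeartbeat.put(taskManagerId, nowMillis);
    }

    /** Refreshes the timestamp of a known TaskManager; unknown ids are ignored. */
    public void onHeartbeat(String taskManagerId, long nowMillis) {
        lastHeartbeat.computeIfPresent(taskManagerId, (id, previous) -> nowMillis);
    }

    /** Explicit signal: the TaskManager announces its shutdown and is dropped at once. */
    public void onShutdownSignal(String taskManagerId) {
        lastHeartbeat.remove(taskManagerId);
    }

    /** Fallback: drop TaskManagers whose last heartbeat is older than the timeout. */
    public void expire(long nowMillis) {
        lastHeartbeat.values().removeIf(last -> nowMillis - last >= timeoutMillis);
    }

    public boolean isRegistered(String taskManagerId) {
        return lastHeartbeat.containsKey(taskManagerId);
    }

    public static void main(String[] args) {
        TaskManagerRegistrySketch registry = new TaskManagerRegistrySketch(50_000L);
        registry.register("tm-1", 0L);
        registry.register("tm-2", 0L);

        // tm-1 shuts down gracefully and says so; no waiting is needed.
        registry.onShutdownSignal("tm-1");
        System.out.println("tm-1 registered: " + registry.isRegistered("tm-1"));

        // tm-2 disappears silently and is only dropped once the timeout elapses.
        registry.expire(10_000L);
        System.out.println("tm-2 registered at 10 s: " + registry.isRegistered("tm-2"));
        registry.expire(50_000L);
        System.out.println("tm-2 registered at 50 s: " + registry.isRegistered("tm-2"));
    }
}
```

Keeping the timeout path in place means that crashes and network partitions, where no signal can be sent, are still detected.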

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout

Issue Links

causes

FLINK-26400 Release Testing: Explicit shutdown signalling from TaskManager to JobManager

Resolved

fixes

FLINK-25749 YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP

Closed

links to

GitHub Pull Request #18169

GitHub Pull Request #18446

GitHub Pull Request #18948

People

Assignee: Niklas Semmler

Reporter: Niklas Semmler

Votes: 0

Watchers: 3

Dates

Created: 13/Dec/21 14:48

Updated: 02/Mar/22 14:05

Resolved: 02/Mar/22 14:05

Time Tracking

Estimated: 504h

Remaining: 504h

Logged: Not Specified