Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Version/s: 1.13.0, 1.14.0
Reduce down-scaling delay in reactive mode through signalling between TaskManager and JobManager
Description
We need to introduce shutdown signalling between TaskManager and JobManager for fast & graceful shutdown in reactive scheduler mode.
In Flink 1.14 and earlier versions, the JobManager tracks the availability of a TaskManager using a heartbeat. By default, this heartbeat is configured with an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of a TaskManager is recognized only after about 50-60 seconds. This works fine for the static scheduling mode, where a TaskManager only disappears as part of a cluster shutdown or a job failure. However, in the reactive scheduler mode (FLINK-10407), TaskManagers are regularly added to and removed from a running job. Here, the heartbeat mechanism incurs additional delays on every down-scaling operation.
To remove these delays, we add an explicit shutdown signal from the TaskManager to the JobManager.
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout
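To illustrate the difference between the two detection paths, here is a minimal Java sketch. It is not Flink code: `ShutdownSignalSketch`, `TaskManagerTracker`, `notifyShutdown`, and `expireTimedOut` are hypothetical names invented for this example, and the model is deliberately simplified (a single TaskManager that stops right after its last heartbeat, and a tracker that checks for timeouts once per simulated second). Only the 10 second interval and 50 second timeout are taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class ShutdownSignalSketch {

    // Defaults quoted in the description: 10 s interval, 50 s timeout.
    static final long HEARTBEAT_INTERVAL_MS = 10_000L;
    static final long HEARTBEAT_TIMEOUT_MS = 50_000L;

    /** Hypothetical JobManager-side bookkeeping; not an actual Flink class. */
    static final class TaskManagerTracker {
        private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

        /** Records the time of the most recent heartbeat from a TaskManager. */
        void heartbeat(String taskManagerId, long nowMs) {
            lastHeartbeatMs.put(taskManagerId, nowMs);
        }

        /** Explicit shutdown signal: forget the TaskManager immediately. */
        void notifyShutdown(String taskManagerId) {
            lastHeartbeatMs.remove(taskManagerId);
        }

        /** Heartbeat path: drop TaskManagers whose last heartbeat is too old. */
        void expireTimedOut(long nowMs) {
            lastHeartbeatMs.values().removeIf(last -> nowMs - last >= HEARTBEAT_TIMEOUT_MS);
        }

        boolean isRegistered(String taskManagerId) {
            return lastHeartbeatMs.containsKey(taskManagerId);
        }
    }

    /**
     * Simulates one TaskManager whose last heartbeat arrives at stopTimeMs, after
     * which it stops. Returns how long the tracker takes to notice the shutdown.
     */
    static long detectionDelayMs(boolean explicitSignal, long stopTimeMs) {
        TaskManagerTracker tracker = new TaskManagerTracker();
        String tm = "tm-1";
        tracker.heartbeat(tm, stopTimeMs);
        if (explicitSignal) {
            // The TaskManager announces its shutdown instead of going silent.
            tracker.notifyShutdown(tm);
        }
        long now = stopTimeMs;
        while (tracker.isRegistered(tm)) {
            now += 1_000L; // the tracker checks for timeouts once per simulated second
            tracker.expireTimedOut(now);
        }
        return now - stopTimeMs;
    }

    public static void main(String[] args) {
        System.out.println("heartbeat only:  " + detectionDelayMs(false, 20_000L) + " ms");
        System.out.println("explicit signal: " + detectionDelayMs(true, 20_000L) + " ms");
    }
}
```

In this simplified model the heartbeat-only path needs the full 50 second timeout before the TaskManager is removed, while the explicit signal removes it immediately. A real implementation would presumably keep the heartbeat as a fallback for TaskManagers that crash without being able to send the signal.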
Issue Links
- causes: FLINK-26400 Release Testing: Explicit shutdown signalling from TaskManager to JobManager (Resolved)
- fixes: FLINK-25749 YARNSessionFIFOSecuredITCase.testDetachedMode fails on AZP (Closed)
- links to