FLINK-25277: Introduce explicit shutdown signalling between TaskManager and JobManager

Parent: FLINK-22255 AdaptiveScheduler improvements/bugs

Details

    • Reduce down-scaling delay in reactive mode through signalling between TaskManager and JobManager

    Description

      We need to introduce shutdown signalling between TaskManager and JobManager for fast & graceful shutdown in reactive scheduler mode.

      In Flink 1.14 and earlier versions, the JobManager tracks the availability of a TaskManager using a heartbeat. By default, this heartbeat is configured with an interval of 10 seconds and a timeout of 50 seconds [1]. Hence, the shutdown of a TaskManager is recognized only after about 50-60 seconds. This works fine for the static scheduling mode, where a TaskManager only disappears as part of a cluster shutdown or a job failure. However, in the reactive scheduler mode (FLINK-10407), TaskManagers are regularly added to and removed from a running job. Here, the heartbeat mechanism incurs additional delays.

      To remove these delays, we add an explicit shutdown signal from the TaskManager to the JobManager.
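
      As a rough illustration of the difference, the following is a minimal toy model, not Flink's actual implementation. The class ShutdownSignalSketch and all of its methods are hypothetical names made up for this sketch; only the default values of 10 seconds and 50 seconds come from the description above. It contrasts removing a TaskManager only after the heartbeat timeout with removing it as soon as an explicit shutdown signal arrives.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of JobManager-side TaskManager tracking. Hypothetical, not Flink code. */
class ShutdownSignalSketch {
    // Defaults quoted in the issue description (heartbeat interval and timeout).
    static final long HEARTBEAT_INTERVAL_MS = 10_000;
    static final long HEARTBEAT_TIMEOUT_MS = 50_000;

    // Timestamp (ms) of the last heartbeat received from each registered TaskManager.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    public void register(String tmId, long nowMs) {
        lastHeartbeat.put(tmId, nowMs);
    }

    public void onHeartbeat(String tmId, long nowMs) {
        // Ignore heartbeats from TaskManagers that are no longer registered.
        lastHeartbeat.computeIfPresent(tmId, (id, previous) -> nowMs);
    }

    /** Heartbeat path: a TaskManager is dropped only once the timeout has elapsed. */
    public void checkTimeouts(long nowMs) {
        lastHeartbeat.entrySet()
                .removeIf(e -> nowMs - e.getValue() >= HEARTBEAT_TIMEOUT_MS);
    }

    /** Explicit-signal path (assumed design): drop the TaskManager immediately. */
    public void onShutdownSignal(String tmId) {
        lastHeartbeat.remove(tmId);
    }

    public boolean isRegistered(String tmId) {
        return lastHeartbeat.containsKey(tmId);
    }

    public static void main(String[] args) {
        // Case 1: the TaskManager stops silently right after its heartbeat at t=10s.
        ShutdownSignalSketch heartbeatOnly = new ShutdownSignalSketch();
        heartbeatOnly.register("tm-1", 0);
        heartbeatOnly.onHeartbeat("tm-1", HEARTBEAT_INTERVAL_MS);
        heartbeatOnly.checkTimeouts(30_000);
        System.out.println("t=30s, heartbeat only: registered="
                + heartbeatOnly.isRegistered("tm-1"));
        heartbeatOnly.checkTimeouts(60_000);
        System.out.println("t=60s, heartbeat only: registered="
                + heartbeatOnly.isRegistered("tm-1"));

        // Case 2: the TaskManager announces its shutdown at t=10s.
        ShutdownSignalSketch withSignal = new ShutdownSignalSketch();
        withSignal.register("tm-1", 0);
        withSignal.onHeartbeat("tm-1", HEARTBEAT_INTERVAL_MS);
        withSignal.onShutdownSignal("tm-1");
        System.out.println("t=10s, with signal: registered="
                + withSignal.isRegistered("tm-1"));
    }
}
```

      In this model the heartbeat-only tracker keeps the stopped TaskManager registered until the full timeout has passed, while the signalled tracker can release it right away, which is the delay this issue aims to remove for down-scaling in reactive mode.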

       

      [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout

            People

              nsemmler Niklas Semmler
              nsemmler Niklas Semmler
                Time Tracking

                  Estimated: 504h
                  Remaining: 504h
                  Logged: Not Specified