Flink / FLINK-24377

TM resource may not be properly released after heartbeat timeout

Details

    Description

      In the native K8s and Yarn deployment modes, the RM disconnects a TM when its heartbeat times out. However, it does not actively release the pod / container of that TM. Releasing the pod / container relies on the TM terminating itself after it fails to re-register with the RM.

      Under some rare conditions, the TM process may not terminate and may hang for a long time. In such cases, K8s / Yarn sees the process as still running and therefore does not release the pod / container. Neither does Flink's resource manager. Consequently, the resource is leaked until the entire application is terminated.

      To fix this, we should make ActiveResourceManager actively release the resource to K8s / Yarn after a TM heartbeat timeout.
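      The proposed behavior can be illustrated with a minimal sketch. This is not Flink's actual code: `ResourceDriver`, `SketchResourceManager`, and all method names below are hypothetical stand-ins, and time is passed in explicitly to keep the example deterministic. It only shows the idea that a heartbeat timeout should trigger both a disconnect and an explicit release request to K8s / Yarn, rather than waiting for the TM to exit on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the K8s / Yarn client; not a Flink API. */
interface ResourceDriver {
    void releaseResource(String workerId);
}

/** Hypothetical, simplified resource manager illustrating the proposed fix. */
class SketchResourceManager {
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final ResourceDriver driver;
    private final long timeoutMillis;

    SketchResourceManager(ResourceDriver driver, long timeoutMillis) {
        this.driver = driver;
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a registration or heartbeat from a TM at the given time. */
    void registerOrHeartbeat(String workerId, long nowMillis) {
        lastHeartbeatMillis.put(workerId, nowMillis);
    }

    boolean isRegistered(String workerId) {
        return lastHeartbeatMillis.containsKey(workerId);
    }

    /** Disconnects timed-out TMs and actively releases their pod / container. */
    List<String> checkHeartbeats(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                timedOut.add(entry.getKey());
            }
        }
        for (String workerId : timedOut) {
            // Existing behavior described in the issue: disconnect the TM.
            lastHeartbeatMillis.remove(workerId);
            // Proposed addition: do not rely on the TM terminating itself;
            // explicitly ask the external resource provider to release it.
            driver.releaseResource(workerId);
        }
        return timedOut;
    }
}
```

      With a driver that simply records release requests, a TM whose last heartbeat is older than the timeout is both removed from the registry and reported to the driver, while TMs with recent heartbeats are left alone.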

            People

              xtsong Xintong Song
              xtsong Xintong Song