Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.14.0, 1.13.2
Description
In native K8s and Yarn deploy modes, the RM disconnects a TM when its heartbeat times out. However, it does not actively release that TM's pod / container. Releasing the pod / container relies on the TM terminating itself after it fails to re-register with the RM.
In some rare conditions, the TM process may not terminate and can hang for a long time. In such cases, K8s / Yarn sees the process still running and thus will not release the pod / container. Neither will Flink's resource manager. Consequently, the resource is leaked until the entire application is terminated.
To fix this, we should make ActiveResourceManager actively release the resource to K8s / Yarn after a TM heartbeat timeout.
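The proposed behavior could be sketched roughly as follows. This is a toy model only, not Flink's actual code: `ResourceDriver`, `ToyActiveResourceManager`, and every method name here are hypothetical stand-ins for the K8s / Yarn client and the resource manager. The point it illustrates is that the heartbeat-timeout handler both drops the TM registration and releases the backing pod / container, rather than waiting for the TM to exit on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the K8s / Yarn client; not a Flink interface.
interface ResourceDriver {
    void releaseResource(String resourceId);
}

// Toy resource manager that tracks which pod / container backs each TM.
class ToyActiveResourceManager {
    // TM id -> id of the pod / container it runs in.
    private final Map<String, String> registeredTaskManagers = new HashMap<>();
    private final ResourceDriver driver;

    ToyActiveResourceManager(ResourceDriver driver) {
        this.driver = driver;
    }

    void registerTaskManager(String tmId, String resourceId) {
        registeredTaskManagers.put(tmId, resourceId);
    }

    boolean isRegistered(String tmId) {
        return registeredTaskManagers.containsKey(tmId);
    }

    // Called when a TM's heartbeat times out.
    void onHeartbeatTimeout(String tmId) {
        // Disconnect: forget the TM registration.
        String resourceId = registeredTaskManagers.remove(tmId);
        if (resourceId == null) {
            return; // Unknown or already handled TM; nothing to release.
        }
        // Actively hand the pod / container back instead of relying on
        // the TM process to terminate itself.
        driver.releaseResource(resourceId);
    }
}

class HeartbeatReleaseSketch {
    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        ToyActiveResourceManager rm = new ToyActiveResourceManager(released::add);
        rm.registerTaskManager("tm-1", "pod-1");
        rm.onHeartbeatTimeout("tm-1");
        System.out.println(released); // prints [pod-1]
    }
}
```

In this sketch a repeated timeout for the same TM is a no-op, so the resource is released at most once even if the hung TM is reported more than once.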