Currently, if Kubernetes/Yarn does not have enough resources to fulfill Flink's resource requirement, there will be pending pod/container requests on Kubernetes/Yarn. These pending resource requirements are never cleared until either fulfilled or the Flink cluster is shutdown.
However, sometimes Flink no longer needs the pending resources. E.g., the slot request is then fulfilled by another slots that become available, or the job failed due to slot request timeout (in a session cluster). In such cases, Flink does not remove the resource request until the resource is allocated, then it discovers that it no longer needs the allocated resource and release them. This would affect the underlying Kubernetes/Yarn cluster, especially when the cluster is under heavy workload.
It would be good for Flink to cancel pod/container requests as earlier as possible if it can discover that some of the pending workers are no longer needed.
There are several approaches potentially achieve this.
- We can always check whether there's a pending worker that can be canceled when a PendingTaskManagerSlot is unassigned.
- We can have a separate timeout for requesting new worker. If the resource cannot be allocated within the given time since requested, we should cancel that resource request and claim a resource allocation failure.
- We can share the same timeout for starting new worker (proposed in FLINK-13554). This is similar to 2), but it requires the worker to be registered, rather than allocated, before timeout.