Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Description
This follows up on the discussion in this PR [1].
In the current implementation, POD_CREATION_RETRY_INTERVAL is set to a fixed value of "3s", which means that when creating a TaskManager pod fails, we schedule a delayed retry after 3 seconds. This works for most cases. However, there is still a risk that too many retries from different Flink clusters will flood the Kubernetes API server. So we need to add initial and max settings for the retry interval, similar to NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX.
We could add an ExponentialBackoff for the retry policy. The backoff could be reset to its initial value once a new TaskManager has been created successfully after several retries.
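To make the proposal concrete, here is a minimal sketch of such a backoff. It is only an illustration under assumptions: the class shape, the method names (nextDelay, reset), and the doubling factor are hypothetical and not taken from the Flink code base or from the PR.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not an existing Flink class): exponential backoff
 * with an initial delay, an upper bound, and an explicit reset.
 */
final class ExponentialBackoff {
    private final Duration initial;
    private final Duration max;
    private Duration current;

    ExponentialBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative() || max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.initial = initial;
        this.max = max;
        this.current = initial;
    }

    /** Returns the delay to wait before the next retry, then doubles it, capped at max. */
    Duration nextDelay() {
        Duration delay = current;
        Duration doubled = current.multipliedBy(2);
        current = doubled.compareTo(max) > 0 ? max : doubled;
        return delay;
    }

    /** Called after a pod was created successfully: start again from the initial delay. */
    void reset() {
        current = initial;
    }
}
```

With an initial value of 3s and a max of 10s, successive failures would wait 3s, 6s, 10s, 10s, and so on, and a successful TaskManager creation would bring the next delay back to 3s. Whether the factor should be fixed at 2 or made configurable is left open here.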
Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also apply the retry interval so that re-creation requests do not flood the K8s API server. But that could be done in a separate ticket/PR.
[1]. https://github.com/apache/flink/pull/11427#discussion_r406318451
Issue Links
- is fixed by: FLINK-10868 Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement (Closed)
- is related to: FLINK-17176 Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler (Closed)