[FLINK-17127] Make pod creating retry interval configurable - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: Deployment / Kubernetes
Labels: None

Description

This follows up on the discussion in this PR [1].

In the current implementation, POD_CREATION_RETRY_INTERVAL is fixed at "3s", which means that when creating a TaskManager pod fails, a retry is scheduled after a 3s delay. This works for most cases. However, there is still a risk that too many retries from different Flink clusters will flood the Kubernetes API server. So we need to add initial and max settings for the retry interval, similar to NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX.

We could add an ExponentialBackoff for the retry policy. The backoff could be reset to its initial value once a new TaskManager is created successfully after several retries.
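To make the proposal concrete, here is a minimal sketch of such a backoff. It is not taken from the Flink code base: the class name `PodCreationBackoff` and its methods are hypothetical, and the doubling factor is an assumption, since the issue only asks for an initial and a max interval.

```java
import java.time.Duration;

/** Hypothetical helper (not part of Flink): exponential backoff for pod creation retries. */
final class PodCreationBackoff {
    private final Duration initial; // first retry delay, e.g. the current fixed 3s
    private final Duration max;     // upper bound for the retry delay
    private Duration next;          // delay that the next retry will use

    PodCreationBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative() || max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.initial = initial;
        this.max = max;
        this.next = initial;
    }

    /** Returns the delay to wait before the next retry, then doubles it, capped at max. */
    Duration nextDelay() {
        Duration current = next;
        // Compare against max / 2 before doubling so the multiplication cannot overflow.
        next = next.compareTo(max.dividedBy(2)) > 0 ? max : next.multipliedBy(2);
        return current;
    }

    /** Call after a TaskManager pod was created successfully. */
    void reset() {
        next = initial;
    }
}
```

With an initial interval of 3s and an assumed max of 20s, successive calls to `nextDelay()` yield 3s, 6s, 12s, 20s, 20s, and the sequence starts again at 3s after `reset()`.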

Inspired by ~~FLINK-17176~~: when a pod crashes unexpectedly, we should also apply a retry interval so that re-creation requests do not flood the K8s API server. But that could be done in a separate ticket/PR.

[1]. https://github.com/apache/flink/pull/11427#discussion_r406318451

Issue Links

is fixed by

FLINK-10868 Flink's JobCluster ResourceManager doesn't use maximum-failed-containers as limit of resource acquirement

Closed

is related to

FLINK-17176 Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler

Closed

People

Assignee: Unassigned

Reporter: Yang Wang

Votes: 0

Watchers: 1

Dates

Created: 14/Apr/20 05:04

Updated: 23/Feb/21 03:31

Resolved: 23/Feb/21 03:31