[FLINK-33728] Do not rewatch when KubernetesResourceManagerDriver watch fail - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: 1.20.0
Component/s: Deployment / Kubernetes
Labels:
- pull-request-available

Description

I hit a massive production problem when Kubernetes etcd became slow to respond. After Kubernetes recovered about an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-established their watches on receiving ResourceVersionTooOld, which put great pressure on the API server and caused it to fail again.

I am not sure whether it is necessary to call

getResourceEventHandler().onError(throwable)

in the PodCallbackHandlerImpl#handleError method.

We could simply ignore the watch disconnection and try to rewatch the next time requestResource is called. We could then rely on the Akka heartbeat timeout to detect TaskManager failures, just as YARN mode does.
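To make the proposal concrete, here is a minimal sketch of the "rewatch lazily" idea. It is not the Flink implementation and not the change merged for this issue: the class `LazyRewatchSketch`, the `Watch` interface, and the `isWatching` helper are hypothetical stand-ins, and pod creation is left out. It only shows a watch that is dropped on error and re-created the next time a resource is requested.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a resource manager driver; not Flink code. */
class LazyRewatchSketch {

    /** Hypothetical handle for an open pod watch. */
    interface Watch {
        void close();
    }

    private final Supplier<Watch> watchFactory; // opens a new watch (assumed)
    private Watch currentWatch;                 // null means no active watch

    LazyRewatchSketch(Supplier<Watch> watchFactory) {
        this.watchFactory = watchFactory;
    }

    /** On a watch failure, drop the watch and do NOT reconnect immediately. */
    synchronized void onWatchError(Throwable cause) {
        if (currentWatch != null) {
            currentWatch.close();
            currentWatch = null;
        }
    }

    /** Re-establish the watch only when a new resource is requested. */
    synchronized void requestResource() {
        if (currentWatch == null) {
            currentWatch = watchFactory.get();
        }
        // Creating the actual pod is omitted in this sketch.
    }

    synchronized boolean isWatching() {
        return currentWatch != null;
    }
}
```

With this shape, a burst of ResourceVersionTooOld errors across many jobs produces no immediate reconnects; each job reconnects only when it next needs a resource. The trade-off is that pod failures are not observed while the watch is down, which is why the proposal relies on the heartbeat timeout.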

Issue Links

is related to

FLINK-20417 Handle "Too old resource version" exception in Kubernetes watch more gracefully (Closed)

links to

GitHub Pull Request #24163

People

Assignee: xiaogang zhou

Reporter: xiaogang zhou

Votes: 0

Watchers: 6

Dates

Created: 03/Dec/23 14:38

Updated: 22/Feb/24 01:21

Resolved: 22/Feb/24 01:20