Details
- New Feature
- Status: Closed
- Major
- Resolution: Done
- None
Description
I hit a massive production problem when Kubernetes etcd became slow to respond. After Kubernetes recovered an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-watched upon receiving ResourceVersionTooOld, which put great pressure on the API server and caused it to fail again...
I am not sure whether it is necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method.
We could simply ignore the disconnection of the watch and re-create it only when requestResource is next called. We could then rely on the Akka heartbeat timeout to discover TaskManager failures, just as YARN mode does.
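To make the proposal concrete, here is a minimal sketch of the "ignore the disconnect, re-watch lazily" idea. It is not Flink's or fabric8's actual code: LazyRewatchSketch, its Watch interface, and the Supplier-based watch factory are hypothetical stand-ins, and only the method names handleError and requestResource echo the ones mentioned above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types are real Flink or fabric8 APIs. */
final class LazyRewatchSketch {

    /** Stand-in for an open pod watch against the API server. */
    interface Watch {
        void close();
    }

    private final Supplier<Watch> watchFactory; // opens a new watch (assumed to be expensive)
    private Watch current;                      // null means "no live watch"

    LazyRewatchSketch(Supplier<Watch> watchFactory) {
        this.watchFactory = watchFactory;
        this.current = watchFactory.get();
    }

    /** Called when the watch breaks, e.g. on "too old resource version". */
    synchronized void handleError(Throwable cause) {
        // Neither escalate the error nor re-watch right away:
        // just release the broken watch and remember that it is gone.
        if (current != null) {
            current.close();
            current = null;
        }
    }

    /** Called when a new worker is requested; re-creates the watch on demand. */
    synchronized void requestResource() {
        if (current == null) {
            current = watchFactory.get();
        }
        // Creating the pod itself is omitted from this sketch.
    }

    synchronized boolean isWatching() {
        return current != null;
    }

    public static void main(String[] args) {
        AtomicInteger opened = new AtomicInteger();
        LazyRewatchSketch driver = new LazyRewatchSketch(() -> {
            opened.incrementAndGet();
            return () -> { };
        });

        // Repeated watch failures do not trigger any new watch.
        driver.handleError(new RuntimeException("too old resource version"));
        driver.handleError(new RuntimeException("too old resource version"));
        System.out.println("watches opened after errors: " + opened.get());

        // The watch is re-created only when a resource is requested.
        driver.requestResource();
        System.out.println("watches opened after request: " + opened.get());
    }
}
```

With this shape, a job that requests no new resources after the API server recovers sends no additional watch request at all. The cost is that pod terminations are not observed while the watch is down, which is why the proposal leans on the heartbeat timeout to detect lost TaskManagers.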
Issue Links
- is related to: FLINK-20417 Handle "Too old resource version" exception in Kubernetes watch more gracefully (Closed)
- links to