Details
- Type: Improvement
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 1.11.2, 1.12.0
Description
Currently, when a watcher (pods watcher or ConfigMap watcher) is closed with an exception, we call WatchCallbackHandler#handleFatalError. This can cause the JobManager to terminate and then fail over.
In most cases this is correct, but not for the "too old resource version" exception. See more information here [1]. This exception usually occurs when the APIServer is restarted. In that case we just need to create a new watch and continue watching the pods/ConfigMaps. This would reduce the impact of a K8s cluster restart on the Flink cluster.
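The proposed behavior can be sketched as below. This is only an illustration, not the actual Flink change: `RewatchSketch`, `WatchClosedException`, `FatalErrorHandler`, `isTooOldResourceVersion`, and `onWatchClosed` are hypothetical names, and the sketch assumes the "too old resource version" condition is reported as HTTP 410 Gone. Only `handleFatalError` is taken from the description above.

```java
import java.net.HttpURLConnection;

public class RewatchSketch {

    /** Hypothetical stand-in for the exception reported when a watch is closed. */
    static class WatchClosedException extends RuntimeException {
        final int httpCode;

        WatchClosedException(int httpCode, String message) {
            super(message);
            this.httpCode = httpCode;
        }
    }

    /** Hypothetical, simplified view of a handler exposing handleFatalError. */
    interface FatalErrorHandler {
        void handleFatalError(Throwable error);
    }

    /** Assumption: "too old resource version" arrives as HTTP 410 Gone. */
    static boolean isTooOldResourceVersion(Throwable error) {
        return error instanceof WatchClosedException
                && ((WatchClosedException) error).httpCode == HttpURLConnection.HTTP_GONE;
    }

    /**
     * Decides what to do when a watch is closed with an exception:
     * recreate the watch for "too old resource version", otherwise
     * keep the existing behavior and report a fatal error.
     */
    static void onWatchClosed(
            Throwable error, Runnable createNewWatch, FatalErrorHandler handler) {
        if (isTooOldResourceVersion(error)) {
            createNewWatch.run();
        } else {
            handler.handleFatalError(error);
        }
    }

    public static void main(String[] args) {
        Runnable rewatch = () -> System.out.println("creating a new watch");
        FatalErrorHandler handler =
                e -> System.out.println("fatal error: " + e.getMessage());

        // Recoverable case: the watch is simply recreated.
        onWatchClosed(new WatchClosedException(410, "too old resource version"), rewatch, handler);
        // Any other failure still goes through the fatal-error path.
        onWatchClosed(new WatchClosedException(500, "internal error"), rewatch, handler);
    }
}
```

Keeping the decision in one small method means every other watch failure still reaches the fatal-error path unchanged, so only the recoverable case gets the new handling.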
This issue was inspired by this technical article [2]. Thanks to the folks at Tencent for the debugging. Note that the article is written in Chinese.
[1]. https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version
Issue Links
- relates to FLINK-33728 Do not rewatch when KubernetesResourceManagerDriver watch fail (Closed)