Currently, when a watcher (pods watcher, ConfigMap watcher) is closed with an exception, we call WatchCallbackHandler#handleFatalError. This could cause the JobManager to terminate and then fail over.
In most cases this is correct, but not for the "too old resource version" exception. See more information here. This exception usually happens when the APIServer is restarted. In that case we just need to create a new watch and continue watching the pods/ConfigMaps. This would help a Flink cluster reduce the impact of a K8s cluster restart.
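The decision described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (not Flink's actual classes); it assumes the watcher surfaces the HTTP status code of the close exception, where Kubernetes reports "too old resource version" as HTTP 410 GONE:

```java
// Hypothetical sketch: decide between re-creating the watch and a fatal error.
public class WatchRecoverySketch {

    // Kubernetes returns HTTP 410 GONE for "too old resource version".
    static final int HTTP_GONE = 410;

    static boolean isTooOldResourceVersion(int httpCode, String message) {
        return httpCode == HTTP_GONE
                || (message != null && message.contains("too old resource version"));
    }

    /** Classify a watch-closed event: recoverable (re-watch) or fatal. */
    static String onWatchClosed(int httpCode, String message) {
        if (isTooOldResourceVersion(httpCode, message)) {
            // Recoverable: start a fresh watch and keep going.
            return "RECREATE_WATCH";
        }
        // Everything else keeps the current behavior,
        // i.e. the WatchCallbackHandler#handleFatalError path.
        return "FATAL_ERROR";
    }

    public static void main(String[] args) {
        System.out.println(onWatchClosed(410, "too old resource version"));
        System.out.println(onWatchClosed(500, "internal server error"));
    }
}
```

In the real watcher this branch would live in the close/error callback, so only genuinely unrecoverable failures reach handleFatalError.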
This issue was inspired by this technical article. Thanks to the folks from Tencent for the debugging. Note that the article is written in Chinese.