[FLINK-20417] Handle "Too old resource version" exception in Kubernetes watch more gracefully - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.11.2, 1.12.0
Fix Version/s: 1.12.2, 1.13.0
Component/s: Deployment / Kubernetes
Labels:
- pull-request-available

Description

Currently, when a watcher (pods watcher or ConfigMap watcher) is closed with an exception, we call WatchCallbackHandler#handleFatalError. This can cause the JobManager to terminate and then fail over.

In most cases this is correct, but not for the "too old resource version" exception; see [1] for more information. This exception usually happens when the APIServer is restarted. In that case we only need to create a new watch and continue watching the pods/ConfigMaps. This would reduce the impact of a K8s cluster restart on the Flink cluster.
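A minimal sketch of the proposed behavior, not the actual Flink change: when a watch closes with an exception, check whether it is the "too old resource version" case (the APIServer reports it as HTTP 410 Gone) and re-create the watch instead of escalating. All names here (WatchRetrySketch, WatchClosedException, ResilientWatcher, FatalErrorHandler) are hypothetical stand-ins so that the example is self-contained; they are not Flink or Kubernetes client APIs.

```java
// Hypothetical, self-contained sketch; none of these types are Flink classes.
final class WatchRetrySketch {

    /** HTTP status the APIServer uses for "too old resource version". */
    static final int HTTP_GONE = 410;

    /** Stand-in for a client exception that carries the HTTP status code. */
    static class WatchClosedException extends RuntimeException {
        final int code;

        WatchClosedException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Stand-in for the fatal-error callback mentioned in the description. */
    interface FatalErrorHandler {
        void handleFatalError(Throwable cause);
    }

    /** Re-creates the watch on HTTP 410 and escalates every other failure. */
    static class ResilientWatcher {
        private final Runnable createWatch;
        private final FatalErrorHandler fatalErrorHandler;
        int watchesCreated = 0;

        ResilientWatcher(Runnable createWatch, FatalErrorHandler fatalErrorHandler) {
            this.createWatch = createWatch;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        /** Opens a (new) watch; called initially and after a recoverable close. */
        void start() {
            createWatch.run();
            watchesCreated++;
        }

        /** Invoked when the watch is closed; cause is null for a normal close. */
        void onClose(WatchClosedException cause) {
            if (cause == null) {
                return; // closed on purpose, nothing to do
            }
            if (cause.code == HTTP_GONE) {
                start(); // too old resource version: just watch again
            } else {
                fatalErrorHandler.handleFatalError(cause); // previous behavior
            }
        }
    }
}
```

With this shape, an APIServer restart that invalidates the resource version only costs one extra watch creation, while any other close reason still reaches the fatal-error handler as before.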

This issue was inspired by a technical article [2]. Thanks to the folks at Tencent for the debugging. Note that the article is written in Chinese.

[1]. https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version

[2]. https://cloud.tencent.com/developer/article/1731416

Issue Links

relates to

FLINK-33728 Do not rewatch when KubernetesResourceManagerDriver watch fail

Closed

links to

GitHub Pull Request #14837

People

Assignee: Yang Wang

Reporter: Yang Wang

Votes: 0

Watchers: 4

Dates

Created: 30/Nov/20 08:58

Updated: 11/Dec/23 16:33

Resolved: 08/Feb/21 16:37