FLINK-20417

Handle "Too old resource version" exception in Kubernetes watch more gracefully


    Details

      Description

      Currently, when a watcher (pods watcher or ConfigMap watcher) is closed with an exception, we call WatchCallbackHandler#handleFatalError. This can cause the JobManager to terminate and then fail over.

      In most cases this is correct, but not for the "too old resource version" exception; see [1] for more information. This exception can typically happen when the APIServer is restarted. In that case we just need to create a new watch and continue watching the pods/ConfigMaps. This would reduce the impact of a K8s cluster restart on the Flink cluster.
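
      The proposed behavior could be sketched roughly as follows. This is only an illustration, not the actual Flink implementation: WatchClosedException, WatchRestartSketch, createWatch and the counter fields are hypothetical names, and the sketch assumes the close exception exposes the HTTP status code, with "too old resource version" reported by the APIServer as 410 Gone.

```java
import java.net.HttpURLConnection;

/** Hypothetical stand-in for the exception a watcher is closed with; not a Flink or client type. */
class WatchClosedException extends RuntimeException {
    final int httpCode;

    WatchClosedException(int httpCode, String message) {
        super(message);
        this.httpCode = httpCode;
    }
}

/** Hypothetical watcher owner that re-creates the watch instead of failing on 410 Gone. */
class WatchRestartSketch {
    int watchesCreated = 0;      // how many times the watch was (re-)created
    Throwable fatalError = null; // set when the error is treated as fatal

    /** Creates or re-creates the watch; a real version would call the Kubernetes client here. */
    void createWatch() {
        watchesCreated++;
    }

    /** Invoked when the watch is closed with an exception. */
    void onClose(WatchClosedException cause) {
        if (cause.httpCode == HttpURLConnection.HTTP_GONE) {
            // "too old resource version": the watch is stale, so start a new one.
            createWatch();
        } else {
            // Any other failure keeps the current behavior.
            handleFatalError(cause);
        }
    }

    /** Stand-in for WatchCallbackHandler#handleFatalError; here it only records the error. */
    void handleFatalError(Throwable throwable) {
        fatalError = throwable;
    }
}
```

      With this shape, a 410 close only increments the watch count, while any other close still ends up in the fatal-error path, so the existing behavior is preserved for genuine failures.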

       

      This issue was inspired by a technical article [2]. Thanks to the engineers at Tencent for the debugging. Note that the article is written in Chinese.

       

      [1]. https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version

      [2]. https://cloud.tencent.com/developer/article/1731416


            People

            • Assignee:
              fly_in_gis Yang Wang
              Reporter:
              fly_in_gis Yang Wang
