Flink / FLINK-15788 Various Kubernetes integration improvements / FLINK-15836

Throw fatal error in KubernetesResourceManager when the pods watcher is closed with exception

    Details

      Description

      As discussed in the PR [1], if users configure watchReconnectLimit via Java system properties or environment variables, the watch may be stopped, and subsequent pod changes will no longer be processed properly. We therefore need to throw a fatal error in KubernetesResourceManager when the pods watcher is closed with an exception.
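
      The proposed behaviour can be sketched as follows. This is a minimal, self-contained illustration and not the actual Flink code: `FatalErrorSink` and `PodsWatchCloseHandler` are hypothetical names introduced here, and the convention that a null cause means a normal close is an assumption of the sketch.

```java
/** Hypothetical stand-in for whatever component terminates the process on unrecoverable errors. */
interface FatalErrorSink {
    void onFatalError(Throwable cause);
}

/** Hypothetical callback that the resource manager would register for pods watcher close events. */
class PodsWatchCloseHandler {
    private final FatalErrorSink sink;

    PodsWatchCloseHandler(FatalErrorSink sink) {
        this.sink = sink;
    }

    /**
     * Assumption of this sketch: cause is null on a normal close (for example, shutdown)
     * and non-null when the watcher was closed exceptionally.
     */
    void onClose(Exception cause) {
        if (cause == null) {
            // Normal close: nothing to report.
            return;
        }
        // Exceptional close: pod changes would be missed from now on, so escalate
        // instead of silently continuing without a watcher.
        sink.onFatalError(
                new IllegalStateException("The pods watcher was closed with an exception", cause));
    }
}
```

      With this shape, a normal close is ignored, while an exceptional close is escalated so that the process can fail and be restarted.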

       

      > Why not create a new watcher in KubernetesResourceManager when the old one is closed exceptionally?

      According to the WatchConnectionManager implementation in the fabric8 Kubernetes client, when the web socket is closed exceptionally, it checks the reconnectLimit and schedules a reconnect if needed. After a successful reconnect, currentReconnectAttempt is reset to 0. By default, it retries forever. When users explicitly specify a reconnectLimit, we should respect it.
      Another reason is that the web socket is usually closed exceptionally because of network problems or port abuse. In such situations, it is better to fail the jobmanager pod and retry in a new one.
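
      The reconnect bookkeeping described above can be modelled roughly as below. This is a simplified sketch and not the fabric8 source: the class `ReconnectPolicy` and its method names are hypothetical, and treating a negative limit as "retry forever" is an assumption made for the illustration.

```java
/** Simplified model of the reconnect bookkeeping; not taken from the fabric8 client. */
class ReconnectPolicy {
    // Assumption of this sketch: a negative limit means "retry forever".
    private final int reconnectLimit;
    private int currentReconnectAttempt = 0;

    ReconnectPolicy(int reconnectLimit) {
        this.reconnectLimit = reconnectLimit;
    }

    /**
     * Called when the web socket is closed exceptionally.
     * Returns true if another reconnect should be scheduled, false if the limit is exhausted.
     */
    boolean shouldReconnect() {
        if (reconnectLimit >= 0 && currentReconnectAttempt >= reconnectLimit) {
            return false;
        }
        currentReconnectAttempt++;
        return true;
    }

    /** Called after a successful reconnect: the attempt counter starts over. */
    void onReconnected() {
        currentReconnectAttempt = 0;
    }
}
```

      Once `shouldReconnect()` returns false, the watch is closed for good. That is the point at which the resource manager should fail rather than create a new watcher that would bypass the user's configured limit.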

       

      [1]. https://github.com/apache/flink/pull/10965#discussion_r373491974

              People

              • Assignee: Yang Wang (fly_in_gis)
              • Reporter: Yang Wang (fly_in_gis)