FLINK-22014

Flink JobManager failed to restart after failure in Kubernetes HA setup


Details

    Description

      After the JobManager pod failed and a new one started, the new pod could not recover the jobs because the recovery data was missing from storage: the config map pointed at a file that did not exist.
       
      As a result, the JobManager pod entered the `CrashLoopBackOff` state and could not recover. Every attempt failed with the same error, so the whole cluster became unrecoverable and stopped operating.
       
      I had to manually delete the config map and start the jobs again without a savepoint.
       
      When I later tried to reproduce the failure by deleting the JobManager pod manually, the new pod recovered correctly every time, so I could not reproduce the issue artificially.
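
      To illustrate why every restart failed in the same way, here is a toy model in Python. It is only a sketch of the reasoning above, not Flink code: `start_job_manager`, `restart_until_healthy`, the in-memory `config_map` dict and the `storage` set are hypothetical names, and the shortened path is a placeholder. The point is that recovery is deterministic: as long as the config map keeps a pointer to a file that is missing from storage, each new pod hits the same error, and only removing the pointer breaks the loop.

```python
# Toy model only: these names are hypothetical and are NOT Flink classes or APIs.

def start_job_manager(config_map, storage):
    """Simulate one JobManager start: fail if any stored pointer is dangling."""
    for key, path in config_map.items():
        if path not in storage:
            # Mirrors the FileNotFoundException at the bottom of the real trace.
            raise FileNotFoundError(f"{key} -> {path}")
    return sorted(config_map)


def restart_until_healthy(config_map, storage, max_restarts):
    """Count failed starts; nothing changes between attempts, as in CrashLoopBackOff."""
    failures = 0
    for _ in range(max_restarts):
        try:
            start_job_manager(config_map, storage)
            return failures
        except FileNotFoundError:
            failures += 1
    return failures


# The config map still references a job graph file (placeholder path) ...
config_map = {
    "jobGraph-198c46bac791e73ebcc565a550fa4ff6": "recovery/submittedJobGraph6797768d0737"
}
# ... but the file is absent from storage.
storage = set()

failed_before_cleanup = restart_until_healthy(config_map, storage, max_restarts=5)

# Manual workaround from the description: delete the config map entries.
config_map.clear()
failed_after_cleanup = restart_until_healthy(config_map, storage, max_restarts=5)
```

      In this model all five restarts fail before the cleanup and none fail afterwards, at the cost of losing the stored job graphs, which matches having to resubmit the jobs.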
       
      Below is the failure log:

      2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
      2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
      2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
      2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
      2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
      2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
      2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
      java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
         at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
         at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
         at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
         at java.lang.Thread.run(Unknown Source) [?:?]
      Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         ... 4 more
      Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
         at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         ... 4 more
      Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
         at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
         at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
         at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
         at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
         at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
         at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
         at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
         ... 4 more
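
      For triage, the two facts that matter in the trace above are the job id that failed to recover and the file the state handle points at. A minimal Python sketch for pulling both out of a saved JobManager log follows. The function name `summarize_recovery_failure` and the regular expressions are my own assumptions, written against the message wording seen in this 1.12.2 log; other Flink versions may word the messages differently.

```python
import re

# Patterns are assumptions based on the message text in this particular log.
JOB_ID_RE = re.compile(r"Could not recover job with job id ([0-9a-f]+)")
MISSING_FILE_RE = re.compile(
    r"java\.io\.FileNotFoundException: No such file or directory: (\S+)"
)


def summarize_recovery_failure(log_text):
    """Return (job_id, missing_path); an element is None if it was not found."""
    job = JOB_ID_RE.search(log_text)
    missing = MISSING_FILE_RE.search(log_text)
    return (
        job.group(1) if job else None,
        missing.group(1) if missing else None,
    )


# Two lines copied from the failure log in this issue.
sample = (
    "Caused by: org.apache.flink.util.FlinkRuntimeException: "
    "Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.\n"
    "Caused by: java.io.FileNotFoundException: No such file or directory: "
    "s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/"
    "submittedJobGraph6797768d0737\n"
)

job_id, missing_path = summarize_recovery_failure(sample)
print(job_id, missing_path)
```

      Only the first match of each pattern is returned, which is enough here because recovery aborts at the first job that cannot be restored.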
      

      Attachments

        1. scalyr-logs (1).txt
          240 kB
          Mikalai Lushchytski
        2. image-2021-04-19-11-17-58-215.png
          236 kB
          Mikalai Lushchytski
        3. flink-logs.txt.zip
          186 kB
          Mikalai Lushchytski


            People

              Assignee: Unassigned
              Reporter: Mikalai Lushchytski (mlushchytski)
              Votes: 0
              Watchers: 9
