FLINK-21685

Flink JobManager fails to restore jobs from checkpoint after restart in Kubernetes HA setup


    Details

      Description

      We run a Flink session cluster on Kubernetes in HA mode (1 JobManager and 4 TaskManagers). When the JobManager is restarted while jobs are running, it fails to recover the jobs from the checkpoint. The JobManager log after the restart shows:

      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage. 
      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage. 
      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1. 
      2021-03-08 13:16:43,014 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1. 
      2021-03-08 13:16:43,023 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master state to restore 
      2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 
      2021-03-08 13:16:43,046 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2. 
      2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host 
      2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
      

      The log and our configuration are attached.
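      In the excerpt above, the address whose association failed (10.2.174.188:6123) differs from the address at which the JobManager was granted leadership (10.2.179.12:6123). The following is a minimal sketch for pulling both kinds of address out of a longer log, assuming the Akka line formats shown in the excerpt. The function names are hypothetical helpers and not part of Flink.

```python
import re

# Both patterns assume the log line formats shown in the excerpt above.
LEADER_RE = re.compile(
    r"was granted leadership .* at akka\.tcp://flink@([\d.]+):(\d+)"
)
FAILED_RE = re.compile(
    r"Association with remote system \[akka\.tcp://flink@([\d.]+):(\d+)\] has failed"
)


def summarize_addresses(log_text):
    """Return (leader_addresses, failed_addresses) as sets of 'host:port' strings."""
    leaders, failed = set(), set()
    for line in log_text.splitlines():
        match = LEADER_RE.search(line)
        if match:
            leaders.add(f"{match.group(1)}:{match.group(2)}")
        match = FAILED_RE.search(line)
        if match:
            failed.add(f"{match.group(1)}:{match.group(2)}")
    return leaders, failed


def unreachable_non_leader(log_text):
    """Return failed addresses that never appear as a leader address, sorted."""
    leaders, failed = summarize_addresses(log_text)
    return sorted(failed - leaders)
```

      Calling `unreachable_non_leader(open(path).read())` on a JobManager log (where `path` is a placeholder for the log file) lists the addresses the process tried and failed to reach that are not its own leader address. Whether such an address belongs to a previous JobManager pod has to be checked against the cluster itself.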

       

        Attachments

        1. jstack.jm.1
          180 kB
          Yang Wang


              People

              • Assignee: Yang Wang (fly_in_gis)
              • Reporter: Peng Zhang (petrizhang)
