Flink / FLINK-21685

Flink JobManager failed to restart from checkpoint in kubernetes HA setup


Details

    Description

      We run a Flink session cluster on Kubernetes in HA mode (1 JobManager and 4 TaskManagers). When the JobManager restarts while jobs are running, it fails to recover the jobs from the latest checkpoint:

      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage. 
      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage. 
      2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1. 
      2021-03-08 13:16:43,014 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1. 
      2021-03-08 13:16:43,023 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master state to restore 
      2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 
      2021-03-08 13:16:43,046 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2. 
      2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host 
      2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
      

      The full log and our configuration are attached.
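For anyone triaging the attached log, a minimal sketch of how the unreachable Akka addresses could be pulled out of it. It assumes the log keeps the same `ReliableDeliverySupervisor` line format as the excerpt above; the function name `find_failed_associations` and the regular expression are illustrative and not part of Flink or Akka.

```python
import re

# Matches the Akka warning logged when a remote actor system cannot be reached.
# The pattern is derived from the excerpt above and is an assumption about the log format.
ASSOCIATION_FAILED = re.compile(
    r"Association with remote system \[(?P<address>akka\.tcp://[^\]]+)\] has failed"
    r".*?Caused by: \[(?P<cause>[^\]]+)\]"
)


def find_failed_associations(log_lines):
    """Return (address, cause) pairs for every failed Akka association."""
    failures = []
    for line in log_lines:
        match = ASSOCIATION_FAILED.search(line)
        if match:
            failures.append((match.group("address"), match.group("cause")))
    return failures


# Hypothetical usage with one line copied from the excerpt above.
sample = [
    "2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor [] - "
    "Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, "
    "address is now gated for [50] ms. Reason: [Association failed with "
    "[akka.tcp://flink@10.2.174.188:6123]] Caused by: "
    "[java.net.NoRouteToHostException: No route to host]"
]
for address, cause in find_failed_associations(sample):
    print(address, "->", cause)
```

Comparing the extracted addresses with the one at which the JobManager was granted leadership (`10.2.179.12` in the excerpt) shows which endpoints the restarted process is still trying to reach.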

       



          People

            wangyang0918 Yang Wang
            petrizhang Peng Zhang
            Votes: 0
            Watchers: 5

