  1. Flink
  2. FLINK-18785

Flink deadlocks in leader election when restoring from a nonexistent checkpoint/savepoint path


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Information Provided
    • Affects Version/s: 1.10.0, 1.10.1
    • None
    • None
    • Environment: flink on yarn

      flink-1.10.x

      jdk8

      flink-conf.yaml yarn.application-attempts: 2 (or just delete this config)

      yarn-2.7.2

    Description

      Flink deadlocks in leader election when restoring from a checkpoint/savepoint path that does not exist.

      I ran this command:
      bin/flink run -m yarn-cluster -s "hdfs:///do/not/exist/path" examples/streaming/WindowJoin.jar
      When I open the web UI, I see the state shown in the attached screenshots.

      In flink-1.9.3, the program simply exits. In 1.10.x, however, it gets stuck in leader election.

       

      Here is the stack trace in `jobmanager.err`:

      ERROR ConnectionState Authentication failed
      ERROR ClusterEntrypoint Fatal error occurred in the cluster entrypoint.
      org.apache.flink.runtime.dispatcher.DispatcherException: Could not start recovered job 94b0911af12b61d3ee905xxxxxxxxbaf1.
      at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$handleRecoveredJobStartError$0(Dispatcher.java:218)
      at org.apache.flink.runtime.dispatcher.Dispatcher$$Lambda$128/130098676.apply(Unknown Source)
      at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
      at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
      at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
      at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:739)
      at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
      at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
      at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
      at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
      at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
      at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$60/278409878.apply(Unknown Source)
      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
      at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
      at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
      at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
      at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
      at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
      at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
      at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
      at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
      at akka.actor.ActorCell.invoke(ActorCell.scala:561)
      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
      at akka.dispatch.Mailbox.run(Mailbox.scala:225)
      at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
      at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
      at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
      at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
      at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1584)
      at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
      ... 4 more
      Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
      at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
      at org.apache.flink.util.function.CheckedSupplier$$Lambda$125/1775358927.get(Unknown Source)
      at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1582)
      ... 6 more
      Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
      at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152)
      at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
      at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:381)
      at org.apache.flink.runtime.dispatcher.Dispatcher$$Lambda$124/874035545.get(Unknown Source)
      at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
      ... 8 more
      Caused by: java.io.FileNotFoundException: Cannot find checkpoint or savepoint file/directory 'hdfs:///path/do/not/exist' on file system 'hdfs'.
      at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:243)
      at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110)
      at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1152)
      at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:307)
      at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:240)
      at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:216)
      at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
      at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
      at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
      at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
      at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
      at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
      at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
      ... 12 more
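
      The trace above nests several "Caused by" levels, and the informative one is the innermost (the FileNotFoundException for the missing path). As a minimal sketch of how that root cause can be pulled out programmatically, the following walks the getCause() chain of an exception. The class name RootCauseDemo and the stand-in exceptions are hypothetical and for illustration only; the Flink exception types from the trace are replaced here by plain JDK exceptions so the example stays self-contained.

```java
import java.io.FileNotFoundException;
import java.util.concurrent.CompletionException;

public class RootCauseDemo {

    /** Follows getCause() links until the innermost exception is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic the nesting in the trace above.
        // Assumption: JDK types are used in place of the Flink classes
        // (DispatcherException, JobExecutionException).
        Throwable notFound = new FileNotFoundException(
                "Cannot find checkpoint or savepoint file/directory");
        Throwable setup = new Exception("Could not set up JobManager", notFound);
        Throwable wrapped = new CompletionException(new RuntimeException(setup));
        Throwable top = new Exception("Could not start recovered job", wrapped);

        // Print only the innermost cause, which names the missing path.
        Throwable root = rootCause(top);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

      Reading the log the same way by hand means jumping to the last "Caused by" entry rather than the first ERROR line.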

      Attachments

        1. flink_savepoint_path_do_not_exits.jpg
          223 kB
          Kai Chen
        2. jobmanager.log.attemp2-13
          45 kB
          Kai Chen
        3. jobmanager.log.attemp1
          27 kB
          Kai Chen
        4. image-2020-07-31-19-04-19-241.png
          6 kB
          Kai Chen


          People

            Assignee: Unassigned
            Reporter: Kai Chen (yuchuanchen)
            Votes: 0
            Watchers: 4
