Flink / FLINK-14038

ExecutionGraph deploy failed due to akka timeout


Details

    Description

      When launching the Flink application, the following error was reported. I downloaded the operator logs but still have no clue: they contain no useful information, and the task was simply cancelled.

      JobManager logs:

      java.lang.IllegalStateException: Update task on TaskManager container_e860_1567429198842_571077_01_000006 @ zjy-hadoop-prc-st320.bj (dataPort=50990) failed due to:
      	at org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
      	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
      	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
      	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
      	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
      	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
      	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
      	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
      	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
      	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
      	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
      	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
      	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
      	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
      	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
      	at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
      	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
      	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
      	at akka.actor.ActorCell.invoke(ActorCell.scala:561)
      	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
      	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
      	at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
      	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
      Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
      	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
      	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
      	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
      	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
      	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
      	at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
      	at akka.dispatch.OnComplete.internal(Future.scala:263)
      	at akka.dispatch.OnComplete.internal(Future.scala:261)
      	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
      	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
      	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
      	at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
      	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
      	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
      	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
      	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
      	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
      	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
      	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
      	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
      	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
      	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
      	at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
      	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
      	... 9 more
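
The root cause in the trace above is an `AskTimeoutException` after 10000 ms, which appears to match the default of Flink's `akka.ask.timeout` setting in this era (an assumption worth checking against the deployed configuration). A minimal sketch for pulling the target actor and the timeout out of such a message, so that repeated failures can be grouped by TaskManager; the helper name `parse_ask_timeout` is hypothetical and the pattern is derived only from the message format shown above:

```python
import re

# Pattern for the ask-timeout message, based on the format seen in the trace above.
ASK_TIMEOUT = re.compile(
    r"Ask timed out on \[Actor\[(?P<actor>[^\]]+)\]\] after \[(?P<millis>\d+) ms\]"
)


def parse_ask_timeout(line):
    """Return (actor_path, timeout_ms) if the line reports an ask timeout, else None."""
    match = ASK_TIMEOUT.search(line)
    if match is None:
        return None
    return match.group("actor"), int(match.group("millis"))


# Sample line copied from the JobManager trace (message tail shortened).
sample = (
    "Caused by: akka.pattern.AskTimeoutException: Ask timed out on "
    "[Actor[akka.tcp://flink@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] "
    "after [10000 ms]. Message of type "
    "[org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]."
)

if __name__ == "__main__":
    print(parse_ask_timeout(sample))
```

If the same actor path shows up in every timeout, the problem is more likely local to that TaskManager than to the cluster network as a whole.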
      

      Operator (TaskManager) logs:

      2019-09-09 18:34:06,867 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task Partition (4/5).
      2019-09-09 18:34:06,868 INFO  org.apache.flink.runtime.taskmanager.Task                     - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from CREATED to DEPLOYING.
      2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING]
      2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
      2019-09-09 18:34:06,871 INFO  org.apache.flink.runtime.taskmanager.Task                     - Registering task at network: Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
      2019-09-09 18:34:07,075 INFO  org.apache.flink.runtime.taskmanager.Task                     - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from DEPLOYING to RUNNING.
      2019-09-09 18:34:07,255 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task Sort-Partition (4/5).
      2019-09-09 18:34:07,258 INFO  org.apache.flink.runtime.taskmanager.Task                     - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from CREATED to DEPLOYING.
      2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING]
      2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
      2019-09-09 18:34:07,263 INFO  org.apache.flink.runtime.taskmanager.Task                     - Registering task at network: Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
      2019-09-09 18:34:07,303 INFO  org.apache.flink.runtime.taskmanager.Task                     - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from DEPLOYING to RUNNING.
      2019-09-09 18:34:54,625 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
      2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68) switched from RUNNING to CANCELING.
      2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
      
      

      I checked the network and it is fine, so maybe there is a problem with the TaskManager?
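
One way to probe the TaskManager hypothesis is to look for long silences in the operator log. In the excerpt above nothing is logged between 18:34:07,303 and 18:34:54,625, which is far longer than the 10000 ms ask timeout. The sketch below flags such gaps; the function name `find_log_gaps` and the 10-second threshold are illustrative assumptions, and a gap at INFO level is only a hint (for example of a long GC pause or a blocked thread), not proof.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, threshold=timedelta(seconds=10)):
    """Return (previous, current, gap) for consecutive timestamps further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Flink's default log prefix looks like "2019-09-09 18:34:07,303".
            current = datetime.strptime(line.strip()[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # skip lines without a leading timestamp, such as stack frames
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


# Three lines from the operator log above, with the messages shortened.
sample_log = [
    "2019-09-09 18:34:07,263 INFO  Task - Registering task at network: Sort-Partition (4/5)",
    "2019-09-09 18:34:07,303 INFO  Task - Sort-Partition (4/5) switched from DEPLOYING to RUNNING.",
    "2019-09-09 18:34:54,625 INFO  Task - Attempting to cancel task DataSource (5/5)",
]

if __name__ == "__main__":
    for start, end, gap in find_log_gaps(sample_log):
        print(f"{start} -> {end}: silent for {gap.total_seconds()} s")
```

Running the same check against the TaskManager's GC log, if one is enabled, would show whether the silence coincides with a pause.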

          People

            liupengcheng