Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-16429

failed to restore flink job from checkpoints due to unhandled exceptions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.9.1
    • 1.12.0
    • Runtime / Coordination
    • None

    Description

      We are trying to restore our flink job from check-points, and run into AskTimeoutException related failures at a high frequency. Our environment is Hadoop 2.7.1 + Yarn + Flink 1.9.1. 

      We hit this issue in 9 out of 10 runs, and were able to restore the application from given check-points from time to time. As the application can be restored, the check-point files shall not be corrupted. It seems that the issue is that jobmaster got timeout when it handles job submission request.  

       

      Below is the exception stack trace, it is thrown from

      https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209

      2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44]) akka.pattern.AskTimeoutException: Ask timed out on Actor[akka://flink/user/dispatcher#-34498396] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply. at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635
      at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635
      at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648
      at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205
      at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601
      at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109
      at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599
      at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328
      at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279
      at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283
      at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235
      at java.lang.Thread.run(Thread.java:748 undefined)

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              yuyang08 Yu Yang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: