Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.9.1
- None
Description
We are trying to restore our Flink job from checkpoints and frequently run into AskTimeoutException failures. Our environment is Hadoop 2.7.1 + YARN + Flink 1.9.1.
We hit this issue in 9 out of 10 runs and only occasionally managed to restore the application from the given checkpoints. Since the application can sometimes be restored, the checkpoint files should not be corrupted. The issue seems to be that the JobMaster times out while handling the job submission request.
Below is the exception stack trace:
2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44])
akka.pattern.AskTimeoutException: Ask timed out on Actor[akka://flink/user/dispatcher#-34498396] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
    at java.lang.Thread.run(Thread.java:748)
Issue Links
- is fixed by: FLINK-16866 Make job submission non-blocking (Status: Closed)
- is related to: FLINK-16018 Improve error reporting when submitting batch job (instead of AskTimeoutException) (Status: Resolved)
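For readers unfamiliar with this failure mode, the following is a minimal toy model of the pattern the description and the title of FLINK-16866 point at. It is not Flink's actual code: the class `AskTimeoutSketch`, its methods, and the scaled-down durations are all hypothetical, and the assumption that slow restore work runs on the endpoint's main thread is an illustration, not something confirmed by this report. A single-threaded executor stands in for the dispatcher's main thread, and the caller waits for a reply with a bounded timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model (hypothetical, not Flink code) of an endpoint with one main thread. */
class AskTimeoutSketch {

    /** Stand-in for slow work during submission, e.g. reading checkpoint data. */
    static void slowRestore(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Blocking variant: the main thread finishes the slow work before replying. */
    static CompletableFuture<String> submitBlocking(ExecutorService mainThread, long workMillis) {
        return CompletableFuture.supplyAsync(() -> {
            slowRestore(workMillis);
            return "ACK";
        }, mainThread);
    }

    /** Non-blocking variant: hand the slow work to another executor, reply at once. */
    static CompletableFuture<String> submitNonBlocking(
            ExecutorService mainThread, ExecutorService ioExecutor, long workMillis) {
        return CompletableFuture.supplyAsync(() -> {
            ioExecutor.execute(() -> slowRestore(workMillis));
            return "ACK";
        }, mainThread);
    }

    /** Caller side: wait for the reply, but give up once the ask timeout elapses. */
    static String ask(CompletableFuture<String> reply, long timeoutMillis) {
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMEOUT";
        } catch (InterruptedException | ExecutionException e) {
            return "FAILED";
        }
    }

    /** Runs one submission against fresh executors and returns the caller's view. */
    static String run(boolean blocking, long workMillis, long timeoutMillis) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> reply = blocking
                    ? submitBlocking(mainThread, workMillis)
                    : submitNonBlocking(mainThread, ioExecutor, workMillis);
            return ask(reply, timeoutMillis);
        } finally {
            // Interrupt any work still running so the sketch exits promptly.
            mainThread.shutdownNow();
            ioExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Scaled-down numbers: 500 ms of work against a 100 ms ask timeout.
        System.out.println("blocking:     " + run(true, 500, 100));
        System.out.println("non-blocking: " + run(false, 500, 100));
    }
}
```

With these numbers the blocking variant should report a timeout while the non-blocking one should be acknowledged, which mirrors why the failure would depend on how long the restore work takes in a given run rather than on corrupted checkpoint files.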