Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
The scenario is like this:
JobMaster tries to cancel all the executions when process failed execution, and the task executor already acknowledge the cancel rpc message.
When notify the final state in TaskExecutor, it causes OOM in AkkaRpcActor and this error is caught to log the info. The final state will not be sent any more.
The JobMaster can not receive the final state and trigger the restart strategy.
One solution is to catch the OutOfMemoryError and throw it, then it will cause to shut down the ActorSystem resulting in exiting the TaskExecutor. The JobMaster can be notified of TaskExecutor failure and fail all the tasks to trigger restart successfully.
Attachments
Issue Links
- links to