Github user zhijiangW commented on the issue:
Hi @tillrohrmann, thank you for the review and the helpful suggestions!
Let me first explain the root cause of this issue:
On the JobMaster side, it sends the cancel RPC message and receives the acknowledgement from the TaskExecutor, and then the execution state transitions to *CANCELING*.
On the TaskExecutor side, the final state is reported to the JobMaster before the task exits. The *notifyFinalState* call can be divided into two steps:
- The *RunAsync* message is executed by the Akka actor. This is a tell action, and it triggers *unregisterTaskAndNotifyFinalState*.
- During *unregisterTaskAndNotifyFinalState*, the *updateTaskExecutionState* RPC message is sent. This is an ask action, so this mechanism can avoid lost messages.
The problem is that an OOM may occur before *updateTaskExecutionState* is triggered. This error is caught by *AkkaRpcActor*, which does nothing with it, so the rest of the process is interrupted and *updateTaskExecutionState* is never executed.
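
To make the failure mode concrete, here is a minimal sketch of it. It is not Flink's actual code: `SwallowingActorSketch`, `handleRunAsync`, the `failBeforeUpdate` flag, and the `sentToJobMaster` list are all hypothetical stand-ins, and the OOM is simulated by throwing an `OutOfMemoryError` directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the actor that runs RunAsync messages; not Flink's AkkaRpcActor.
class SwallowingActorSketch {
    // Records what would have been sent to the JobMaster.
    final List<String> sentToJobMaster = new ArrayList<>();

    // Mimics the described behavior: any Throwable from the runnable is caught and dropped.
    void handleRunAsync(Runnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            // Nothing is rethrown or reported, so nobody learns about the failure.
        }
    }

    // Simplified version of the second step; the flag simulates an OOM before the ask.
    void unregisterTaskAndNotifyFinalState(boolean failBeforeUpdate) {
        if (failBeforeUpdate) {
            throw new OutOfMemoryError("simulated");
        }
        sentToJobMaster.add("updateTaskExecutionState(CANCELED)");
    }

    public static void main(String[] args) {
        SwallowingActorSketch actor = new SwallowingActorSketch();
        actor.handleRunAsync(() -> actor.unregisterTaskAndNotifyFinalState(true));
        // The final state was never reported, so the JobMaster would stay in CANCELING.
        System.out.println("messages sent: " + actor.sentToJobMaster.size());
    }
}
```

In this sketch the ask is never reached, so its retry or timeout handling cannot help: the message is lost before it is ever sent.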
For the key interactions between the TaskExecutor and the JobMaster, lost messages should not be tolerated, and I agree with your suggestions above. So there may be two ideas for this improvement:
- Enhance the robustness of *notifyFinalState*. Rethrowing the OOM, as the current change does, is an easy option, but it causes the TaskExecutor to exit, so there should be other ways to reduce the cost.
- After the JobMaster receives the cancel acknowledgement, it starts a timeout to check the final state of the execution. If the execution has not entered a final state within the timeout, the JobMaster can resend the message to the TaskExecutor to confirm the status.
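
The second idea could be sketched roughly as follows. This is only an illustration under my own assumptions: `CancelTimeoutCheckSketch`, its `State` enum, and the `resendToTaskExecutor` callback are hypothetical names, and a plain `ScheduledExecutorService` stands in for whatever scheduling the JobMaster would really use.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical JobMaster-side helper; names and states are illustrative, not Flink's API.
class CancelTimeoutCheckSketch {
    enum State { CANCELING, CANCELED }

    final AtomicReference<State> state = new AtomicReference<>(State.CANCELING);
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Called once the cancel acknowledgement arrives: arm a one-shot check.
    void onCancelAcknowledged(long timeoutMillis, Runnable resendToTaskExecutor) {
        timer.schedule(() -> {
            // Only resend if the final state has not been reported in the meantime.
            if (state.get() == State.CANCELING) {
                resendToTaskExecutor.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Called when updateTaskExecutionState reports the final state.
    void onFinalState() {
        state.set(State.CANCELED);
    }

    void shutdown() {
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        CancelTimeoutCheckSketch check = new CancelTimeoutCheckSketch();
        check.onCancelAcknowledged(100, () -> System.out.println("resending to TaskExecutor"));
        Thread.sleep(300); // the final state never arrives, so the check fires
        check.shutdown();
    }
}
```

Even in this small form, the JobMaster needs extra state, a timer, and a resend path, which is the added interaction cost mentioned above.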
I prefer the first approach, because it keeps the fix on one side and avoids complex interaction between the TaskExecutor and the JobMaster.
I look forward to your further suggestions or any other ideas.