Description
In stress testing, we have seen a scenario in which the master/update task has finished running and most of the other tasks have also completed, but the driver then receives a failed Evaluator event. This moves the system to the shutting-down state and starts a recovery that is not necessary.
When the driver receives an ICompletedTask event from the master task while the system is in the running state, the calculation has completed and the result has been written to the output. Any FailedEvaluator or FailedTask event received after that point should be ignored, and the driver should execute DoneAction to dispose all the contexts and shut down the system.
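The guard described above amounts to a state check in the driver's event handlers. The following is a minimal Java sketch under stated assumptions, not the REEF IMRU driver code (which is C#). The `SystemState` values, the `ImruDriverSketch` class, its `onCompletedTask`/`onFailedEvaluator`/`onFailedTask` handlers and `doneAction` are all hypothetical stand-ins for the driver's state machine, its ICompletedTask/FailedEvaluator/FailedTask handlers, and DoneAction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical states; the real IMRU driver defines its own state machine.
enum SystemState { RUNNING, SHUTTING_DOWN, TASKS_COMPLETED }

// Hypothetical driver sketch: models only the guard proposed in this issue.
class ImruDriverSketch {
    private SystemState state = SystemState.RUNNING;
    private final List<String> log = new ArrayList<>();

    SystemState state() { return state; }
    List<String> log() { return log; }

    // A completed master/update task while RUNNING means the result is already
    // written to the output, so mark the job done and run the done action.
    void onCompletedTask(String taskId, boolean isMasterTask) {
        if (isMasterTask && state == SystemState.RUNNING) {
            state = SystemState.TASKS_COMPLETED;
            doneAction();
        }
    }

    void onFailedEvaluator(String evaluatorId) {
        handleFailure("evaluator", evaluatorId);
    }

    void onFailedTask(String taskId) {
        handleFailure("task", taskId);
    }

    // Failures arriving after the master task completed are ignored; earlier
    // failures still move the system to SHUTTING_DOWN to start recovery.
    private void handleFailure(String kind, String id) {
        if (state == SystemState.TASKS_COMPLETED) {
            log.add("ignored failed " + kind + " " + id);
            return;
        }
        state = SystemState.SHUTTING_DOWN;
        log.add("recovery started for failed " + kind + " " + id);
    }

    // Stand-in for DoneAction: the real action disposes all contexts and shuts down.
    private void doneAction() {
        log.add("disposing all contexts and shutting down");
    }
}
```

With this check, a failure reported before the master task completes still takes the recovery path; only failures that arrive after completion are dropped, so the scenario from stress testing no longer triggers an unnecessary recovery.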
Issue Links
- Is contained by: REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)