Description
Investigation of REEF-1325 shows an unexpected sequence of events on the local runtime:
- The evaluator crashes with an unhandled exception (shown in the evaluator.stderr and evaluator.stdout files).
- The driver receives an IFailedEvaluator event that has no associated FailedTask object.
- The task continues running and completes successfully.
- The driver receives an ICompletedTask event.
By design, a failed evaluator should not allow a task to complete successfully.
This can be reproduced using the TestPoisonedEvaluatorStartHanlder test.
Update:
The root cause is that the Evaluator does not close itself properly and lets the Exception propagate upwards, so the RuntimeStopHandler is never invoked. Because the user's ITask is spun off as a fire-and-forget System.Threading.Tasks.Task, its execution is independent of the main Evaluator thread. When the ITask finishes, it therefore sends a Heartbeat to the Driver reporting completion, even though the Evaluator has already failed. The fix catches the Evaluator failure and propagates the Exception to the RuntimeStopHandler, and it closes the ContextManager and HeartbeatManager once the Exception surfaces.
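The race described above can be illustrated with a minimal sketch. This is a simplified Python model, not REEF code: the HeartbeatManager class, the run_evaluator function, the close_on_failure flag and the message strings are all hypothetical stand-ins chosen for illustration. It assumes the failure is reported to the Driver in both variants and only varies whether the heartbeat channel is closed afterwards.

```python
import threading


class HeartbeatManager:
    """Hypothetical stand-in that records messages reaching the Driver."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self.sent = []

    def send(self, message):
        with self._lock:
            if self._closed:
                return  # dropped: the Evaluator has already shut down
            self.sent.append(message)

    def close(self):
        with self._lock:
            self._closed = True


def run_evaluator(close_on_failure):
    """Model one Evaluator run; returns the messages the Driver would see."""
    heartbeats = HeartbeatManager()
    task_may_finish = threading.Event()

    def user_task():
        # Fire-and-forget task: it runs independently of the main thread
        # and reports completion whenever it finishes.
        task_may_finish.wait()
        heartbeats.send("CompletedTask")

    worker = threading.Thread(target=user_task)
    worker.start()

    try:
        # Assumed failure on the main Evaluator thread.
        raise RuntimeError("poisoned evaluator start handler")
    except RuntimeError:
        heartbeats.send("FailedEvaluator")
        if close_on_failure:
            # Models the fix: shut the heartbeat channel once the failure
            # surfaces, so a late task completion cannot be reported.
            heartbeats.close()

    # Let the task finish only after the failure has been handled.
    task_may_finish.set()
    worker.join()
    return heartbeats.sent


if __name__ == "__main__":
    print(run_evaluator(close_on_failure=False))
    print(run_evaluator(close_on_failure=True))
```

Without the close step, the model reproduces the reported symptom: the Driver sees a failed evaluator followed by a completed task. With the close step, the late completion message is dropped.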
Issue Links
- is related to REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)