Description
Currently, if an evaluator fails while tasks are still being submitted, the newly created tasks block in WaitForRegistration during group communication initialization until they time out.
One way to fix this is to cancel any task that is still being constructed. The problem is that the driver has not yet received an IRunningTask for such a task, so the current system gives it no way to send the task an event.
Another way is to add a context layer for group communication initialization. The Driver/GroupCommuDriver would use context events to track whether all such contexts have been created, and would submit tasks on those contexts only once they all exist. This keeps control of group communication in a centralized place. It would also make task initialization much quicker and reduce the chance of failures in the task constructor before the task is running.
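The context-layer approach can be illustrated with a minimal sketch of the driver-side bookkeeping. This is not REEF code: `ContextGate`, `onContextActive`, `onContextFailed`, and the string IDs are hypothetical names chosen for illustration, and the real driver would react to REEF context and evaluator events instead. The sketch only shows the gating idea: no task is submitted until every expected context is active, so an evaluator failure during setup never leaves a task waiting for registration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical driver-side gate; an illustration, not part of the REEF API. */
final class ContextGate {
    private final int expectedContexts;
    private final Set<String> activeContexts = new HashSet<>();
    private final List<String> submittedTasks = new ArrayList<>();
    private boolean submitted = false;

    ContextGate(int expectedContexts) {
        this.expectedContexts = expectedContexts;
    }

    /** Called when a group communication context becomes active. */
    synchronized void onContextActive(String contextId) {
        activeContexts.add(contextId);
        // Submit tasks only once every expected context exists.
        if (!submitted && activeContexts.size() == expectedContexts) {
            submitted = true;
            for (String id : activeContexts) {
                submittedTasks.add("task-on-" + id); // stand-in for a real task submission
            }
        }
    }

    /** Called when an evaluator (and therefore its context) fails before submission. */
    synchronized void onContextFailed(String contextId) {
        if (!submitted) {
            // Nothing has been submitted yet, so no task is left waiting for registration.
            activeContexts.remove(contextId);
        }
    }

    /** Returns a copy of the tasks submitted so far. */
    synchronized List<String> submittedTasks() {
        return new ArrayList<>(submittedTasks);
    }
}
```

In this sketch, a failure before submission only shrinks the set of active contexts; the driver can then request a replacement evaluator, and submission happens when the count is complete again. Handling failures after submission is outside the scope of the sketch.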
Issue Links
- Is contained by: REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)