Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- None
Description
Handle communication between the driver and the evaluators for evaluator and task recovery when some evaluators fail. The following example describes the control flow. Steps a to d cover the normal scenario, and steps e to k cover recovery after an evaluator fails:
a. All task, context, and task status information is maintained in the Task Manager when the tasks are first created.
b. Task1, Task2, and Task3 are queued in the Task Starter.
c. When all tasks in a group are ready, the tasks are submitted.
d. When the tasks start running, their status is updated in the Task Manager.
e. Evaluator 3 fails.
f. The driver receives the failed evaluator event and reports it to the Evaluator Manager.
g. The Task Manager updates the task status to mark Task3 as failed.
h. The driver sends a message to Task1 and Task2 to stop them and updates their status in the Task Manager.
i. The driver requests a new evaluator3’ to replace the failed evaluator, submits a new context3’ to it, and adds a new task3’ to the queue.
j. The driver recreates task1’ and task2’ with the existing context1 and context2 and adds them to the queue.
k. When all the new tasks in the communication group are ready, the tasks are started as in step c.
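The flow above can be sketched as a small state model. This is only an illustration under stated assumptions, not the REEF implementation: the `Driver`, `TaskInfo`, and `TaskState` names are hypothetical, the Task Manager and Task Starter are reduced to a dictionary and a list, and a plain `'` suffix stands in for the primed names (task3’, context3’, evaluator3’).

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"


@dataclass
class TaskInfo:
    task_id: str
    context_id: str
    evaluator_id: str
    state: TaskState = TaskState.QUEUED


@dataclass
class Driver:
    """Hypothetical stand-in for the driver; not a REEF class."""
    group_size: int
    tasks: dict = field(default_factory=dict)  # "Task Manager": task_id -> TaskInfo
    queue: list = field(default_factory=list)  # "Task Starter": ids waiting for submission

    def add_task(self, task_id, context_id, evaluator_id):
        # Steps a-b: record the task and queue it.
        self.tasks[task_id] = TaskInfo(task_id, context_id, evaluator_id)
        self.queue.append(task_id)
        self._submit_if_group_ready()

    def _submit_if_group_ready(self):
        # Steps c-d (and k): submit only when the whole group is queued.
        if len(self.queue) == self.group_size:
            for task_id in self.queue:
                self.tasks[task_id].state = TaskState.RUNNING
            self.queue.clear()

    def on_failed_evaluator(self, evaluator_id):
        # Steps f-g: mark the running task on the failed evaluator as failed.
        failed = [t for t in self.tasks.values()
                  if t.evaluator_id == evaluator_id and t.state is TaskState.RUNNING]
        for t in failed:
            t.state = TaskState.FAILED
        # Step h: stop the tasks that are still running.
        survivors = [t for t in self.tasks.values() if t.state is TaskState.RUNNING]
        for t in survivors:
            t.state = TaskState.STOPPED
        # Step i: new evaluator, new context, and new task for each failed task.
        for t in failed:
            self.add_task(t.task_id + "'", t.context_id + "'", t.evaluator_id + "'")
        # Step j: recreate the stopped tasks on their existing contexts.
        for t in survivors:
            self.add_task(t.task_id + "'", t.context_id, t.evaluator_id)


driver = Driver(group_size=3)
for i in (1, 2, 3):
    driver.add_task(f"task{i}", f"context{i}", f"evaluator{i}")
driver.on_failed_evaluator("evaluator3")  # step e
```

After the simulated failure, `task3` is marked failed, `task1` and `task2` are stopped, and the recreated `task1'`, `task2'`, and `task3'` run together once all three are queued. Gating submission on the full group mirrors step c: no new task starts until the whole communication group is ready.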
Issue Links
- contains
  - REEF-1553 Clean up exception handling in streaming utilities (Resolved)
  - REEF-1551 Make master task id in group communication writeable (Resolved)
  - REEF-1552 Make IllegalStateException and InjectionException serializable (Resolved)
- is blocked by
  - REEF-1305 Moving the communication group creation before submitting tasks and decouple evaluator/context requests from task creation (Resolved)
  - REEF-1224 IMRU Fault Tolerance - Separate Data downloading from Task injection (Resolved)
  - REEF-1320 Creating default communication group in passive way (Resolved)
- is contained by
  - REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)