Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
-
None
Description
An executor failure results in the loss of the data blocks stored locally on that executor (e.g., in LocalFileStore or MemoryStore) and interrupts the tasks that were running at the time of the failure. Both the tasks that produced the lost blocks and the interrupted tasks then become eligible for re-execution.
Given this situation, the scheduler should determine (1) which tasks actually need to be re-executed, and (2) in what order to re-execute them. For example, if all downstream tasks of a lost block have completed and their outputs are safe, the producer of that block does not need to be retried. Tasks should also be retried in dependency order, to prevent a deadlock in which executor slots are filled with downstream tasks waiting for upstream tasks that are themselves waiting for an available slot.
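The two requirements above can be illustrated with a minimal sketch. This is not Nemo's scheduler code: the `RetryPlanner` class, the `State` enum, and the string task ids are hypothetical names introduced for illustration. It assumes the task graph is acyclic, every task has an entry in the state map, and tasks without consumers write their output to durable storage (so they are never marked `LOST`). Starting from the incomplete (interrupted or not yet run) tasks, it walks upstream and pulls in a lost producer only when some task that must run depends on it, emitting producers before consumers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RetryPlanner {
  /** Hypothetical task states for this sketch; not Nemo's actual state machine. */
  enum State { COMPLETE, LOST, INCOMPLETE }

  /**
   * @param parents maps a task id to the ids of its upstream (producer) tasks
   * @param states  maps every task id to its state after the executor failure
   * @return tasks to re-execute, with upstream tasks before downstream tasks
   */
  static List<String> plan(final Map<String, List<String>> parents,
                           final Map<String, State> states) {
    final List<String> order = new ArrayList<>();
    final Set<String> visited = new HashSet<>();
    // Sorted iteration keeps the result deterministic.
    for (final String task : new TreeSet<>(states.keySet())) {
      if (states.get(task) == State.INCOMPLETE) {
        visit(task, parents, states, visited, order);
      }
    }
    return order;
  }

  private static void visit(final String task,
                            final Map<String, List<String>> parents,
                            final Map<String, State> states,
                            final Set<String> visited,
                            final List<String> order) {
    if (!visited.add(task)) {
      return; // already scheduled for retry
    }
    for (final String parent : parents.getOrDefault(task, List.of())) {
      // A parent whose output is still available does not need to run again.
      if (states.get(parent) != State.COMPLETE) {
        visit(parent, parents, states, visited, order);
      }
    }
    // Post-order: a task is emitted only after the parents it waits on.
    order.add(task);
  }

  public static void main(final String[] args) {
    final Map<String, List<String>> parents = Map.of(
        "map", List.of("read"),
        "join", List.of("map", "side"));
    final Map<String, State> states = Map.of(
        "read", State.LOST,        // block lost, but its consumer "map" is safe
        "map", State.COMPLETE,
        "side", State.LOST,        // block lost and still needed by "join"
        "join", State.INCOMPLETE); // interrupted by the failure
    System.out.println(plan(parents, states)); // prints [side, join]
  }
}
```

In this example `read` is not retried, because its only consumer `map` has completed and its output is safe, while `side` is retried before `join`. Dispatching tasks in the returned order means a downstream task never occupies a slot while an upstream task it depends on is still waiting for one.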
Issue Links
- blocks
  - NEMO-54 Handle remote data fetch failures due to executor removal (Resolved)
  - NEMO-55 Handle NCS Master-to-Executor RPC failures (Resolved)
  - NEMO-119 Include ScheduleGroup in PhysicalPlan, Make Scheduler ScheduleGroup-topology Aware (Open)
  - NEMO-136 Rename SchedulerRunner to TaskDispatcher (Resolved)
- is blocked by
  - NEMO-49 Replace failed executor with a new executor (Resolved)
  - NEMO-52 Refactor Nemo physical plan (Resolved)
  - NEMO-122 Manage Task/Stage/Job states in one place (Resolved)
  - NEMO-123 Replace PendingTaskCollection with pointers to ScheduleGroups (Resolved)
- links to