[NEMO-50] Carefully retry tasks in the scheduler - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.1
Labels:
- pull-request-available

Description

An executor failure results in loss of local data blocks (e.g., LocalFileStore, MemoryStore), and interruption of tasks that were running at the time of failure. Then, the tasks who produced the lost blocks and the tasks that were interrupted become eligible for re-execution.

Given this situation, the scheduler should figure out (1) which tasks really need to be re-executed, and (2) in what order they should be re-executed. For example, if all downstream tasks of a lost block have completed and their outputs are safe, then we don't need to retry the producer of that lost block. We should also retry tasks in the order of their dependencies to prevent the deadlock situation where executor slots are filled with downstream tasks waiting for upstream tasks that are waiting for an available slot.

Attachments

Issue Links

blocks

NEMO-54 Handle remote data fetch failures due to executor removal

Resolved

NEMO-55 Handle NCS Master-to-Executor RPC failures

Resolved

NEMO-119 Include ScheduleGroup in PhysicalPlan, Make Scheduler ScheduleGroup-topology Aware

Open

NEMO-136 Rename SchedulerRunner to TaskDispatcher

Resolved

is blocked by

NEMO-49 Replace failed executor with a new executor

Resolved

NEMO-52 Refactor Nemo physical plan

Resolved

NEMO-122 Manage Task/Stage/Job states in one place

Resolved

NEMO-123 Replace PendingTaskCollection with pointers to ScheduleGroups

Resolved

links to

GitHub Pull Request #59

(3 is blocked by, 1 links to)

Activity

People

Assignee:: John Yang

Reporter:: John Yang

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 09/May/18 08:37

Updated:: 23/Nov/18 05:53

Resolved:: 29/Jun/18 07:40