[SPARK-9193] Avoid assigning tasks to executors under killing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4.0, 1.4.1
Fix Version/s: 1.4.2, 1.5.0
Component/s: Scheduler, Spark Core
Labels: None
Target Version/s: 1.4.2, 1.5.0

Description

Currently, when dynamic allocation kills some executors, tasks are sometimes mis-assigned to the executors being lost. Such mis-assignment causes task failures, or even a job failure if the error repeats 4 times.

The root cause is that killExecutors doesn't remove the executors being killed right away. It relies on the OnDisassociated event to refresh the active working list later. The delay depends on the cluster status (from several milliseconds to under a minute). When new tasks are scheduled during that window, they are assigned to executors that are still "active" but are being killed, and the tasks then fail with "executor lost". A better approach is to exclude the executors being killed in makeOffers(), so that no tasks are allocated to executors that are about to be lost.
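
To illustrate the proposed behavior, here is a minimal toy model in Python. It is only a sketch of the idea, not the Spark code or the patch in the linked pull request: the class `SchedulerBackend`, the field `pending_removal`, and the method names are hypothetical stand-ins for killExecutors, the OnDisassociated handler, and makeOffers().

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerBackend:
    """Toy scheduler backend; every name here is hypothetical."""

    # executor id -> free cores, for executors that are still registered
    executors: dict = field(default_factory=dict)
    # executors asked to shut down that have not disconnected yet
    pending_removal: set = field(default_factory=set)

    def kill_executors(self, executor_ids):
        # Mark the executors immediately instead of waiting for the
        # disconnect event to arrive.
        for executor_id in executor_ids:
            if executor_id in self.executors:
                self.pending_removal.add(executor_id)

    def on_disassociated(self, executor_id):
        # The (possibly late) disconnect event finally drops the executor.
        self.executors.pop(executor_id, None)
        self.pending_removal.discard(executor_id)

    def make_offers(self):
        # Offer resources only from executors that are not being killed.
        return {
            executor_id: cores
            for executor_id, cores in self.executors.items()
            if executor_id not in self.pending_removal
        }
```

In this model, an executor passed to `kill_executors` stops appearing in `make_offers()` at once, even though it stays in `executors` until `on_disassociated` runs. That closes the window in which new tasks could land on an executor that is about to be lost.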

Issue Links

links to

[Github] Pull Request #7528 (GraceH)

People

Assignee: Jie Huang

Reporter: Jie Huang

Votes: 0

Watchers: 4

Dates

Created: 20/Jul/15 13:00

Updated: 17/May/20 17:48

Resolved: 21/Jul/15 16:36