[SPARK-19326] Speculated task attempts do not get launched in few scenarios - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.2, 2.1.0
Fix Version/s: 2.3.0
Component/s: Scheduler, Spark Core
Labels:
None

Description

Speculated copies of tasks do not get launched in some cases.

Examples:

All the running executors have no CPU slots left to accommodate a speculated copy of the task(s). If all the running executors reside on a set of slow / bad hosts, they will keep the job running for a long time.
`spark.task.cpus` > 1 and the running executor has not filled up all its CPU slots. The free slots do not help, because a speculated copy of a task should run on a different host, not the host where the first copy was launched.
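
The two scenarios above can be illustrated with a toy model of the placement rule. This is only a sketch under stated assumptions, not Spark's scheduler code: `ExecutorSlot`, `SpeculativePlacementSketch`, and `eligibleExecutors` are hypothetical names, and the only rules modelled are the ones described in this issue (enough free cores, and a different host from the first attempt).

```scala
// Toy model of the placement rule described above; all names are hypothetical
// and none of this is taken from Spark's scheduler.
final case class ExecutorSlot(id: String, host: String, freeCores: Int)

object SpeculativePlacementSketch {
  // An executor can take the speculated copy only if it has enough free cores
  // and does not sit on the host that is running the first attempt.
  def eligibleExecutors(
      executors: Seq[ExecutorSlot],
      originalHost: String,
      taskCpus: Int): Seq[ExecutorSlot] =
    executors.filter(e => e.freeCores >= taskCpus && e.host != originalHost)

  def main(args: Array[String]): Unit = {
    // First scenario: another host exists, but no executor has a free slot.
    val allFull = Seq(
      ExecutorSlot("exec-1", "host-a", freeCores = 0),
      ExecutorSlot("exec-2", "host-b", freeCores = 0))
    println(eligibleExecutors(allFull, originalHost = "host-a", taskCpus = 1))

    // Second scenario: free slots exist, but only on the first attempt's host.
    val sameHostOnly = Seq(ExecutorSlot("exec-1", "host-a", freeCores = 2))
    println(eligibleExecutors(sameHostOnly, originalHost = "host-a", taskCpus = 2))
  }
}
```

In both scenarios the list of eligible executors is empty, so the speculated copy can only start if a new executor is requested on another host.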

In both these cases, `ExecutorAllocationManager` does not know about pending speculation task attempts and thinks that all the resource demands are well taken care of. (relevant code)
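
To make the gap concrete, here is a minimal sketch of demand-based executor sizing. It is a simplified model and not the actual `ExecutorAllocationManager` code: the object name, the `executorsNeeded` helper, and its parameters are assumptions made for illustration. It compares the executor target when pending speculated attempts are ignored with the target when they are counted as demand.

```scala
// Simplified model of demand-based executor sizing; hypothetical names,
// not the actual ExecutorAllocationManager implementation.
object ExecutorDemandSketch {
  // Ceiling division of total task demand by the task slots one executor offers.
  def executorsNeeded(
      runningTasks: Int,
      pendingTasks: Int,
      pendingSpeculativeTasks: Int,
      executorCores: Int,
      taskCpus: Int): Int = {
    val tasksPerExecutor = executorCores / taskCpus
    require(tasksPerExecutor >= 1, "an executor must fit at least one task")
    val demand = runningTasks + pendingTasks + pendingSpeculativeTasks
    (demand + tasksPerExecutor - 1) / tasksPerExecutor
  }

  def main(args: Array[String]): Unit = {
    // Two 4-core executors, both fully occupied by 8 running tasks.
    val ignoringSpeculation = executorsNeeded(8, 0, 0, executorCores = 4, taskCpus = 1)
    // Same state, but one pending speculated attempt is counted as demand.
    val countingSpeculation = executorsNeeded(8, 0, 1, executorCores = 4, taskCpus = 1)
    println(s"ignoring speculation: $ignoringSpeculation, counting speculation: $countingSpeculation")
  }
}
```

With these numbers the target stays at 2 executors when speculation is ignored and rises to 3 when the pending speculated attempt is counted, which is the kind of extra demand the allocation manager never sees in the cases described above.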

This adds variation in job completion times and, more importantly, causes SLA misses. In prod, with a large number of jobs, I see this happening more often than one would think. Chasing the bad hosts or the reason for slowness doesn't scale.

Here is a tiny repro. Note that you need to launch it on a cluster manager (Mesos, YARN, or standalone deploy mode) with `--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`

val n = 100
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) => {
  if (index == 1) {
    Thread.sleep(Long.MaxValue)  // fake long running task(s)
  }
  it.toList.map(x => index + ", " + x).iterator
}).collect

Issue Links

causes

SPARK-28403 Executor Allocation Manager can add an extra executor when speculative tasks (Resolved)

links to

[Github] Pull Request #18492 (janewangfb)

People

Assignee: Unassigned

Reporter: Tejas Patil

Votes: 0

Watchers: 9

Dates

Created: 22/Jan/17 03:45

Updated: 17/May/20 17:46

Resolved: 23/Aug/17 03:32