Spark / SPARK-19326

Speculated task attempts do not get launched in a few scenarios

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: 2.3.0
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      Speculated copies of tasks do not get launched in some cases.

      Examples:

      • All the running executors have no CPU slots left to accommodate a speculated copy of the task(s). If all the running executors reside on a set of slow / bad hosts, they will keep the job running for a long time.
      • `spark.task.cpus` > 1 and the running executor has not filled up all its CPU slots. The free slots do not help, since a speculated copy of a task should run on a different host, not the host where the first copy was launched.
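
      The two cases above can be sketched as a small eligibility check. This is a simplified, hypothetical model: `ExecutorSlot`, `eligible`, and the host names are illustrative assumptions, not Spark's scheduler classes. It only shows that a speculated copy needs an executor with enough free cores on a host other than the original attempt's host.

```scala
// Hypothetical model; these types are illustrative, not Spark's scheduler classes.
final case class ExecutorSlot(id: String, host: String, freeCores: Int)

object SpeculationEligibilitySketch {
  // An executor can take a speculated copy only if it has enough free cores
  // and sits on a different host from the original attempt.
  def eligible(
      executors: Seq[ExecutorSlot],
      originalHost: String,
      taskCpus: Int): Seq[ExecutorSlot] =
    executors.filter(e => e.freeCores >= taskCpus && e.host != originalHost)

  def main(args: Array[String]): Unit = {
    // Case 1: every running executor is full.
    val allFull = Seq(ExecutorSlot("1", "hostA", 0), ExecutorSlot("2", "hostB", 0))
    println(eligible(allFull, originalHost = "hostA", taskCpus = 1).isEmpty)

    // Case 2: free cores exist, but only on the host of the original attempt.
    val sameHost = Seq(ExecutorSlot("1", "hostA", 2))
    println(eligible(sameHost, originalHost = "hostA", taskCpus = 2).isEmpty)
  }
}
```

      In both cases the candidate list is empty, so under this model the speculated copy waits until a new executor appears on another host.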

      In both these cases, `ExecutorAllocationManager` does not know about pending speculation task attempts and thinks that all the resource demands are well taken care of. (relevant code)
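
      To illustrate why ignoring pending speculative attempts leaves the job short of executors, here is a minimal sketch of a demand calculation. It assumes demand is derived from the task count and the number of tasks that fit on one executor; `executorsNeeded` and the other names are hypothetical and are not taken from `ExecutorAllocationManager`.

```scala
// Hypothetical, simplified model of an executor-demand calculation.
// The names here are illustrative and are not Spark's actual API.
object ExecutorDemandSketch {
  // Number of tasks one executor can run concurrently.
  // Assumes executorCores >= taskCpus, so the result is at least 1.
  def tasksPerExecutor(executorCores: Int, taskCpus: Int): Int =
    executorCores / taskCpus

  // Executors needed to run `tasks` concurrently, rounded up.
  def executorsNeeded(tasks: Int, executorCores: Int, taskCpus: Int): Int = {
    val perExecutor = tasksPerExecutor(executorCores, taskCpus)
    (tasks + perExecutor - 1) / perExecutor
  }

  def main(args: Array[String]): Unit = {
    val executorCores = 4
    val taskCpus = 1
    val running = 4            // a single executor, fully occupied
    val pendingSpeculative = 1 // copy of the slow task waiting for a slot

    val ignoring = executorsNeeded(running, executorCores, taskCpus)
    val counting = executorsNeeded(running + pendingSpeculative, executorCores, taskCpus)
    println(s"demand ignoring speculative tasks: $ignoring")
    println(s"demand counting speculative tasks: $counting")
  }
}
```

      With these numbers the demand is one executor when the speculative attempt is ignored and two when it is counted, which is the gap described above.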

      This adds variation to job completion times and, more importantly, causes SLA misses. In prod, with a large number of jobs, I see this happening more often than one would think. Chasing the bad hosts or the reason for slowness doesn't scale.

      Here is a tiny repro. Note that you need to launch it on a cluster manager (Mesos, YARN, or standalone deploy mode) with `--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`:

      val n = 100
      val someRDD = sc.parallelize(1 to n, n)
      someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
        if (index == 1) {
          Thread.sleep(Long.MaxValue)  // fake long running task(s)
        }
        it.toList.map(x => index + ", " + x).iterator
      }).collect
      

      People

        • Assignee: Unassigned
        • Reporter: Tejas Patil (tejasp)