Spark / SPARK-6325

YarnAllocator crash with dynamic allocation on


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Run spark-shell like this:

      spark-shell --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.enabled=true  \
      --conf spark.dynamicAllocation.minExecutors=1  \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --conf spark.dynamicAllocation.executorIdleTimeout=10  \
      --conf spark.dynamicAllocation.schedulerBacklogTimeout=5  \
      --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5
      

      Then run this simple test:

      scala> val verySmallRdd = sc.parallelize(1 to 10, 10).map { i => 
           |   if (i % 2 == 0) { Thread.sleep(30 * 1000); i } else 0
           | }
      verySmallRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:21
      
      scala> verySmallRdd.collect()
      

      When Spark starts ramping down the number of allocated executors, it will hit an assert in YarnAllocator.scala:

      assert(targetNumExecutors >= 0, "Allocator killed more executors than are allocated!")
      

      This assert causes the Akka backend to die, but not the AM itself, so the app ends up in a zombie-like state where the driver is alive but can't talk to the AM. Sadness ensues.
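
      For intuition, here is a minimal, self-contained model of how the count can go negative. With executorIdleTimeout=10, the executors whose tasks return immediately go idle and are reclaimed while the sleeping tasks are still running, so the driver lowers its target at the same time as it issues explicit kills. The names below mirror the real YarnAllocator, but the bodies are simplified assumptions, not the actual code:

      object AllocatorModel {
        var targetNumExecutors = 10

        // The driver syncs the allocator to a lower target once executors go idle.
        def requestTotalExecutors(n: Int): Unit = {
          targetNumExecutors = n
        }

        // The allocator also decrements the target for every explicit kill,
        // double-counting the same ramp-down.
        def killExecutor(): Unit = {
          targetNumExecutors -= 1
        }

        def main(args: Array[String]): Unit = {
          // First wave: the 5 fast tasks finish and their executors go idle.
          requestTotalExecutors(5)               // driver lowers target: 10 -> 5
          (1 to 5).foreach(_ => killExecutor())  // kills for those same executors: 5 -> 0
          // Second wave: the sleeping tasks finish, ramp down toward minExecutors=1.
          requestTotalExecutors(1)               // 0 -> 1
          (1 to 4).foreach(_ => killExecutor())  // 1 -> -3
          assert(targetNumExecutors >= 0,
            "Allocator killed more executors than are allocated!") // fires
        }
      }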

      I have a working fix, just need to add unit tests. Stay tuned.
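
      For illustration, two sketches of guards that would avoid the double-counting in the model above (not necessarily the committed patch):

      // Option 1: make requestTotalExecutors the single source of truth and
      // leave the target untouched when killing, so the two ramp-down paths
      // cannot double-count.
      def killExecutorNoDecrement(): Unit = {
        // send the kill request to YARN here; targetNumExecutors already
        // reflects the driver's requestTotalExecutors() call
      }

      // Option 2: a purely defensive clamp, so a stray kill can never drive
      // the bookkeeping negative and crash the AM.
      def killExecutorClamped(): Unit = {
        targetNumExecutors = math.max(0, targetNumExecutors - 1)
      }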

      Thanks to wypoon for finding the problem, and for the test case.


          People

            Assignee: Marcelo Masiero Vanzin (vanzin)
            Reporter: Marcelo Masiero Vanzin (vanzin)
            Votes: 0
            Watchers: 4
