Spark / SPARK-6492

SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0, 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: Spark Core
    • Labels: None

    Description

      A deadlock can occur when the death of the DAGScheduler event loop causes the SparkContext to be shut down while user code is concurrently trying to stop the same SparkContext in a finally block.

      For example:

      var sc: SparkContext = null
      try {
        sc = new SparkContext("local", "test")
        // start running a job that causes the DAGSchedulerEventProcessLoop to crash
        someRDD.doStuff()
      } finally {
        // stop the SparkContext once the failure in DAGScheduler causes the above job to fail with an exception
        if (sc != null) sc.stop()
      }
      

      This leads to a deadlock. The event processor thread calls SparkContext.stop() from onError, tries to acquire SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK, and blocks, because the thread that holds that lock is inside its own stop() call, waiting to join the event processor thread:

      "dag-scheduler-event-loop" daemon prio=5 tid=0x00007ffa69456000 nid=0x9403 waiting for monitor entry [0x00000001223ad000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1398)
      	- waiting to lock <0x00000007f5037b08> (a java.lang.Object)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412)
      	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52)
      
      "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 tid=0x00007ffa69864800 nid=0x5903 in Object.wait() [0x00000001202dc000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	- waiting on <0x00000007f4b28000> (a org.apache.spark.util.EventLoop$$anon$1)
      	at java.lang.Thread.join(Thread.java:1281)
      	- locked <0x00000007f4b28000> (a org.apache.spark.util.EventLoop$$anon$1)
      	at java.lang.Thread.join(Thread.java:1355)
      	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79)
      	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1405)
      	- locked <0x00000007f5037b08> (a java.lang.Object)
      [...]
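
      The cycle in the two traces above (one thread holds the lock and joins the event loop thread, while the event loop thread waits for that same lock) can be avoided if stop() never holds a monitor across the join. The following is only a minimal sketch of that idea and is not taken from the Spark code base: ToyContext, failEventLoop, isStopped, and the thread name "toy-event-loop" are hypothetical names, and using an AtomicBoolean guard is an assumption about one possible approach rather than a description of the fix that was committed.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for SparkContext; this is not Spark's real code.
class ToyContext {
  private val stopped = new AtomicBoolean(false)
  private val crash = new CountDownLatch(1)

  // Stand-in for the "dag-scheduler-event-loop" thread: it waits for a
  // simulated failure and then, like onError in the trace, calls stop().
  private val eventThread = new Thread(new Runnable {
    override def run(): Unit = {
      crash.await()
      ToyContext.this.stop()
    }
  }, "toy-event-loop")
  eventThread.setDaemon(true)
  eventThread.start()

  // Simulates the event loop dying while a job is running.
  def failEventLoop(): Unit = crash.countDown()

  // No lock is held while joining. Only the caller that wins the
  // compareAndSet performs the shutdown; every other caller returns at once.
  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      crash.countDown() // release the event thread if it is still waiting
      // The event thread must not join itself.
      if (Thread.currentThread() ne eventThread) {
        eventThread.join()
      }
    }
  }

  def isStopped: Boolean = stopped.get()
}

object ToyContextDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new ToyContext
    try {
      ctx.failEventLoop() // the event thread will now call stop() on its own
    } finally {
      ctx.stop() // races with the event thread; neither side waits on a lock
    }
    println(s"stopped = ${ctx.isStopped}")
  }
}
```

      In this sketch, whichever thread reaches stop() first does the shutdown work, and the loser of the race returns immediately instead of blocking, so the "holds the lock while joining" half of the cycle never forms.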
      

          People

            ilganeli Ilya Ganelin
            joshrosen Josh Rosen