Spark / SPARK-22714

Spark API not responding when a fatal exception occurs in the event loop


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      To reproduce, make Spark throw an OutOfMemoryError in the event loop:

      scala> spark.sparkContext.getConf.get("spark.driver.memory")
      res0: String = 1g
      scala> val a = new Array[Int](4 * 1000 * 1000)
      scala> val ds = spark.createDataset(a)
      scala> ds.rdd.zipWithIndex
      [Stage 0:>                                                          (0 + 0) / 3]Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space
      [Stage 0:>                                                          (0 + 0) / 3]
      // Spark is not responding
      

      While hung, Spark is waiting on a Promise that never completes. The promise can only be completed by work running on the event-loop thread, but that thread is already dead because of the fatal exception. The thread dump below shows the main thread parked on that promise; a minimal stand-alone sketch of the same pattern follows the dump.

      "main" #1 prio=5 os_prio=31 tid=0x00007ffc9300b000 nid=0x1703 waiting on condition [0x0000700000216000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000007ad978eb8> (a scala.concurrent.impl.Promise$CompletionLatch)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
              at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
              at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
              at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50)
              at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
              at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
              at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
              at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
              at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
              at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
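
      A stand-alone illustration of the same hang pattern (hypothetical code, not taken from Spark): the waiter blocks forever because the only thread that could complete the promise dies from a fatal error before ever completing it.

      import scala.concurrent.{Await, Promise}
      import scala.concurrent.duration.Duration

      val promise = Promise[Int]()
      val eventLoop = new Thread(new Runnable {
        def run(): Unit = {
          // Simulate the fatal error in the event-loop thread:
          // the promise is never completed (and never failed).
          throw new OutOfMemoryError("simulated fatal error")
        }
      })
      eventLoop.start()
      // The main thread parks here forever, just like DAGScheduler.runJob above.
      Await.result(promise.future, Duration.Inf)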
      

      I don't know how to fix this properly, but it seems we need to add fatal-error handling to EventLoop.run() in core/EventLoop.scala. A rough idea of what that could look like is sketched below.
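
      One possible shape of that handling, written as a simplified, self-contained event loop rather than the real org.apache.spark.util.EventLoop (the class name SketchEventLoop is made up, and this assumes the current run loop only catches non-fatal errors, so a fatal one silently kills the thread):

      import java.util.concurrent.LinkedBlockingDeque
      import java.util.concurrent.atomic.AtomicBoolean
      import scala.util.control.NonFatal

      abstract class SketchEventLoop[E](name: String) {
        private val eventQueue = new LinkedBlockingDeque[E]()
        private val stopped = new AtomicBoolean(false)

        protected def onReceive(event: E): Unit
        protected def onError(e: Throwable): Unit

        private val eventThread = new Thread(name) {
          override def run(): Unit = {
            try {
              while (!stopped.get) {
                val event = eventQueue.take()
                try {
                  onReceive(event)
                } catch {
                  // Non-fatal errors are already routed to onError today.
                  case NonFatal(e) => onError(e)
                }
              }
            } catch {
              case _: InterruptedException => // normal shutdown
              case t: Throwable =>
                // Proposed addition: also report fatal errors (e.g. OutOfMemoryError)
                // so callers blocked on promises can be failed instead of hanging.
                onError(t)
                throw t
            }
          }
        }

        def start(): Unit = { eventThread.setDaemon(true); eventThread.start() }
        def stop(): Unit = { stopped.set(true); eventThread.interrupt() }
        def post(event: E): Unit = eventQueue.put(event)
      }

      With something like this, the scheduler's onError callback could fail outstanding job promises (or stop the SparkContext) when the event thread dies, so runJob would return an error instead of parking forever.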

          People

            Unassigned Unassigned
            todesking todesking
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue
