Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.2.0
-
None
Description
To reproduce, let Spark to throw an OOM Exception in event loop:
scala> spark.sparkContext.getConf.get("spark.driver.memory") res0: String = 1g scala> val a = new Array[Int](4 * 1000 * 1000) scala> val ds = spark.createDataset(a) scala> ds.rdd.zipWithIndex [Stage 0:> (0 + 0) / 3]Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space [Stage 0:> (0 + 0) / 3] // Spark is not responding
While not responding, Spark waiting for some Promise, but is never done.
The promise depends some process in event loop thread, but the thread is dead when Fatal exception is thrown.
"main" #1 prio=5 os_prio=31 tid=0x00007ffc9300b000 nid=0x1703 waiting on condition [0x0000700000216000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007ad978eb8> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:50) at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293) at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
I don't know how to fix it properly, but it seems we need to add Fatal error handling to EventLoop.run() in core/EventLoop.scala