Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version: 1.0.2
None
Environment: Linux 3.2.0-23-generic x86_64
Description
The issue occurs when Spark runs in standalone mode on a cluster.
When the master and the driver go down simultaneously on the same cluster node, the master tries to recover its state and restart the Spark driver.
While restarting the driver, the master fails with a NullPointerException (stack trace below).
After failing, the master restarts, tries to recover its state, and attempts to restart the Spark driver again. This repeats in an infinite loop.
Specifically, the master reads the DriverInfo state from ZooKeeper, but in the recovered object DriverInfo.worker is null.
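One way a field can come back null after recovery is sketched below. It assumes the persisted state is written with plain Java serialization and that the worker reference is marked transient; neither detail is confirmed in this report. `DriverState`, `TransientDemo`, and `roundTrip` are hypothetical names, not Spark classes.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for persisted driver state; not Spark's actual DriverInfo.
class DriverState(val id: String) extends Serializable {
  // Assumption: the worker reference is transient, so it is never written out.
  @transient var worker: Option[String] = None
}

object TransientDemo {
  // Serialize a value to bytes and read it back with plain Java serialization.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = new DriverState("driver-1")
    val recovered = roundTrip(original)
    // The constructor ran for the original, so the field holds None.
    println(s"original.worker  = ${original.worker}")
    // Deserialization skips the constructor body, so the transient field
    // keeps the JVM default for references, which is null.
    println(s"recovered.worker = ${recovered.worker}")
  }
}
```

In this sketch the initializer `= None` only runs when the object is constructed normally, which is why the recovered copy differs from a freshly created one.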
Stack trace (from version 1.0.0, but also reproducible on version 1.0.2)
[2014-08-14 21:44:59,519] ERROR (akka.actor.OneForOneStrategy)
java.lang.NullPointerException
at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
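The trace points at a filter applied to a set inside completeRecovery. The Master.scala source is not included in this report, so the following is only a sketch of how a filter predicate can be made tolerant of a null field. `RecoveredDriver` and `RecoveryFilter` are hypothetical names, and the assumption is that `worker` may be null on recovered entries.

```scala
// Hypothetical record; `worker` may be null on entries restored from persisted state.
final case class RecoveredDriver(id: String, worker: Option[String])

object RecoveryFilter {
  // Treats a null `worker` the same as None instead of dereferencing it.
  def hasNoWorker(d: RecoveredDriver): Boolean =
    d.worker == null || d.worker.isEmpty

  // Returns the ids of drivers that have no worker assigned.
  def driversWithoutWorker(drivers: Iterable[RecoveredDriver]): List[String] =
    drivers.filter(hasNoWorker).map(_.id).toList
}
```

A predicate written as `_.worker.isEmpty` alone would throw a NullPointerException on the first entry whose `worker` is null, which matches the shape of the failure above.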
How to reproduce: while running Spark standalone on a cluster, kill all Spark processes (driver, master, and worker) simultaneously on the node where the driver runs.