SPARK-4609

Job cannot finish if there is one bad slave in the cluster


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      If there is one bad machine in the cluster, its executors will keep dying (for example, when the disk runs out of space), so some tasks may be scheduled to that machine multiple times, and the job will eventually fail after repeated failures of a single task.

      14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
      14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 lost)
      14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
      14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 lost)
      14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
      14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 lost)
      14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
      14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
      Driver stacktrace:
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      	at scala.Option.foreach(Option.scala:236)
      	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
      	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
      	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
      	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
      	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
      	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
      	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
      	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
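
      In the log above, the job is aborted after the fourth failure of task 39; that threshold is the spark.task.maxFailures setting, which defaults to 4. As a stopgap (not a fix), the failure budget can be raised when building the context. A minimal sketch, assuming only stock Spark configuration:

        import org.apache.spark.{SparkConf, SparkContext}

        // Stopgap only: a larger failure budget gives the scheduler more
        // chances to land a retry on a healthy machine, but nothing here
        // stops it from picking the same bad host again.
        val conf = new SparkConf()
          .setAppName("bad-slave-workaround")
          .set("spark.task.maxFailures", "16")  // default is 4
        val sc = new SparkContext(conf)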
      

      A task should not be scheduled to the same machine more than once. Also, if a machine fails with executor lost, it should be put in a blacklist for some time, then tried again.
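
      A minimal sketch of such a blacklist, assuming a hypothetical HostBlacklist helper inside the scheduler; the class name, fields, and expiry policy are illustrative, not Spark's actual implementation:

        import scala.collection.mutable

        // Hypothetical helper, not Spark code: remembers the last
        // executor-lost failure per host and blacklists that host for
        // expiryMs milliseconds.
        class HostBlacklist(expiryMs: Long) {
          private val lastFailure = mutable.Map[String, Long]()

          // Record an ExecutorLostFailure observed on the given host.
          def reportFailure(host: String, now: Long = System.currentTimeMillis()): Unit =
            lastFailure(host) = now

          // The scheduler would skip blacklisted hosts when offering a
          // retry, so task 39 above would land on a different machine;
          // once expiryMs has passed, the host becomes eligible again
          // ("then try again", as proposed above).
          def isBlacklisted(host: String, now: Long = System.currentTimeMillis()): Boolean =
            lastFailure.get(host).exists(now - _ < expiryMs)
        }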

      cc kayousterhout matei

People

    • Assignee: Unassigned
    • Reporter: Davies Liu (davies)