[SPARK-14527] Job can't finish when all NodeManagers are restarted while using the external shuffle service - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: Shuffle, Spark Core, YARN
Labels: None

Description

1) Submit a wordcount app
2) Stop all NodeManagers while the ShuffleMapStage is running
3) After a few minutes, start all NodeManagers again

The job now fails at the ResultStage, retries the ShuffleMapStage, and then fails at the ResultStage again. It keeps running in this loop and never finishes.

This happens because stopping all NMs leaves the containers alive but loses the executor info stored on the NM (in YarnShuffleService). Even after all the NMs recover, tasks in the ResultStage fail when they fetch shuffle data:

16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, mapId=4, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1459927459378_0005, execId=3)
...
16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have all completed, from pool
16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at wordcountWithSave.scala:32) due to fetch failure
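
The failure loop described above can be illustrated with a toy model. This is only a sketch, not Spark or YARN code: ToyShuffleService and fetch_with_retries are hypothetical names, and the model assumes that registrations are held only in memory and that a still-running executor does not register again after the service restarts.

```python
APP_ID = "application_1459927459378_0005"  # appId taken from the log above


class ToyShuffleService:
    """Hypothetical stand-in for the shuffle service hosted by one NM."""

    def __init__(self):
        # Assumption: registrations live only in memory, so a restart
        # (modelled as creating a new object) starts with an empty set.
        self.registered = set()

    def register_executor(self, app_id, exec_id):
        self.registered.add((app_id, exec_id))

    def fetch_block(self, app_id, exec_id, shuffle_id, map_id, reduce_id):
        if (app_id, exec_id) not in self.registered:
            raise RuntimeError(
                f"Executor is not registered (appId={app_id}, execId={exec_id})"
            )
        return f"shuffle_{shuffle_id}_{map_id}_{reduce_id}"


def fetch_with_retries(service, exec_id, attempts):
    """Return how many of `attempts` fetches fail."""
    failures = 0
    for _ in range(attempts):
        try:
            service.fetch_block(APP_ID, exec_id, shuffle_id=0, map_id=4, reduce_id=2)
        except RuntimeError:
            # In this model the stages are resubmitted, but the live executor
            # never registers again, so every retry hits the same error.
            failures += 1
    return failures


# Before the restart: the executor registered at startup, so fetches succeed.
service = ToyShuffleService()
service.register_executor(APP_ID, "3")
print(fetch_with_retries(service, "3", attempts=3))  # prints 0

# After the restart: a fresh service with no registrations; every retry fails.
service = ToyShuffleService()
print(fetch_with_retries(service, "3", attempts=5))  # prints 5
```

In this model, retrying alone never helps; the loop ends only if the registrations survive the restart or the executor registers again.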

Issue Links

duplicates

SPARK-9439 ExternalShuffleService should be robust to NodeManager restarts in yarn

Resolved

People

Assignee: Unassigned

Reporter: Weizhong

Votes: 0

Watchers: 1

Dates

Created: 11/Apr/16 07:46

Updated: 11/Apr/16 11:50

Resolved: 11/Apr/16 11:50