Spark / SPARK-12864

Fetch failure from AM restart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.2
    • Fix Version/s: 2.0.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Currently, when the number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster is killed and a new one re-registers, and a new YarnAllocator instance is created for it.
      However, the executorIdCounter in the new YarnAllocator is reset to 0, so the IDs of new executors start from 1 again. These IDs collide with executors that were created before the restart, which causes FetchFailedException.
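      The sketch below (simplified Scala, not the actual YarnAllocator code) shows how a counter kept inside the allocator produces duplicate executor IDs once a fresh allocator is created after an ApplicationMaster restart:

      // Minimal sketch: the executor-ID counter lives inside the allocator, so a
      // new allocator created after an ApplicationMaster restart counts from 1 again.
      class SimpleAllocator {
        private var executorIdCounter = 0

        def requestExecutor(): String = {
          executorIdCounter += 1
          executorIdCounter.toString   // executor IDs "1", "2", "3", ...
        }
      }

      object IdCollisionDemo extends App {
        val firstAllocator = new SimpleAllocator
        val idBeforeRestart = firstAllocator.requestExecutor()   // "1", registered with the driver

        // The AM is killed and re-registered; a fresh allocator is created.
        val secondAllocator = new SimpleAllocator
        val idAfterRestart = secondAllocator.requestExecutor()   // also "1" -- collides with the old executor

        println(s"before restart: $idBeforeRestart, after restart: $idAfterRestart")
      }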
      For example, the following is the task log:

      2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
      2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
      
      2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@BJHC-HERA-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
      
      Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
      2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
      2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
      2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      

      As the task log shows, the new executor on BJHC-HERA-16217.hadoop.jd.local was registered with the same ID (1) as the earlier executor on BJHC-HERA-17030.hadoop.jd.local. This ID collision confuses the shuffle block lookup and causes the FetchFailedException.

      This executorId conflict only occurs in yarn-client mode, because the driver does not run on YARN and therefore keeps the executors registered before the ApplicationMaster restart.
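      One possible direction, sketched below under the assumption that the surviving driver can report the highest executor ID it has seen (DriverIdTracker and lastAllocatedExecutorId are illustrative names, not the actual Spark API): a newly created allocator seeds its counter from the driver instead of starting from 0.

      // Hedged sketch of one way to avoid the collision; names are illustrative,
      // not the actual Spark API. In yarn-client mode the driver outlives the AM,
      // so a fresh allocator could ask it for the last allocated executor ID.
      trait DriverIdTracker {
        def lastAllocatedExecutorId: Int   // hypothetical driver-side query
      }

      class RestartAwareAllocator(driver: DriverIdTracker) {
        // Continue counting after the highest ID the driver has already seen.
        private var executorIdCounter = driver.lastAllocatedExecutorId

        def requestExecutor(): String = {
          executorIdCounter += 1
          executorIdCounter.toString
        }
      }

      object RestartDemo extends App {
        // The driver registered executor "1" before the AM restart.
        val driver = new DriverIdTracker { val lastAllocatedExecutorId = 1 }
        val newAllocator = new RestartAwareAllocator(driver)
        println(newAllocator.requestExecutor())   // prints "2" -- no collision
      }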


          People

            Assignee: iward
            Reporter: iward
            Votes: 0
            Watchers: 4
