Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.3.1, 1.4.1, 1.5.2
- Labels: None
Description
Currently, when the number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster is killed and a new one is registered, and a new YarnAllocator instance is created for it.
However, the executorIdCounter property of the new YarnAllocator resets to 0, so the IDs of newly requested executors start from 1 again. These IDs collide with those of executors created before the restart, which leads to FetchFailedException.
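A minimal sketch of the failure mode. The names YarnAllocator and executorIdCounter follow the Spark source, but the simplified class below is illustrative only, not the real implementation:

```scala
// Illustrative model of the bug: every new allocator instance restarts its
// ID counter from 0, so executors created after an AM restart reuse the IDs
// of executors created before it.
class SimpleAllocator {
  private var executorIdCounter = 0            // reset to 0 on every new instance
  def newExecutorId(): Int = {
    executorIdCounter += 1
    executorIdCounter
  }
}

object IdClashDemo {
  def main(args: Array[String]): Unit = {
    val firstAllocator = new SimpleAllocator
    val idBeforeRestart = firstAllocator.newExecutorId() // executor 1 on host A

    // AM fails and re-registers: a fresh allocator is created from scratch.
    val secondAllocator = new SimpleAllocator
    val idAfterRestart = secondAllocator.newExecutorId() // also 1, on host B

    // The driver now has two different hosts under the same executor ID,
    // so shuffle fetches can be routed to the wrong host.
    assert(idBeforeRestart == idAfterRestart)
    println(s"duplicate executor id: $idAfterRestart")
  }
}
```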
For example, here is the relevant task log:
2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@BJHC-HERA-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
As the task log shows, the new executor on BJHC-HERA-16217.hadoop.jd.local was registered with ID 1, the same ID already held by the executor behind BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337). Shuffle fetches for executor 1 are therefore routed to the wrong host, and the missing shuffle files cause FetchFailedException.
This executorId conflict only occurs in yarn-client mode, because the driver does not run on YARN: it survives the ApplicationMaster restart and keeps the pre-restart executor registrations, while the new YarnAllocator starts counting from 0 again.
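One way to avoid the clash is to let the driver, which survives AM restarts in yarn-client mode, remember the highest executor ID allocated so far, and have each new allocator initialize its counter from that value instead of 0. The sketch below only illustrates the idea; the object and method names are hypothetical, and the real fix would have to pass the value over the driver-AM RPC rather than through shared state:

```scala
// Hypothetical sketch: driver-side state outlives AM restarts, so a new
// allocator resumes numbering instead of restarting from 0.
object DriverState {                       // stands in for driver-side bookkeeping
  var lastAllocatedExecutorId = 0
}

class ResumingAllocator {
  // Initialize from the driver's record instead of a hard-coded 0.
  private var executorIdCounter = DriverState.lastAllocatedExecutorId
  def newExecutorId(): Int = {
    executorIdCounter += 1
    DriverState.lastAllocatedExecutorId = executorIdCounter
    executorIdCounter
  }
}

object NoClashDemo {
  def main(args: Array[String]): Unit = {
    val first = new ResumingAllocator
    val before = first.newExecutorId()     // 1
    // AM restart: the new allocator resumes from the driver's counter.
    val second = new ResumingAllocator
    val after = second.newExecutorId()     // 2 -- no ID reuse
    assert(before != after)
    println(s"ids: $before, $after")
  }
}
```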