Spark / SPARK-12864

Fetch failure from AM restart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.2
    • Fix Version/s: 2.0.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Currently, when the number of executor failures reaches maxNumExecutorFailures, the ApplicationMaster is killed and a new one re-registers, and a new YarnAllocator instance is created for it.
      However, the executorIdCounter in the new YarnAllocator is reset to 0, so the IDs of new executors start from 1 again. These IDs collide with executors that were created before the restart, which causes FetchFailedException.
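      The sketch below (simplified Scala, not the actual YarnAllocator code) shows how a counter kept inside the allocator produces duplicate executor IDs once a fresh allocator is created after an ApplicationMaster restart:

      // Minimal sketch: the executor-ID counter lives inside the allocator, so a
      // new allocator created after an ApplicationMaster restart counts from 1 again.
      class SimpleAllocator {
        private var executorIdCounter = 0

        def requestExecutor(): String = {
          executorIdCounter += 1
          executorIdCounter.toString   // executor IDs "1", "2", "3", ...
        }
      }

      object IdCollisionDemo extends App {
        val firstAllocator = new SimpleAllocator
        val idBeforeRestart = firstAllocator.requestExecutor()   // "1", registered with the driver

        // The AM is killed and re-registered; a fresh allocator is created.
        val secondAllocator = new SimpleAllocator
        val idAfterRestart = secondAllocator.requestExecutor()   // also "1" -- collides with the old executor

        println(s"before restart: $idBeforeRestart, after restart: $idAfterRestart")
      }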
      For example, the following is the task log:

      2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
      2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
      
      2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@BJHC-HERA-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
      
      Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
      2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
      2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
      2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
      2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      

      As the task log shows, the new executor on BJHC-HERA-16217.hadoop.jd.local was registered with the same ID (1) as the earlier executor on BJHC-HERA-17030.hadoop.jd.local. This ID collision confuses the shuffle block lookup and causes the FetchFailedException.

      This executorId conflict only occurs in yarn-client mode, because the driver does not run on YARN and therefore keeps the executors registered before the ApplicationMaster restart.
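      One possible direction, sketched below under the assumption that the surviving driver can report the highest executor ID it has seen (DriverIdTracker and lastAllocatedExecutorId are illustrative names, not the actual Spark API): a newly created allocator seeds its counter from the driver instead of starting from 0.

      // Hedged sketch of one way to avoid the collision; names are illustrative,
      // not the actual Spark API. In yarn-client mode the driver outlives the AM,
      // so a fresh allocator could ask it for the last allocated executor ID.
      trait DriverIdTracker {
        def lastAllocatedExecutorId: Int   // hypothetical driver-side query
      }

      class RestartAwareAllocator(driver: DriverIdTracker) {
        // Continue counting after the highest ID the driver has already seen.
        private var executorIdCounter = driver.lastAllocatedExecutorId

        def requestExecutor(): String = {
          executorIdCounter += 1
          executorIdCounter.toString
        }
      }

      object RestartDemo extends App {
        // The driver registered executor "1" before the AM restart.
        val driver = new DriverIdTracker { val lastAllocatedExecutorId = 1 }
        val newAllocator = new RestartAwareAllocator(driver)
        println(newAllocator.requestExecutor())   // prints "2" -- no collision
      }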


          People

            Assignee: iward
            Reporter: iward
            Votes: 0
            Watchers: 4
