Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-14485

Task finished cause fetch failure when its executor has already been removed by driver

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Won't Fix
    • 1.3.1, 1.5.2
    • None
    • Spark Core
    • None

    Description

      Now, when executor is removed by driver with heartbeats timeout, driver will re-queue the task on this executor and send a kill command to cluster to kill this executor.
      But, in a situation, the running task of this executor is finished and return result to driver before this executor killed by kill command sent by driver. At this situation, driver will accept the task finished event and ignore speculative task and re-queued task. But, as we know, this executor has removed by driver, the result of this finished task can not save in driver because the BlockManagerId has also removed from BlockManagerMaster by driver. So, the result data of this stage is not complete, and then, it will cause fetch failure.

      For example, the following is the task log:

      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing executor 322 with no recent heartbeats: 256015 ms exceeds timeout 250000 ms
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 256015 ms
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing tasks for 322 from TaskSet 107.0
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): ExecutorLostFailure (executor 322 lost)
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 322 (epoch 11)
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 322 from BlockManagerMaster.
      2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 322 successfully in removeExecutor
      
      2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 229.0 in stage 107.0 (TID 10384) in 272315 ms on BJHC-HERA-16168.hadoop.jd.local (579/700)
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring task-finished event for 229.1 in stage 107.0 because task 229 has already completed successfully
      
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at mapPartitions at Exchange.scala:137)
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task set 107.1 with 3 tasks
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
      2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
      

      Driver will check the stage's result is not complete, and submit missing task, but this time, the next stage has run because previous stage has finish for its task is all finished although its result is not complete.

      2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
      2015-12-31 04:40:13 INFO org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 11
      2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
      2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
      2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      2015-12-31 04:40:13 INFO at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      2015-12-31 04:40:13 INFO at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
      2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      2015-12-31 04:40:13 INFO at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
      2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
      2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
      2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.SortShuffleReader.computeShuffleBlocks(SortShuffleReader.scala:301)
      2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.SortShuffleReader.read(SortShuffleReader.scala:111)
      2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.MixedShuffleReader.read(MixedShuffleReader.scala:41)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
      2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.Task.run(Task.scala:64)
      2015-12-31 04:40:13 INFO at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
      2015-12-31 04:40:13 INFO at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      2015-12-31 04:40:13 INFO at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      2015-12-31 04:40:13 INFO at java.lang.Thread.run(Thread.java:745)
      2015-12-31 04:40:13 INFO 
      

      As the task log show, in this situation, it will casue FetchFailedException.

      Attachments

        Activity

          People

            iward iward
            iward iward
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: