Spark / SPARK-14527

Job can't finish when all NodeManagers are restarted while using the external shuffle service


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Shuffle, Spark Core, YARN
    • Labels:
      None

      Description

      1) Submit a WordCount app.
      2) Stop all NodeManagers while the ShuffleMapStage is running.
      3) After a few minutes, restart all NodeManagers.

      Now the job fails at the ResultStage, retries the ShuffleMapStage, and then the ResultStage fails again. It keeps running in this loop and the job can never finish.

      This happens because when all NMs are stopped, the containers (and executors) stay alive, but the executor registration info stored on the NM (in the YarnShuffleService) is lost. So even after all the NMs come back, tasks in the ResultStage fail when they try to fetch shuffle data.

      16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, mapId=4, reduceId=2, message=
      org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1459927459378_0005, execId=3)
      ...
      16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have all completed, from pool
      16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at wordcountWithSave.scala:32) due to fetch failure
      
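For context, the external shuffle service referenced in this report is the standard Spark-on-YARN setup: executors register their shuffle files with a `YarnShuffleService` auxiliary service running inside each NodeManager, which is what makes the registration state vulnerable to an NM restart. A minimal sketch of that configuration (property names are the standard Spark/YARN ones; values are illustrative):

```
# spark-defaults.conf — executors register with the NM-hosted shuffle service
spark.shuffle.service.enabled  true
```

```xml
<!-- yarn-site.xml on every NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

With this setup, the "Executor is not registered" error in the log above is exactly what a reducer sees when the shuffle service has forgotten the executor registrations that were made before the NM went down.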

Attachments

Issue Links

Activity

People

    • Assignee: Unassigned
    • Reporter: Sephiroth-Lin Weizhong
    • Votes: 0
    • Watchers: 1
