Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml and started the YARN cluster.
      Submitted a Sleep job with 100K map tasks as follows:
      $HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/hadoop-test.jar sleep -m 100000 -r 0 -mt 1000 -rt 1000

      When around 53K tasks had finished, logged in to the node running the AppMaster and killed the AppMaster with kill -9.

      The ResourceManager tried to restart the AM up to max-retries times, but every attempt failed with the following:

      11/10/19 15:29:09 INFO mapreduce.Job: Job job_1319036155027_0002 failed with state FAILED due to: Application application_1319036155027_0002 failed 5 times due to AM Container for appattempt_1319036155027_0002_000005 exited with exitCode: -1000 due to: RemoteTrace:
      java.io.IOException: Resource hdfs://<NN>:<PORT>/user/<JOBUSER>/.staging/job_1319036155027_0002/appTokens changed on src filesystem (expected 1319037705427, was 1319037714496
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.copy(FSDownload.java:80)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.access$000(FSDownload.java:49)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:149)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:147)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:145)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:49)
                  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                  at java.lang.Thread.run(Thread.java:619)
       at LocalTrace: 
                  org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Resource hdfs://<NN>:<PORT>/user/<JOBUSER>/.staging/job_1319036155027_0002/appTokens changed on src filesystem (expected 1319037705427, was 1319037714496
                  at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
                  at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:798)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:483)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:228)
                  at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
                  at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
                  at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:343)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1486)
                  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:396)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
                  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1480)
      
      .Failing this attempt.. Failing the application.
      11/10/19 15:29:09 INFO mapreduce.Job: Counters: 0
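
      The "changed on src filesystem (expected ..., was ...)" message above reads as a modification-time check: the timestamp recorded for appTokens when the resource was registered no longer matches the timestamp of the file in the staging directory. The Python sketch below illustrates that kind of check against a local file and computes the gap between the two values in the log. It is not Hadoop's FSDownload code; the function names verify_unchanged and timestamp_drift_ms are hypothetical, and treating both numbers as epoch milliseconds is an assumption based on their magnitude.

```python
import os
import re
import tempfile


def verify_unchanged(path, expected_mtime_ms):
    """Raise IOError if the file's mtime (in ms) differs from the recorded one.

    Hypothetical helper: mimics the shape of the check, not Hadoop's code.
    """
    actual_ms = os.stat(path).st_mtime_ns // 1_000_000
    if actual_ms != expected_mtime_ms:
        raise IOError(
            f"Resource {path} changed on src filesystem "
            f"(expected {expected_mtime_ms}, was {actual_ms})"
        )


def timestamp_drift_ms(message):
    """Return (was - expected) parsed from an 'expected X, was Y' message."""
    match = re.search(r"expected (\d+), was (\d+)", message)
    if match is None:
        raise ValueError("no 'expected ..., was ...' pair found")
    return int(match.group(2)) - int(match.group(1))


def demo():
    """Rewrite a temp file's mtime to reproduce the mismatch locally."""
    recorded_ms = 1319037705427   # 'expected' value from the log
    rewritten_ms = 1319037714496  # 'was' value from the log
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # File as first written: the check passes silently.
        os.utime(path, ns=(recorded_ms * 1_000_000, recorded_ms * 1_000_000))
        verify_unchanged(path, recorded_ms)
        # File rewritten later: the same check now fails.
        os.utime(path, ns=(rewritten_ms * 1_000_000, rewritten_ms * 1_000_000))
        try:
            verify_unchanged(path, recorded_ms)
        except IOError as err:
            print("drift in ms:", timestamp_drift_ms(str(err)))
    finally:
        os.remove(path)


if __name__ == "__main__":
    demo()
```

      For the values in this report the difference is 9069 ms, i.e. the file carried a timestamp about nine seconds later than the one the failing attempt expected.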
      

      Attachments

        1. MAPREDUCE-3233.patch (1 kB, Mahadev Konar)


            People

              Assignee: Mahadev Konar (mahadev)
              Reporter: Karam Singh (karams)