Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3463

Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.1
    • Fix Version/s: 0.23.1
    • Component/s: applicationmaster, mrv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn 4 Node cluster.
      First Ran Randowriter/Sort/Sort-validate successfully
      Then again sort, when job was 50% complete
      Login node running AppMaster, and killed AppMaster with kill -9
      On Client side failed with following:

      11/11/23 10:57:27 INFO mapreduce.Job:  map 58% reduce 8%
      11/11/23 10:57:27 INFO mapred.ClientServiceDelegate: Failed to contact AM/History for job job_1322040898409_0005 retrying..
      11/11/23 10:57:28 INFO mapreduce.Job:  map 0% reduce 0%
      11/11/23 10:57:37 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=UNDEFINED. Redirecting to job history server
      11/11/23 10:57:37 INFO client.ClientTokenSelector: Looking for a token with service <RM Host>:Port
      11/11/23 10:57:37 INFO client.ClientTokenSelector: Token kind is YARN_CLIENT_TOKEN and the token's service name is <New AM Host>:Port
      11/11/23 10:57:38 WARN mapred.ClientServiceDelegate: Error from remote end: Unknown job job_1322040898409_0005
      RemoteTrace: 
       at Local Trace: 
      	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Unknown job job_1322040898409_0005
      	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:151)
      	at $Proxy10.getTaskAttemptCompletionEvents(Unknown Source)
      	at org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClientImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBClientImpl.java:172)
      	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:273)
      	at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:320)
      	at org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:438)
      	at org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:621)
      	at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1231)
      	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1179)
      	at org.apache.hadoop.examples.Sort.run(Sort.java:181)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
      	at org.apache.hadoop.examples.Sort.main(Sort.java:192)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
      	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
      	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.apache.hadoop.util.RunJar.main(RunJar.java:189)
      

      On lookig RM logs found second AM was also lauched, it was saying -:

      011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1322040898409_0005_000002 State change from RUNNING to FINISHED
      2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1322040898409_0005 of type ATTEMPT_FINISHED
      2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1322040898409_0005 State change from RUNNING to FINISHED
      2011-11-23 10:57:37,737 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application appattempt_1322040898409_0005_000002 is done. finalState=FINISHED
      

      Now looking at AM logs and found Second AM was shutdown gracefully due to :-

      2011-11-23 10:57:37,640 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService: Sending assigned event to attempt_1322040898409_0005_m_000000_0
      2011-11-23 10:57:37,641 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
      java.lang.IllegalArgumentException: Invalid NodeId [<NMHostName>]. Expected host:port
              at org.apache.hadoop.yarn.util.ConverterUtils.toNodeId(ConverterUtils.java:144)
              at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.sendAssignedEvent(RecoveryService.java:410)
              at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$InterceptingEventHandler.handle(RecoveryService.java:314)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:1010)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl$RequestContainerTransition.transition(TaskAttemptImpl.java:985)
              at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:851)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:128)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:853)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:845)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
              at org.apache.hadoop.mapreduce.v2.app.recover.RecoveryService$RecoveryDispatcher.dispatch(RecoveryService.java:270)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
              at java.lang.Thread.run(Thread.java:619)
      2011-11-23 10:57:37,642 INFO [CompositeServiceShutdownHook for org.apache.hadoop.mapreduce.v2.app.MRAppMaster] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler
      
      1. MR3463_v1.txt
        1 kB
        Siddharth Seth
      2. MR3463_v2.txt
        11 kB
        Siddharth Seth

        Issue Links

          Activity

          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mahadev konar made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s 0.23.1 [ 12318883 ]
          Resolution Fixed [ 1 ]
          Siddharth Seth made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Siddharth Seth made changes -
          Attachment MR3463_v2.txt [ 12505698 ]
          Siddharth Seth made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Mahadev konar made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Mahadev konar made changes -
          Assignee Siddharth Seth [ sseth ]
          Siddharth Seth made changes -
          Attachment MR3463_v1.txt [ 12504978 ]
          Siddharth Seth made changes -
          Link This issue blocks MAPREDUCE-3450 [ MAPREDUCE-3450 ]
          Karam Singh made changes -
          Field Original Value New Value
          Summary AM fails to restart when first AM is killed Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job
          Karam Singh created issue -

            People

            • Assignee:
              Siddharth Seth
              Reporter:
              Karam Singh
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development