  1. Hadoop Map/Reduce
  2. MAPREDUCE-6834

MR application fails with "No NMToken sent" exception after MRAppMaster recovery

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: resourcemanager, yarn
    • Labels:
      None
    • Environment:

      Centos 7

      Description

      Steps to reproduce:
      1) Submit an MR application (for example, the PI app with 50 containers)
      2) Find the MRAppMaster process id for the application
      3) Kill the MRAppMaster with the kill -9 command

      Expected: ResourceManager launches a new MRAppMaster container and MRAppAttempt, and the application finishes correctly.

      Actual: After the new MRAppMaster and MRAppAttempt are launched, the application fails with the following exception:

      2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1482408247195_0002_02_000011 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for node1:43037
      	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
      	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:244)
      	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
      	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
      	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
      	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      

      Problem:
      When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM generates NMTokens for the new RMAppAttempt. These NMTokens are returned to RMCommunicator in RegisterApplicationMasterResponse (getNMTokensFromPreviousAttempts method), but the RMCommunicator.register method does not handle them. The RM does not transmit these tokens again in response to later allocate requests, so they never reach NMTokenCache. As a result, we get the "No NMToken sent for node" exception.
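
      The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes and is not the attached patch: NodeToken, TokenCache, RegisterResponse and the cachePreviousTokens flag are hypothetical stand-ins for NMToken, NMTokenCache, RegisterApplicationMasterResponse and the missing handling in RMCommunicator.register. It only shows why a launcher lookup fails when the tokens returned at registration are never stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NMTokenRecoverySketch {

    /** Hypothetical stand-in for an NMToken: a node address plus an opaque token value. */
    static final class NodeToken {
        final String nodeAddress;
        final String secret;

        NodeToken(String nodeAddress, String secret) {
            this.nodeAddress = nodeAddress;
            this.secret = secret;
        }
    }

    /** Hypothetical stand-in for NMTokenCache: maps node address to token. */
    static final class TokenCache {
        private final Map<String, String> tokens = new ConcurrentHashMap<>();

        void setToken(String nodeAddress, String secret) {
            tokens.put(nodeAddress, secret);
        }

        boolean containsToken(String nodeAddress) {
            return tokens.containsKey(nodeAddress);
        }

        /** Mimics the launcher-side lookup that fails in the report. */
        String getTokenOrFail(String nodeAddress) {
            String secret = tokens.get(nodeAddress);
            if (secret == null) {
                throw new IllegalStateException("No NMToken sent for " + nodeAddress);
            }
            return secret;
        }
    }

    /** Hypothetical stand-in for the registration response carrying tokens from previous attempts. */
    static final class RegisterResponse {
        private final List<NodeToken> tokensFromPreviousAttempts;

        RegisterResponse(List<NodeToken> tokens) {
            this.tokensFromPreviousAttempts = new ArrayList<>(tokens);
        }

        List<NodeToken> getTokensFromPreviousAttempts() {
            return tokensFromPreviousAttempts;
        }
    }

    /**
     * Simulated register step. With cachePreviousTokens == false the tokens in the
     * response are ignored (the behaviour described in the report); with true they
     * are copied into the cache so later container launches can find them.
     */
    static void register(RegisterResponse response, TokenCache cache, boolean cachePreviousTokens) {
        if (!cachePreviousTokens) {
            return;
        }
        for (NodeToken token : response.getTokensFromPreviousAttempts()) {
            cache.setToken(token.nodeAddress, token.secret);
        }
    }

    public static void main(String[] args) {
        RegisterResponse response = new RegisterResponse(
                Arrays.asList(new NodeToken("node1:43037", "token-for-node1")));

        TokenCache ignored = new TokenCache();
        register(response, ignored, false);
        System.out.println("tokens ignored, cached: " + ignored.containsToken("node1:43037"));

        TokenCache stored = new TokenCache();
        register(response, stored, true);
        System.out.println("tokens stored, cached: " + stored.containsToken("node1:43037"));
    }
}
```

      In this sketch the only difference between the failing and the working path is whether the register step copies the returned tokens into the cache, which matches the gap described in the problem statement.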

      I have found that this issue appears after the changes in commit https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed

      I ran the same scenario without that commit, and the application completed successfully after MRAppMaster recovery.

        Attachments

        1. YARN-6019.001.patch
          6 kB
          Aleksandr Balitsky

              People

              • Assignee:
                abalitsky1 Aleksandr Balitsky
              • Reporter:
                abalitsky1 Aleksandr Balitsky
              • Votes:
                0
              • Watchers:
                8
