Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-10046

RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 3.0.0
    • None
    • resourcemanager
    • None

    Description

       

      CDH Distribution: Hadoop 3.0.0-cdh6.0.1

      The exception stack is as follows.

      2019-12-12 17:09:41,422 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:521)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1221)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1265)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1206)
      at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:907)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1046)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2000(RMAppImpl.java:118)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1110)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1051)
      at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:875)
      at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:357)
      at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:544)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1393)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
      at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1146)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1186)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1182)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1182)
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
      at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
      at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
      at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
      at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
      at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:491)

       

      During the recovery of Application attempts, the status of one app attempt is NULL. The following LOG  describes:

      2019-12-12 17:09:41,381 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1576136386231_0742 with 1 attempts and final state = NONE2019-12-12 17:09:41,381 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1576136386231_0742 with 1 attempts and final state = NONE

       

      The corresponding code in rmapp/RMAppImpl.java is

      if (recoveredFinalState == null)

      { LOG.info(String.format(RECOVERY_MESSAGE, getApplicationId(), appState.getAttemptCount(), "NONE")); }

      else if (LOG.isDebugEnabled())

      { LOG.debug(String.format(RECOVERY_MESSAGE, getApplicationId(), appState.getAttemptCount(), recoveredFinalState)); }

       

      In rmapp/attempt/RMAppAttemptImpl.java, there is a piece of code using the RMAppAttemptState, which is NULL.

       

      private static class BaseFinalTransition extends BaseTransition {

      private final RMAppAttemptState finalAttemptState;

      public BaseFinalTransition(RMAppAttemptState finalAttemptState)

      { this.finalAttemptState = finalAttemptState; }

      @Override
      public void transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
      ApplicationAttemptId appAttemptId = appAttempt.getAppAttemptId();

      // Tell the AMS. Unregister from the ApplicationMasterService
      appAttempt.masterService.unregisterAttempt(appAttemptId);

      // Tell the application and the scheduler
      ApplicationId applicationId = appAttemptId.getApplicationId();
      RMAppEvent appEvent = null;
      boolean keepContainersAcrossAppAttempts = false;
      switch (finalAttemptState) {
      case FINISHED:
      {

      In the switch clause,  java.lang.NullPointerException is thrown because finalAttemptState is NULL.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              yxing Yong Xing
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: