Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-8358

ResourceManager restart fail to recover due to TimelineServiceV1Publisher NPE

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.9.1
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None
    • Environment:

      Ubuntu 16.04

      java version "1.8.0_91"

      Description

      I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine for 2.7.3, but fails on 2.9.1.

      I'm using LevelDB as the RM state store, the problem seems related to TimelineServiceV1Publisher. If I set yarn.resourcemanager.system-metrics-publisher.enabled to false, then recovery works fine. But if the option is set to true, RM fails to start with the following log:

       

      2018-05-24 23:11:54,597 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started
      2018-05-24 23:11:54,673 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.1
      2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 12 RM delegation token master keys
      2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 RM delegation tokens
      2018-05-24 23:11:54,990 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 2099 applications and 2100 application attempts
      2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 reservations
      2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: recovering RMDelegationTokenSecretManager.
      2018-05-24 23:11:55,003 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2099 applications
      2018-05-24 23:11:55,107 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 0 out of 2099 applications
      2018-05-24 23:11:55,108 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)
      at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)
      at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)
      at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
      at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1183)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1183)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1223)
      at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1422)

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                cyfdecyf Chen Yufei
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: