Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 2.9.1
Fix Version/s: None
Component/s: None
Environment: Ubuntu 16.04, java version "1.8.0_91"
Description
I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine for 2.7.3, but fails on 2.9.1.
I'm using LevelDB as the RM state store. The problem seems related to TimelineServiceV1Publisher: if I set yarn.resourcemanager.system-metrics-publisher.enabled to false, recovery works fine, but if it is set to true, the RM fails to start with the following log:
2018-05-24 23:11:54,597 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery started
2018-05-24 23:11:54,673 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.1
2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 12 RM delegation token master keys
2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 RM delegation tokens
2018-05-24 23:11:54,990 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 2099 applications and 2100 application attempts
2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: Recovered 0 reservations
2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: recovering RMDelegationTokenSecretManager.
2018-05-24 23:11:55,003 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2099 applications
2018-05-24 23:11:55,107 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 0 out of 2099 applications
2018-05-24 23:11:55,108 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)
	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1183)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1183)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1223)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1422)
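The trace shows the NullPointerException raised in TimelineServiceV1Publisher.appCreated while RMAppImpl.recover replays an application, and nothing between RMAppManager.recover and ResourceManager.recover handles it, which fits the "Successfully recovered 0 out of 2099 applications" line. Below is a hypothetical, simplified Java model of that failure mode, not Hadoop's code: RecoverySketch, RecoveredApp, DEFAULT_PRIORITY and the publish/recover methods are invented for illustration. It assumes, based only on the title of the linked YARN-8068, that state written for applications submitted by Hadoop 2.7 clients carries no priority value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the failure in the stack trace above.
// None of these classes or methods exist in Hadoop; they only mimic its shape.
public class RecoverySketch {

    // Stand-in for an application record read back from the RM state store.
    // Assumption: records for apps submitted by Hadoop 2.7 clients have no priority.
    static final class RecoveredApp {
        final String id;
        final Integer priority; // null when the submitting client never set one

        RecoveredApp(String id, Integer priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    // Hypothetical fallback used when no priority was recorded.
    static final int DEFAULT_PRIORITY = 0;

    // Unguarded "publish": unboxing a null Integer throws NullPointerException,
    // analogous to the NPE raised in TimelineServiceV1Publisher.appCreated.
    static int publishUnguarded(RecoveredApp app) {
        int p = app.priority; // NPE here if priority is null
        return p;
    }

    // Guarded "publish": substitutes a default instead of dereferencing null.
    static int publishGuarded(RecoveredApp app) {
        return app.priority != null ? app.priority : DEFAULT_PRIORITY;
    }

    // Recovery loop: nothing catches a per-app exception, so the first failure
    // aborts recovery of every remaining app. The finally block still reports
    // how many apps got through before the exception propagated.
    static int recoverAll(List<RecoveredApp> apps, boolean guarded) {
        int recovered = 0;
        try {
            for (RecoveredApp app : apps) {
                if (guarded) {
                    publishGuarded(app);
                } else {
                    publishUnguarded(app);
                }
                recovered++;
            }
        } finally {
            System.out.println("Successfully recovered " + recovered
                    + " out of " + apps.size() + " applications");
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<RecoveredApp> apps = new ArrayList<>();
        apps.add(new RecoveredApp("app_0001", null)); // e.g. submitted by a 2.7 client
        apps.add(new RecoveredApp("app_0002", 5));

        int n = recoverAll(apps, true);
        System.out.println("guarded recovery finished: " + n + " apps");

        try {
            recoverAll(apps, false);
        } catch (NullPointerException e) {
            System.out.println("unguarded recovery aborted with NullPointerException");
        }
    }
}
```

In this model the guarded loop reports 2 of 2 applications recovered, while the unguarded loop stops at the first record and reports 0, the same pattern as in the log above. It is also consistent with the observation that setting yarn.resourcemanager.system-metrics-publisher.enabled to false avoids the failure: if the publisher is never called, the missing field is never dereferenced.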
Issue Links
duplicates
YARN-8068 Application Priority field causes NPE in app timeline publish when Hadoop 2.7 based clients to 2.8+ (Resolved)