Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.7.1
-
None
-
Reviewed
Description
The scenario is that RM failover was initially enabled, so the zkRootNodeAcl is by default set with the RM ID in the ACL string
If RM failover is then switched to be disabled, it cannot load data from ZK and fail with noAuth error. After I reset the root node ACL, it again can access.
15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to load/recover state org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth at org.apache.zookeeper.KeeperException.create(KeeperException.java:113) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159) at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44) at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129) at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
the problem may be that in non-failover mode, RM doesn't use the RM-ID to connect with ZK and thus fail with no Auth error.
We should be able to switch failover on and off with no interruption to the user.