Hadoop YARN / YARN-4127

RM fails with NoAuth error when switched from failover mode to non-failover mode

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The scenario is that RM failover was initially enabled, so the zkRootNodeAcl is by default set with the RM ID in the ACL string.

      If RM failover is then disabled, the RM cannot load data from ZK and fails with a NoAuth error. After I reset the root node ACL, the RM can access the data again.

      15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to load/recover state
      org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
        at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
        at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
      

      The problem may be that in non-failover mode the RM doesn't use the RM ID to authenticate with ZK, and thus fails with the NoAuth error.
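      The suspected mechanism can be illustrated with a small, self-contained model of a ZooKeeper-style ACL check. This is only a sketch under stated assumptions, not the ZKRMStateStore code: the class `RootNodeAclSketch`, its methods, and the exact ACL layout (everyone gets READ/WRITE/ADMIN, only the RM's digest identity gets CREATE/DELETE) are hypothetical, and the digest identity is reduced to a plain string. The permission bit values match ZooKeeper's `ZooDefs.Perms`. It also assumes the failing transaction needs CREATE or DELETE on the root node.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class RootNodeAclSketch {
    // Permission bits, same values as ZooKeeper's ZooDefs.Perms.
    public static final int READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16;

    /** One ACL entry: scheme ("world" or "digest"), identity, permission bits. */
    public static final class AclEntry {
        final String scheme;
        final String id;
        final int perms;

        AclEntry(String scheme, String id, int perms) {
            this.scheme = scheme;
            this.id = id;
            this.perms = perms;
        }
    }

    /** Hypothetical root-node ACL as it might look after running with failover enabled. */
    public static List<AclEntry> haRootNodeAcl(String rmId) {
        return Arrays.asList(
            new AclEntry("world", "anyone", READ | WRITE | ADMIN),
            new AclEntry("digest", rmId, CREATE | DELETE));
    }

    /** True if any ACL entry matching the session grants the requested permission. */
    public static boolean allowed(List<AclEntry> acl, Set<String> sessionDigestIds, int perm) {
        for (AclEntry e : acl) {
            boolean matches = e.scheme.equals("world")
                || (e.scheme.equals("digest") && sessionDigestIds.contains(e.id));
            if (matches && (e.perms & perm) != 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<AclEntry> acl = haRootNodeAcl("rm1");
        Set<String> noAuth = Collections.emptySet();          // non-failover RM: no digest added
        Set<String> withAuth = Collections.singleton("rm1");  // failover RM: digest added

        System.out.println("non-failover READ:   " + allowed(acl, noAuth, READ));
        System.out.println("non-failover CREATE: " + allowed(acl, noAuth, CREATE));
        System.out.println("failover CREATE:     " + allowed(acl, withAuth, CREATE));
    }
}
```

      In this model, a session that never presents the RM's digest can still read the root node but cannot create or delete children under it, which is consistent with the reported NoAuth failure and with access returning once the root node ACL is reset.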

      We should be able to switch failover on and off with no interruption to the user.

        Attachments

        1. YARN-4127.01.patch
          6 kB
          Varun Saxena
        2. YARN-4127.02.patch
          6 kB
          Varun Saxena
        3. YARN-4127-branch-2.7.01.patch
          8 kB
          Varun Saxena
        4. YARN-4127-branch-2.7.02.patch
          8 kB
          Varun Saxena

            People

            • Assignee:
              varun_saxena Varun Saxena
              Reporter:
              jianhe Jian He
            • Votes:
              0
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved: