Hadoop YARN / YARN-4127

RM fails with NoAuth error if switched from failover mode to non-failover mode

Details

    • Reviewed

    Description

      The scenario: RM failover was initially enabled, so zkRootNodeAcl is set by default with the RM ID in the ACL string.

      If RM failover is then disabled, the RM cannot load data from ZK and fails with a NoAuth error. After I reset the root node ACL, the RM could access the data again.

      15/09/08 14:28:34 ERROR resourcemanager.ResourceManager: Failed to load/recover state
      org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44)
        at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129)
        at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
        at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:122)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$SafeTransaction.commit(ZKRMStateStore.java:1009)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.safeSetData(ZKRMStateStore.java:985)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.getAndIncrementEpoch(ZKRMStateStore.java:374)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:579)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:973)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1014)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1010)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1010)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1050)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1194)
      

      The problem may be that in non-failover mode the RM doesn't use the RM ID to connect to ZK, and thus fails with the NoAuth error.
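
      To illustrate the suspected cause, here is a minimal, self-contained sketch. It is a toy model of a ZooKeeper-style ACL check, not the ZKRMStateStore code. The class RootNodeAclSketch, its methods, and the plain-string identity "rm1" are all hypothetical; real digest ACLs store a hashed user:password pair. The model assumes the HA-mode root ACL restricts CREATE and DELETE to the RM identity while leaving the other permissions open to everyone. Only the permission bit values are taken from ZooKeeper's ZooDefs.Perms.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a ZooKeeper-style ACL check; hypothetical, for illustration only. */
public class RootNodeAclSketch {
    // Permission bits, matching the values in ZooKeeper's ZooDefs.Perms.
    static final int READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16;
    static final int ALL = READ | WRITE | CREATE | DELETE | ADMIN;

    /** One ACL entry: scheme ("world" or "digest"), identity, permission mask. */
    static final class Acl {
        final String scheme;
        final String id;
        final int perms;

        Acl(String scheme, String id, int perms) {
            this.scheme = scheme;
            this.id = id;
            this.perms = perms;
        }
    }

    /**
     * Assumed shape of the root node ACL written while failover is enabled:
     * anyone keeps READ/WRITE/ADMIN, only the RM identity may CREATE/DELETE.
     */
    static List<Acl> haRootNodeAcl(String rmId) {
        List<Acl> acls = new ArrayList<>();
        acls.add(new Acl("world", "anyone", ALL & ~(CREATE | DELETE)));
        acls.add(new Acl("digest", rmId, CREATE | DELETE));
        return acls;
    }

    /**
     * Returns true if a session authenticated as authId (null means no auth
     * info was added) is granted every permission bit in wanted.
     */
    static boolean allowed(List<Acl> acls, String authId, int wanted) {
        int granted = 0;
        for (Acl acl : acls) {
            boolean matches = acl.scheme.equals("world")
                    || (acl.scheme.equals("digest") && acl.id.equals(authId));
            if (matches) {
                granted |= acl.perms;
            }
        }
        return (granted & wanted) == wanted;
    }

    public static void main(String[] args) {
        List<Acl> rootAcl = haRootNodeAcl("rm1");
        // Failover mode: the session carries the RM identity.
        System.out.println("HA session create/delete: "
                + allowed(rootAcl, "rm1", CREATE | DELETE));
        // Non-failover mode: no RM identity is presented, so create/delete is refused.
        System.out.println("non-HA session create/delete: "
                + allowed(rootAcl, null, CREATE | DELETE));
        // Reads are still open to everyone under this assumed ACL.
        System.out.println("non-HA session read: " + allowed(rootAcl, null, READ));
    }
}
```

      Under these assumptions, a session without the RM identity can still read under the root node but is refused any operation that needs CREATE or DELETE. That would match a NoAuth failure appearing only after failover is switched off while the old root node ACL is still in place.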

      We should be able to switch failover on and off with no interruption to the user.

      Attachments

        1. YARN-4127.01.patch
          6 kB
          Varun Saxena
        2. YARN-4127.02.patch
          6 kB
          Varun Saxena
        3. YARN-4127-branch-2.7.01.patch
          8 kB
          Varun Saxena
        4. YARN-4127-branch-2.7.02.patch
          8 kB
          Varun Saxena

        Activity

          People

            varun_saxena Varun Saxena
            jianhe Jian He
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: