Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-21993

HS/HMS delegationstore with ZK can degrade performance when jute.maxBuffer is reached

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.2.0, 2.3.4, 3.0.0, 4.0.0
    • None
    • Hive
    • None

    Description

      DelegationStore can be configured to run in-mem/DB/ZK based TokenStores. TokenStoreDelegationTokenSecretManager purges the tokens (>24 hours) periodically every 1 hour by default.

      Issue:
      When large number of delegation tokens are present in ZK, TokenStoreDelegationTokenSecretManager::removeExpiredTokens can throw the following exception when connecting to ZK. When packet length is greater than jute.maxBuffer, it ends up throwing this exception and it reconnects the connection.

      WARN [main-SendThread(xyz:2181)]: org.apache.zookeeper.ClientCnxn: Session 0x36a161083865cd9 for server xyz/1.2.3.4:2181, unexpected error, closing socket connection and attempting reconnect
      java.io.IOException: Packet len68985070 is out of range!
              at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112) ~[zookeeper-3.4.6.jar]
              at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79) ~[zookeeper-3.4.6.jar]
              at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[zookeeper-3.4.6.jar]
              at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [zookeeper-3.4.6.jar]
      
      ...
      ...
      
       INFO [main-EventThread]: org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
       ERROR [Thread[Thread-13,5,main]]: org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager: ExpiredTokenRemover thread received unexpected exception. org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
      org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error getting children for /hivedelegationMETASTORE/tokens
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:280) ~[hive-exec-x.y.z.jar]
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.getAllDelegationTokenIdentifiers(ZooKeeperTokenStore.java:413) ~[hive-exec-x.y.z.jar]
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.removeExpiredTokens(TokenStoreDelegationTokenSecretManager.java:238) ~[hive-exec-x.y.z.jar]
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager$ExpiredTokenRemover.run(TokenStoreDelegationTokenSecretManager.java:309) [hive-exec-x.y.z.jar]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hivedelegationMETASTORE/tokens
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar]
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.6.jar]
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) ~[zookeeper-3.4.6.jar]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:214) ~[curator-framework-2.7.1.jar:?]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:203) ~[curator-framework-2.7.1.jar:?]
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:?]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:200) ~[curator-framework-2.7.1.jar:?]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:191) ~[curator-framework-2.7.1.jar:?]
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:38) ~[curator-framework-2.7.1.jar:?]
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkGetChildren(ZooKeeperTokenStore.java:278) ~[hive-exec-x.y.z.jar]
              ... 4 more        
      

      However, the same ZK client is being used for addToken, removeToken calls which are run in different threads. This creates problems when creating/deleting tokens.

      1. Issue in creating tokens:

      New token is added at the same time, when ZK client is in suspended state (due to above mentioned reason). Node is already created by curator, but before it could verify connection gets into stale state. So curator framework retries and ends up with the following exception. So creating tokens fails often.

      https://github.com/apache/curator/blob/apache-curator-2.7.0/curator-framework/src/main/java/org/apache/curator/framework/imps/CreateBuilderImpl.java#L461

      Caused by: org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error creating new node with path /hivedelegationMETASTORE/tokens/AAVwaXBlcgVwaXBlckdoaXZlL2hhZG9vcHNlcnZpY2UyNy1waHgyLnByb2QudWJlci5pbnRlcm5hbEBEQVRBU1JFLlBST0QuVUJFUi5JTlRFUk5BTIoBa9UkJgmKAWv5MKoJjQEo9Y4IUQ
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.addToken(ZooKeeperTokenStore.java:382)
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.createPassword(TokenStoreDelegationTokenSecretManager.java:142)
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.createPassword(TokenStoreDelegationTokenSecretManager.java:56)
              at org.apache.hadoop.security.token.Token.<init>(Token.java:62)
              at org.apache.hadoop.hive.thrift.DelegationTokenSecretManager.getDelegationToken(DelegationTokenSecretManager.java:109)
              at org.apache.hadoop.hive.thrift.HiveDelegationTokenManager$1.run(HiveDelegationTokenManager.java:131)
              at org.apache.hadoop.hive.thrift.HiveDelegationTokenManager$1.run(HiveDelegationTokenManager.java:127)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
              at org.apache.hadoop.hive.thrift.HiveDelegationTokenManager.getDelegationToken(HiveDelegationTokenManager.java:127)
              at org.apache.hadoop.hive.metastore.HiveMetaStore.getDelegationToken(HiveMetaStore.java:7060)
              at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_delegation_token(HiveMetaStore.java:5970)
              ... 20 more
      Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hivedelegationMETASTORE/tokens/AAVwaXBlcgVwaXBlckdoaXZlL2hhZG9vcHNlcnZpY2UyNy1waHgyLnByb2QudWJlci5pbnRlcm5hbEBEQVRBU1JFLlBST0QuVUJFUi5JTlRFUk5BTIoBa9UkJgmKAWv5MKoJjQEo9Y4IUQ
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
              at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:691)
              at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:675)
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
              at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:672)
              at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
              at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
              at org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:251)
              at org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:205)
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.addToken(ZooKeeperTokenStore.java:380)
              ... 32 more
      

      2. Issue in deleting tokens:

      Deleting tokens ends up with following exception when expiry thread breaks the ZK client. So it does not clear up the old tokens and keeps building more pressure on packet length.

      
      Caused by: org.apache.hadoop.hive.thrift.DelegationTokenStore$TokenStoreException: Error deleting /hivedelegationMETASTORE/tokens/AAVwaXBlcgVwaXBlckdoaXZlL2hhZG9vcHNlcnZpY2UyNS1waHgyLnByb2QudWJlci5pbnRlcm5hbEBEQVRBU1JFLlBST0QuVUJFUi5JTlRFUk5BTIoBa9ex1FWKAWv7vlhVjQGHdI4IUA
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkDelete(ZooKeeperTokenStore.java:340)
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.removeToken(ZooKeeperTokenStore.java:392)
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.cancelToken(TokenStoreDelegationTokenSecretManager.java:122)
              at org.apache.hadoop.hive.thrift.TokenStoreDelegationTokenSecretManager.cancelToken(TokenStoreDelegationTokenSecretManager.java:56)
              at org.apache.hadoop.hive.thrift.DelegationTokenSecretManager.cancelDelegationToken(DelegationTokenSecretManager.java:90)
              at org.apache.hadoop.hive.thrift.HiveDelegationTokenManager.cancelDelegationToken(HiveDelegationTokenManager.java:152)
              at org.apache.hadoop.hive.metastore.HiveMetaStore.cancelDelegationToken(HiveMetaStore.java:7049)
              at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.cancel_delegation_token(HiveMetaStore.java:5916)
              ... 20 more
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hivedelegationMETASTORE/tokens/AAVwaXBlcgVwaXBlckdoaXZlL2hhZG9vcHNlcnZpY2UyNS1waHgyLnByb2QudWJlci5pbnRlcm5hbEBEQVRBU1JFLlBST0QuVUJFUi5JTlRFUk5BTIoBa9ex1FWKAWv7vlhVjQGHdI4IUA
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
              at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
              at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238)
              at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233)
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
              at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
              at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214)
              at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41)
              at org.apache.hadoop.hive.thrift.ZooKeeperTokenStore.zkDelete(ZooKeeperTokenStore.java:336)
              ... 27 more
      
      

      And this process repeats every 5 seconds due to expiry thread setting

      https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/security/TokenStoreDelegationTokenSecretManager.java#L312

      lastTokenCacheCleanup never gets reset as well; so expiring tokens starts happening every 5 seconds instead of every hour https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/security/TokenStoreDelegationTokenSecretManager.java#L309

      Overall, system starts to run in degraded mode as this happens every 5 seconds.

      1. https://issues.apache.org/jira/browse/ZOOKEEPER-2933 tries to provide a way to monitor jute.maxBuffer. Not sure if curator frameworks make use of this recently.
      2. Another option is to update lastTokenCacheCleanup just before invoking removeExpiredTokens. This way, it still honours the expiry removal interval. In the exception, it would be good to add details about jute.maxBuffer, so that users can configure it to a higher value during next restart.
      3. Another option is to set jute.maxBuffer dynamically. 200 MB max limit would allow to clear up ~1.6 million entries. Additional message in the exception handling can give more info on this.

      Attachments

        Activity

          People

            Unassigned Unassigned
            rajesh.balamohan Rajesh Balamohan
            Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: