Hadoop HDFS / HDFS-11590

Nodemanagers DDoS our namenode when an HDFS_DELEGATION_TOKEN is expired or not in the cache

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None
    • Environment:
      Releases:
      cloudera release cdh-5.5.0
      openjdk version "1.8.0_91"
      linux centos6 servers

      Cluster info:
      Namenode and resourcemanager in HA with kerberos authentication
      More than 1300 datanodes/nodemanagers

    Description

      Our namenode suffered severe slowdowns because all our nodemanagers kept retrying a lease renewal, reconnecting to the namenode every second for one hour, after some HDFS_DELEGATION_TOKEN had expired or was no longer in the cache.
      During this period the number of connections in TIME_WAIT on our namenode stayed pinned at the configured maximum of 250k because of these repeated reconnections.

        2017-03-02 11:51:42,817 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1488396860014_156103_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
        2017-03-02 11:51:43,414 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1488396860014_156120_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
        2017-03-02 11:51:51,994 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:prediction (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
        2017-03-02 11:51:51,995 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
        2017-03-02 11:51:51,995 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:prediction (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
        2017-03-02 11:51:51,995 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_1560141256_4187204] for 30 seconds.  Will retry shortly ...
        token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
           at org.apache.hadoop.ipc.Client.call(Client.java:1472)
           at org.apache.hadoop.ipc.Client.call(Client.java:1403)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
           at com.sun.proxy.$Proxy20.renewLease(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
           at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
           at com.sun.proxy.$Proxy21.renewLease(Unknown Source)
           at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:921)
           at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
           at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
           at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
           at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
           at java.lang.Thread.run(Thread.java:745)
      
      
        2017-03-02 12:51:22,032 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:prediction (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found in cache
        2017-03-02 12:51:22,032 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found in cache
        2017-03-02 12:51:22,033 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:prediction (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found in cache
        2017-03-02 12:51:22,033 WARN org.apache.hadoop.hdfs.DFSClient: Failed to renew lease for DFSClient_NONMAPREDUCE_1560141256_4187204 for 3600 seconds (>= hard-limit =3600 seconds.) Closing all files being written ...
        token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found in cache
           at org.apache.hadoop.ipc.Client.call(Client.java:1472)
           at org.apache.hadoop.ipc.Client.call(Client.java:1403)
           at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
           at com.sun.proxy.$Proxy20.renewLease(Unknown Source)
           at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
           at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
           at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
           at com.sun.proxy.$Proxy21.renewLease(Unknown Source)
           at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:921)
           at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
           at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
           at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
           at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
           at java.lang.Thread.run(Thread.java:745)
        2017-03-02 12:51:27,364 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
      

      The root cause was that the YARN proxy configuration had been removed, which left the resource manager unable to renew the HDFS_DELEGATION_TOKEN.
      Even though the root cause has been identified, I don't think it is normal to retry a lease renewal every second for an hour when the token is expired or not found, because this is not a failure the client can recover from.
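
      To illustrate the behavior argued for above, here is a minimal, self-contained sketch of a renewal loop that stops immediately on an invalid-token failure and only retries errors that might be transient. It is not the content of the attached patches and does not use Hadoop classes: LeaseRenewSketch, InvalidTokenException (a stand-in for SecretManager$InvalidToken), RenewCall, Outcome and renewWithPolicy are all hypothetical names chosen for this example.

```java
import java.io.IOException;

public class LeaseRenewSketch {

    /** Hypothetical stand-in for an "expired" / "can't be found in cache" token error. */
    static class InvalidTokenException extends IOException {
        InvalidTokenException(String msg) { super(msg); }
    }

    /** Possible results of the renewal loop. */
    enum Outcome { RENEWED, ABORTED_INVALID_TOKEN, GAVE_UP }

    /** One renewal attempt; throws on failure. */
    interface RenewCall { void renew() throws IOException; }

    /**
     * Tries to renew up to maxAttempts times.
     * An invalid token aborts at once, since retrying with the same token cannot succeed.
     * Any other IOException is treated as possibly transient and retried after a pause.
     */
    static Outcome renewWithPolicy(RenewCall call, int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.renew();
                return Outcome.RENEWED;
            } catch (InvalidTokenException e) {
                // Non-recoverable: do not reconnect again.
                return Outcome.ABORTED_INVALID_TOKEN;
            } catch (IOException e) {
                // Possibly transient: wait before the next attempt, if any remain.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated renewal that always fails with an expired token.
        Outcome outcome = renewWithPolicy(() -> {
            calls[0]++;
            throw new InvalidTokenException("token is expired");
        }, 3600, 1000L);
        System.out.println(outcome + " after " + calls[0] + " attempt(s)");
    }
}
```

      With a policy of this kind, a client holding an expired token would contact the namenode once instead of once per second until the one-hour hard limit is reached.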

      Attachments

        1. HDFS-11590.patch
          4 kB
          Nicolas Fraison
        2. HDFS-11590.003.patch
          7 kB
          Nicolas Fraison
        3. HDFS-11590.002.patch
          7 kB
          Nicolas Fraison
        4. HDFS-11590.001.patch
          5 kB
          Nicolas Fraison

            People

              Assignee: Unassigned
              Reporter: Nicolas Fraison (nfraison.criteo)
              Votes: 0
              Watchers: 5