Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Fix Version/s: 2.9.0, 2.7.4, 2.6.6, 3.0.0-alpha4, 2.8.2
    • Component/s: security
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      This problem has troubled us for several years. In our HBase cluster, a regionserver (RS) occasionally gets stuck with the following exception:

      2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered while connecting to the server :
      javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
              at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
              at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
              at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
              at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
              at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
              at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
              at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
              at org.apache.hadoop.hbase.security.User.call(User.java:607)
              at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
              at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
              at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
              at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
              at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
              at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
              at $Proxy24.replicateLogEntries(Unknown Source)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
      Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)
              at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
              at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
              at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
              at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
              ... 23 more
      Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
              at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
              at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
              at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
              at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
              at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
              at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
              ... 26 more
      Caused by: KrbException: Identifier doesn't match expected value (906)
              at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
              at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
              at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
              at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
              ... 31 more
      

      It happens rarely, but when it does, the regionserver gets stuck and never recovers.

      Recently we added a log statement after each successful re-login that prints the private credentials, and finally caught the direct cause. After a successful re-login, the credentials contain two Kerberos tickets: the TGT and a service ticket. The strange thing is that the service ticket is placed before the TGT. This breaks an assumption in the JDK's Kerberos library. See the getTgt method in http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java:

      Krb5InitCredential
                  return AccessController.doPrivileged(
                      new PrivilegedExceptionAction<KerberosTicket>() {
                      public KerberosTicket run() throws Exception {
                          // It's OK to use null as serverPrincipal. TGT is almost
                          // the first ticket for a principal and we use list.
                          return Krb5Util.getTicket(
                              realCaller,
                              clientPrincipal, null, acc);
                              }});
      

      So here the library uses the service ticket as if it were the TGT when acquiring another service ticket, and the KDC rejects the request because the server principal of the 'TGT' does not start with 'krbtgt'. It can never recover because the re-login in UGI first checks whether there is a valid TGT, and we do have one, so no further re-login takes place.
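
The difference between a "first ticket wins" lookup and an order-independent one can be illustrated with a small sketch. This is not the JDK's code: tickets are modelled as plain strings holding the server principal, and the class name, method names and example principals are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustration only: each ticket is represented by its server principal name.
final class TgtLookupSketch {

    // A TGT is issued for the ticket-granting service, whose principal
    // has the form krbtgt/REALM@REALM.
    static boolean isTgt(String serverPrincipal) {
        return serverPrincipal.startsWith("krbtgt/");
    }

    // Mimics a lookup that assumes the TGT is the first ticket in the list.
    static String firstTicket(List<String> credentials) {
        return credentials.isEmpty() ? null : credentials.get(0);
    }

    // Order-independent alternative: scan for a ticket that really is a TGT.
    static Optional<String> firstTgt(List<String> credentials) {
        return credentials.stream().filter(TgtLookupSketch::isTgt).findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical credentials in the problematic order: service ticket first.
        List<String> credentials = Arrays.asList(
            "hbase/rs1.example.com@EXAMPLE.COM",
            "krbtgt/EXAMPLE.COM@EXAMPLE.COM");
        System.out.println("first ticket: " + firstTicket(credentials));
        System.out.println("first TGT:    " + firstTgt(credentials).orElse("none"));
    }
}
```

With the service ticket in front, the first lookup hands back the service ticket, while the scanning lookup still finds the real TGT.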

      This usually happens when a secure connection is being initialized at the same time as a re-login, and the end time of the service ticket indicates that it was acquired with the previous TGT. Since UGI does not prevent doAs and re-login from running at the same time, we believe there is a race condition.
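
A diagnostic of the kind described earlier, listing the Kerberos tickets held by a Subject together with their end times, could look roughly like the sketch below. It uses only the standard javax.security.auth API. The class and method names are hypothetical, and the assumption that iteration follows insertion order rests on JDK implementation behaviour rather than on a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;

// Illustration only: summarize the Kerberos tickets of a Subject for logging.
final class TicketDumpSketch {

    // Returns one line per Kerberos ticket, in the iteration order of the
    // private credentials set.
    static List<String> describeTickets(Subject subject) {
        List<String> lines = new ArrayList<>();
        for (Object credential : subject.getPrivateCredentials()) {
            if (!(credential instanceof KerberosTicket)) {
                continue; // skip keytabs, keys and other credential types
            }
            KerberosTicket ticket = (KerberosTicket) credential;
            if (ticket.isDestroyed()) {
                lines.add("destroyed ticket");
            } else {
                lines.add("server=" + ticket.getServer().getName()
                    + ", endTime=" + ticket.getEndTime());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // An empty Subject stands in for the logged-in user here, so nothing
        // is printed; in a real process the Subject would come from the login.
        for (String line : describeTickets(new Subject())) {
            System.out.println(line);
        }
    }
}
```

If the first line reported after a re-login names a server other than krbtgt/..., the credentials are in the problematic order.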

      After reading the code, we found a possible race.

      In the initSecContext method (see http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java), the library gets the TGT first, then checks whether a service ticket already exists. If not, it acquires a service ticket using the TGT and puts it into the credentials.

      In Krb5LoginModule.logout (the Sun version), the Kerberos tickets are first removed from the credentials and then destroyed.

      Here comes the race condition. Let T1 be the thread that sets up the secure connection and T2 be the re-login thread.

      T1: get TGT
      T2: remove all tickets from credentials
      T1: check service ticket, none(since all tickets have been removed)
      T1: acquire a new service ticket using TGT and put it into the credentials
      T2: destroy all tickets
      T2: login, i.e., put a new TGT into the credentials.
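
The interleaving above can be replayed deterministically on a single thread to see the resulting ticket order. This is a simulation only: the credentials are an ordinary list of strings, no Kerberos classes are involved, and every name and principal in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: replays the T1/T2 steps in the order listed above.
final class ReloginRaceSketch {
    static final String OLD_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM#old";
    static final String NEW_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM#new";
    static final String SERVICE = "hbase/rs1.example.com@EXAMPLE.COM#from-old-tgt";

    static List<String> replay() {
        // Shared credentials before the race: only the old TGT.
        List<String> credentials = new ArrayList<>();
        credentials.add(OLD_TGT);

        // T1: get TGT
        String tgtHeldByT1 = credentials.get(0);

        // T2: remove all tickets from credentials
        List<String> removedByT2 = new ArrayList<>(credentials);
        credentials.clear();

        // T1: check service ticket, none (all tickets have been removed)
        boolean hasServiceTicket = credentials.contains(SERVICE);

        // T1: acquire a new service ticket using the TGT it still holds
        //     and put it into the credentials
        if (!hasServiceTicket && tgtHeldByT1 != null) {
            credentials.add(SERVICE);
        }

        // T2: destroy the tickets it removed (the new service ticket is untouched)
        removedByT2.clear();

        // T2: login, i.e., put a new TGT into the credentials
        credentials.add(NEW_TGT);

        return credentials;
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

The returned list holds the service ticket first and the new TGT second, which matches the order seen in the logged credentials.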

      It is hard to write a unit test that reproduces the problem because the racing code is inside the JDK, which we do not control.

      Suggestions are welcome. Thanks.

        Attachments

        1. HADOOP-13433.patch
          13 kB
          Duo Zhang
        2. HADOOP-13433-branch-2.7.patch
          6 kB
          Duo Zhang
        3. HADOOP-13433-branch-2.7-v1.patch
          6 kB
          Duo Zhang
        4. HADOOP-13433-branch-2.7-v2.patch
          6 kB
          Duo Zhang
        5. HADOOP-13433-branch-2.8.patch
          6 kB
          Xiao Chen
        6. HADOOP-13433-branch-2.8.patch
          6 kB
          Duo Zhang
        7. HADOOP-13433-branch-2.8-v1.patch
          6 kB
          Duo Zhang
        8. HADOOP-13433-branch-2.patch
          6 kB
          Duo Zhang
        9. HADOOP-13433-v1.patch
          13 kB
          Duo Zhang
        10. HADOOP-13433-v2.patch
          13 kB
          Duo Zhang
        11. HADOOP-13433-v4.patch
          20 kB
          Duo Zhang
        12. HADOOP-13433-v5.patch
          19 kB
          Duo Zhang
        13. HADOOP-13433-v6.patch
          19 kB
          Duo Zhang
        14. HBASE-13433-testcase-v3.patch
          7 kB
          Duo Zhang


              People

              • Assignee:
                Apache9 Duo Zhang
                Reporter:
                Apache9 Duo Zhang
              • Votes:
                0
                Watchers:
                32
