Description
This problem has troubled us for several years. In our HBase cluster, a RegionServer (RS) occasionally gets stuck with the following error:
2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
    at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
    at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
    at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
    at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
    at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
    at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
    at org.apache.hadoop.hbase.security.User.call(User.java:607)
    at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
    at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
    at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
    at $Proxy24.replicateLogEntries(Unknown Source)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)
    at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
    ... 23 more
Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
    at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
    at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
    at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
    at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
    at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
    at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
    ... 26 more
Caused by: KrbException: Identifier doesn't match expected value (906)
    at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
    at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
    at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
    at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
    ... 31 more
This rarely happens, but when it does, the RegionServer gets stuck and never recovers.
Recently we added a log statement after each successful re-login that prints the private credentials, and finally caught the direct cause. After a successful re-login, the credentials contain two Kerberos tickets: the TGT and a service ticket. Strangely, the service ticket is placed before the TGT, which breaks an assumption in the JDK's Kerberos library. See the getTgt method in http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java:
    return AccessController.doPrivileged(
        new PrivilegedExceptionAction<KerberosTicket>() {
            public KerberosTicket run() throws Exception {
                // It's OK to use null as serverPrincipal. TGT is almost
                // the first ticket for a principal and we use list.
                return Krb5Util.getTicket(
                    realCaller,
                    clientPrincipal, null, acc);
            }});
So the library uses the service ticket as the TGT when it tries to acquire a new service ticket, and the KDC rejects the request because the server name of this 'TGT' does not start with 'krbtgt'. The process can never recover because, in UGI, the re-login first checks whether there is a valid TGT, and there is one.
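The mismatch between the two lookups can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the Ticket class, the principal names, and the method names are not the JDK's or Hadoop's real types. It only assumes, as described above, that the JDK side takes the first ticket in the list while the re-login guard looks for any valid ticket whose server name starts with 'krbtgt'.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these types are the real JDK or UGI classes.
class StuckReloginSketch {

    // Hypothetical ticket reduced to its server principal and expiry time.
    static final class Ticket {
        final String server;
        final long endTimeMillis;

        Ticket(String server, long endTimeMillis) {
            this.server = server;
            this.endTimeMillis = endTimeMillis;
        }
    }

    // Position-based lookup: with a null server principal, the first ticket wins.
    static Ticket pickedAsTgt(List<Ticket> credentials) {
        return credentials.isEmpty() ? null : credentials.get(0);
    }

    // Name-based check standing in for the re-login guard:
    // is there any unexpired ticket for a krbtgt principal?
    static boolean hasValidTgt(List<Ticket> credentials, long nowMillis) {
        for (Ticket t : credentials) {
            if (t.server.startsWith("krbtgt/") && t.endTimeMillis > nowMillis) {
                return true;
            }
        }
        return false;
    }

    // Stuck: the connection path uses a non-TGT ticket as the TGT,
    // while the guard still sees a valid TGT and considers the login fine.
    static boolean isStuck(List<Ticket> credentials, long nowMillis) {
        Ticket picked = pickedAsTgt(credentials);
        boolean wrongTicketUsed = picked != null && !picked.server.startsWith("krbtgt/");
        return wrongTicketUsed && hasValidTgt(credentials, nowMillis);
    }

    public static void main(String[] args) {
        long now = 1_000L;
        // Order observed in our log: service ticket first, then the TGT.
        List<Ticket> observed = Arrays.asList(
            new Ticket("hbase/peer.example.com@EXAMPLE.COM", 2_000L),
            new Ticket("krbtgt/EXAMPLE.COM@EXAMPLE.COM", 2_000L));
        System.out.println("picked as TGT: " + pickedAsTgt(observed).server);
        System.out.println("guard sees valid TGT: " + hasValidTgt(observed, now));
        System.out.println("stuck: " + isStuck(observed, now));
    }
}
```

With the TGT in first position both lookups agree and isStuck returns false; only the swapped order produces the unrecoverable state.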
This usually happens when a secure connection is being set up at the same time as a re-login, and the end time of the service ticket indicates that it was acquired with the previous TGT. Since UGI does not prevent doAs and re-login from running concurrently, we believe there is a race condition.
After reading the code, we found a possible race condition.
See the initSecContext method in http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java: it gets the TGT first, then checks whether a service ticket already exists; if not, it acquires one using the TGT and puts it into the credentials.
In Krb5LoginModule.logout (the Sun version), the Kerberos tickets are first removed from the credentials and then destroyed.
Here is the race. Let T1 be the thread setting up the secure connection and T2 be the re-login thread.
T1: get TGT
T2: remove all tickets from credentials
T1: check service ticket, none(since all tickets have been removed)
T1: acquire a new service ticket using TGT and put it into the credentials
T2: destroy all tickets
T2: login, i.e., put a new TGT into the credentials.
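The interleaving above can be replayed deterministically on a single thread to show how the ticket order ends up swapped. This is only a sketch under simplifying assumptions: the credentials are modeled as a plain list of server principal names, the principal names are made up, and the class and method names are hypothetical rather than JDK code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative replay of the T1/T2 steps; the list stands in for the
// Subject's private credentials and is not the JDK's data structure.
class TicketOrderRace {

    static final String TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM";
    static final String SERVICE = "hbase/peer.example.com@EXAMPLE.COM";

    // Mimics a lookup with a null server principal: the first ticket wins.
    static String firstTicket(List<String> credentials) {
        return credentials.isEmpty() ? null : credentials.get(0);
    }

    // Replays the steps in the order listed above.
    static List<String> replayInterleaving() {
        List<String> credentials = new ArrayList<>();
        credentials.add(TGT);                          // state before the race: old TGT only

        String oldTgt = firstTicket(credentials);      // T1: get TGT
        List<String> removed = new ArrayList<>(credentials);
        credentials.clear();                           // T2: remove all tickets from credentials
        boolean hasService = credentials.contains(SERVICE); // T1: check service ticket, none
        if (!hasService && oldTgt != null) {
            credentials.add(SERVICE);                  // T1: acquire with old TGT, put into credentials
        }
        removed.clear();                               // T2: destroy the tickets it removed
        credentials.add(TGT);                          // T2: login, i.e., put a new TGT
        return credentials;
    }

    public static void main(String[] args) {
        List<String> credentials = replayInterleaving();
        System.out.println("credentials after the race: " + credentials);
        System.out.println("ticket that a first-ticket lookup returns: " + firstTicket(credentials));
    }
}
```

After the replay the list holds the service ticket followed by the new TGT, which matches the order we saw in the log after re-login.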
It is hard to write a unit test that reproduces the problem because the racing code is in the JDK, which we do not control.
Suggestions are welcome. Thanks.
Attachments
Issue Links
- breaks
  - HADOOP-14030 PreCommit TestKDiag failure (Resolved)
  - HADOOP-14191 Duplicate hadoop-minikdc dependency in hadoop-common module (Resolved)
- is related to
  - HADOOP-14037 client.handleSaslConnectionFailure needlessly wraps IOEs (Patch Available)
  - HADOOP-15378 Hadoop client unable to relogin because a remote DataNode has an incorrect krb5.conf (Open)
  - HADOOP-15143 NPE due to Invalid KerberosTicket in UGI (Resolved)