Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4235

Java Client SendThread does not clean up created objects during constructor of SaslClient and Login

    XMLWordPrintableJSON

Details

    Description

      Hi I am an Apache Phoenix committer and I help manage many many zookeeper clusters at my employment primarily using ZK for HBase use cases.  We recently had a production incident where some of our ACLs were not setup preventing connectivity from the client to the ZK nodes and the failure path exposed 2 issues to fix. This Jira and https://issues.apache.org/jira/browse/ZOOKEEPER-4236 .  This Jira is the more important of the 2 and handles the failure observed in that we had a FD/thread leak from the ZK java client send thread.  We had hundreds of threads per JVM with the following stack trace.

      java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketConnect(java.base@11.0.4.0.101/Native Method) at java.net.AbstractPlainSocketImpl.doConnect(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:399) - locked <0x00000015004fde20> (a java.net.SocksSocketImpl) at java.net.AbstractPlainSocketImpl.connectToAddress(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:242) at java.net.AbstractPlainSocketImpl.connect(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:224) at java.net.SocksSocketImpl.connect(java.base@11.0.4.0.101/SocksSocketImpl.java:403) at java.net.Socket.connect(java.base@11.0.4.0.101/Socket.java:609) at sun.security.krb5.internal.TCPClient.<init>(java.security.jgss@11.0.4.0.101/NetClient.java:62) at sun.security.krb5.internal.NetClient.getInstance(java.security.jgss@11.0.4.0.101/NetClient.java:42) at sun.security.krb5.KdcComm$KdcCommunication.run(java.security.jgss@11.0.4.0.101/KdcComm.java:401) at sun.security.krb5.KdcComm$KdcCommunication.run(java.security.jgss@11.0.4.0.101/KdcComm.java:364) at java.security.AccessController.doPrivileged(java.base@11.0.4.0.101/Native Method) at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:348) at sun.security.krb5.KdcComm.sendIfPossible(java.security.jgss@11.0.4.0.101/KdcComm.java:253) at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:234) at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:200) at sun.security.krb5.KrbAsReqBuilder.send(java.security.jgss@11.0.4.0.101/KrbAsReqBuilder.java:326) at sun.security.krb5.KrbAsReqBuilder.action(java.security.jgss@11.0.4.0.101/KrbAsReqBuilder.java:371) at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(jdk.security.auth@11.0.4.0.101/Krb5LoginModule.java:754) at com.sun.security.auth.module.Krb5LoginModule.login(jdk.security.auth@11.0.4.0.101/Krb5LoginModule.java:592) at javax.security.auth.login.LoginContext.invoke(java.base@11.0.4.0.101/LoginContext.java:726) at javax.security.auth.login.LoginContext$4.run(java.base@11.0.4.0.101/LoginContext.java:665) at javax.security.auth.login.LoginContext$4.run(java.base@11.0.4.0.101/LoginContext.java:663) at java.security.AccessController.doPrivileged(java.base@11.0.4.0.101/Native Method) at javax.security.auth.login.LoginContext.invokePriv(java.base@11.0.4.0.101/LoginContext.java:663) at javax.security.auth.login.LoginContext.login(java.base@11.0.4.0.101/LoginContext.java:574) at org.apache.zookeeper.Login.login(Login.java:304) - locked <0x000000151c477148> (a org.apache.zookeeper.Login) at org.apache.zookeeper.Login.<init>(Login.java:106) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:249) - locked <0x000000151c476f68> (a org.apache.zookeeper.client.ZooKeeperSaslClient) at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:141) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:972) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1031)
      

      Note that today ZooKeeperSaslClient as well as Login both allocate resources in their constructors and thus cannot be cleaned up or interrupted via close/shutdown/disconnect of their parents due to still being a null object during initialization.  This leaves the thread/sockets at the mercy of the configured kdc retry/timeout configuration.

      This Jira is intended to break the constructor and the initialization path into separate methods and properly clean up the resulting objects.

        

      Attachments

        Issue Links

          Activity

            People

              andor Andor Molnar
              dbwong Daniel Wong
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 5h 20m
                  5h 20m