Details
- Improvement
- Status: Resolved
- Major
- Resolution: Won't Fix
- 5.1.3
- None
- None
- None
Description
PhoenixDriver initializes ConnectionQueryServices objects and caches them in connectionQueryServicesCache. As part of ConnectionQueryServicesImpl (CQSI) initialization, a connection to the HBase cluster is opened through the ConnectionFactory provided by the HBase client, which returns a Connection object. The Connection object lets clients share the Zookeeper connection, the meta cache, and the remote connections to the regionserver and master daemons. It is used to perform table CRUD operations as well as administrative actions on the cluster.
Initializing the HBase Connection object requires the ClusterId, which is maintained in Zookeeper, in the Master daemons, or in both. The client retrieves it from one or the other depending on whether it is configured to use ZKConnectionRegistry or MasterRegistry/RpcConnectionRegistry.
For ZKConnectionRegistry, we ran into an edge case in which the connection to the Zookeeper server was stuck for more than 12 hours. While the client was creating a connection to the Zookeeper quorum to retrieve the ClusterId, the Zookeeper leader switched from one server to another. Why the leader switch resulted in a stuck connection still requires RCA, but it is not appropriate for the Phoenix/HBase client to wait indefinitely for a response from Zookeeper without any connection timeout.
For the Phoenix client, if one thread is stuck opening a connection during CQSI#init, all other threads trying to create connections also get stuck, because we take a class-level lock before opening the connection. This can lead to termination or degradation of the client JVM.
While the HBase client should also use a timeout, not having one on the Phoenix client side has far worse consequences. As part of this Jira, we should introduce a way for CQSI#openConnection to time out, either by using the CompletableFuture API or by using our preconfigured thread pool.
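A minimal sketch of the thread-pool variant of this idea, using only JDK classes. The class name OpenConnectionTimeoutSketch, the helper callWithTimeout, the pool, and the timeout values are all hypothetical and are not existing Phoenix or HBase APIs. In a real change the Callable would presumably wrap the blocking ConnectionFactory.createConnection call, and the timeout would come from configuration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Phoenix code: bounds a blocking "open connection" call.
final class OpenConnectionTimeoutSketch {

    // Runs the blocking opener on the given pool and waits at most timeoutMs for it.
    static <T> T callWithTimeout(Callable<T> opener, ExecutorService pool, long timeoutMs)
            throws Exception {
        Future<T> future = pool.submit(opener);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so it does not stay parked forever.
            future.cancel(true);
            throw new IOException(
                "Timed out after " + timeoutMs + " ms while opening connection", e);
        } catch (ExecutionException e) {
            // Surface the opener's own failure instead of the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for an opener that returns promptly.
            System.out.println(callWithTimeout(() -> "connected", pool, 1_000));

            // Stand-in for an opener that is stuck waiting on a remote response.
            try {
                callWithTimeout(() -> {
                    Thread.sleep(60_000);
                    return "late";
                }, pool, 200);
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller regains control after the timeout, so it can release the class-level lock and fail the connection attempt instead of blocking every other thread. Cancelling with interruption only frees the worker if the underlying call responds to interrupts. The stack trace in this issue shows the thread parked in CompletableFuture.get, which is interruptible, but whether the HBase client then cleans up properly is an assumption that would need to be verified.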
Stacktrace for reference:
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
Issue Links
- relates to HBASE-28428 Zookeeper ConnectionRegistry APIs should have timeout (Resolved)