Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.4.17, 3.0.0-beta-1, 2.5.8
- Labels: None
- Hadoop Flags: Reviewed
Description
Came across a couple of instances where an active master failover happened around the same time as a ZooKeeper leader failover, leaving the HBase client stuck because one of its threads was blocked on a ConnectionRegistry rpc call.
ConnectionRegistry APIs are wrapped with CompletableFuture, but their usages do not apply any timeout, which can leave the entire client stuck indefinitely because we hold global locks while waiting. For instance, getKeepAliveMasterService() takes masterLock, so if retrieving the active master from masterAddressZNode gets stuck, we block every admin operation that needs getKeepAliveMasterService().
Sample stacktrace from a hang that blocked all client operations requiring a table descriptor from Admin:
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.get
org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
org.apache.hadoop.hbase.client.MasterCallable.prepare
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
org.apache.hadoop.hbase.client.HTable.getDescriptor
org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
org.apache.phoenix.execute.MutationState.sendBatch
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.commit
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.call.CallRunner.run
org.apache.phoenix.jdbc.PhoenixConnection.commit
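A minimal sketch of the pattern behind this hang (hypothetical class and variable names, not the actual ConnectionImplementation code): an untimed CompletableFuture.get() on a registry future is invoked while a global lock is held, so a registry call that never completes blocks every other caller waiting on that lock.

import java.util.concurrent.CompletableFuture;

class RegistryBlockingSketch {
  private final Object masterLock = new Object();

  // Mirrors the shape of getKeepAliveMasterService(): the future is awaited
  // without a timeout while masterLock is held.
  Object getKeepAliveMasterService(CompletableFuture<Object> activeMasterFuture)
      throws Exception {
    synchronized (masterLock) {
      // If the ZK-backed registry call never completes (e.g. a ZooKeeper
      // leader failover coinciding with an active master failover), this
      // blocks indefinitely, and so does every caller needing masterLock.
      return activeMasterFuture.get();
    }
  }
}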
Another similar incident is captured in PHOENIX-7233. In that case, retrieving the clusterId from its ZNode got stuck, which blocked the client from creating any more HBase Connections. Stacktrace for reference:
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
We should provide a configurable timeout for all ConnectionRegistry APIs.
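A minimal sketch of what that could look like, assuming a new client configuration property (the key name and default below are hypothetical, not an existing HBase setting): the registry future is awaited with a bounded get(), and a timeout surfaces as an IOException instead of hanging the caller.

import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.conf.Configuration;

final class RegistryTimeoutSketch {
  // Hypothetical property name and default; the actual key would be decided in the patch.
  static final String REGISTRY_CALL_TIMEOUT_KEY = "hbase.client.registry.call.timeout.ms";
  static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000L;

  // Bounded wait on a ConnectionRegistry future instead of an indefinite get().
  static <T> T get(CompletableFuture<T> future, Configuration conf)
      throws IOException, InterruptedException {
    long timeoutMs = conf.getLong(REGISTRY_CALL_TIMEOUT_KEY, DEFAULT_REGISTRY_CALL_TIMEOUT_MS);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      throw new IOException("ConnectionRegistry call timed out after " + timeoutMs + " ms", e);
    } catch (ExecutionException e) {
      throw new IOException("ConnectionRegistry call failed", e.getCause());
    }
  }
}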
Attachments
Issue Links
- is related to
  - HBASE-28741 Rpc ConnectionRegistry APIs should have timeout (Open)
  - PHOENIX-7233 CQSI openConnection should timeout to unblock other connection threads (Resolved)
- links to