HBASE-28428: Zookeeper ConnectionRegistry APIs should have timeout


Details

    • Reviewed

    Description

      Came across a couple of instances where an active master failover happened around the same time as a Zookeeper leader failover, leaving the HBase client stuck because one of its threads was blocked on a ConnectionRegistry rpc call.
      ConnectionRegistry APIs are wrapped with CompletableFuture, but their usages do not apply any timeout, which can leave the entire client stuck indefinitely because we take some global locks while waiting. For instance, getKeepAliveMasterService() takes
      masterLock, so if reading the active master from masterAddressZNode gets stuck, every admin operation that needs getKeepAliveMasterService() is blocked.
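      To illustrate the failure mode, here is a minimal sketch of the blocking pattern, not the actual ConnectionImplementation code; the Registry interface and method names below are simplified stand-ins. The registry call returns a CompletableFuture, but the caller blocks on it with an unbounded get() while holding a global lock:

      import java.io.IOException;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ExecutionException;

      public class BlockingRegistryExample {

        /** Simplified stand-in for ConnectionRegistry#getActiveMaster(). */
        interface Registry {
          CompletableFuture<String> getActiveMaster();
        }

        private final Object masterLock = new Object();
        private final Registry registry;

        BlockingRegistryExample(Registry registry) {
          this.registry = registry;
        }

        String getActiveMasterBlocking() throws IOException {
          synchronized (masterLock) {   // global lock, as in getKeepAliveMasterService()
            try {
              // No timeout here: if the ZK read behind this future never completes,
              // every thread that needs the master stub queues up behind masterLock.
              return registry.getActiveMaster().get();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              throw new IOException(e);
            } catch (ExecutionException e) {
              throw new IOException(e.getCause());
            }
          }
        }
      }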
       
      Sample stack trace that blocked all client operations requiring a table descriptor from Admin:

      jdk.internal.misc.Unsafe.park
      java.util.concurrent.locks.LockSupport.park
      java.util.concurrent.CompletableFuture$Signaller.block
      java.util.concurrent.ForkJoinPool.managedBlock
      java.util.concurrent.CompletableFuture.waitingGet
      java.util.concurrent.CompletableFuture.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
      org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
      org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
      org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
      org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
      org.apache.hadoop.hbase.client.MasterCallable.prepare
      org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
      org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
      org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
      org.apache.hadoop.hbase.client.HTable.getDescriptor
      org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
      org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
      org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
      org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
      org.apache.phoenix.execute.MutationState.sendBatch
      org.apache.phoenix.execute.MutationState.send
      org.apache.phoenix.execute.MutationState.send
      org.apache.phoenix.execute.MutationState.commit
      org.apache.phoenix.jdbc.PhoenixConnection$?.call
      org.apache.phoenix.jdbc.PhoenixConnection$?.call
      org.apache.phoenix.call.CallRunner.run
      org.apache.phoenix.jdbc.PhoenixConnection.commit 

      Another similar incident is captured in PHOENIX-7233. In that case, retrieving the clusterId from its ZNode got stuck, which blocked the client from creating any new HBase Connection. Stack trace for reference:

      jdk.internal.misc.Unsafe.park
      java.util.concurrent.locks.LockSupport.park
      java.util.concurrent.CompletableFuture$Signaller.block
      java.util.concurrent.ForkJoinPool.managedBlock
      java.util.concurrent.CompletableFuture.waitingGet
      java.util.concurrent.CompletableFuture.get
      org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
      org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
      jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
      jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
      jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
      java.lang.reflect.Constructor.newInstance
      org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
      org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
      java.security.AccessController.doPrivileged
      javax.security.auth.Subject.doAs
      org.apache.hadoop.security.UserGroupInformation.doAs
      org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
      org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
      org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
      org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
      org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
      org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
      org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
      org.apache.phoenix.util.PhoenixContextExecutor.call
      org.apache.phoenix.query.ConnectionQueryServicesImpl.init
      org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
      org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
      org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
      org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
      org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
      org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
      org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply  

      We should provide a configurable timeout for all ConnectionRegistry APIs.
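      A minimal sketch of how such a bounded wait could look, assuming a Hadoop Configuration is available at the call site; the configuration key, default value, and helper name below are illustrative only and not necessarily what the eventual patch will use:

      import java.io.IOException;
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;
      import org.apache.hadoop.conf.Configuration;

      public final class RegistryTimeoutExample {

        // Hypothetical key/default; the real patch may choose different names.
        static final String REGISTRY_CALL_TIMEOUT_KEY = "hbase.client.registry.call.timeout.ms";
        static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000L;

        private RegistryTimeoutExample() {
        }

        /** Waits on a registry future for at most the configured time instead of indefinitely. */
        static <T> T getWithTimeout(CompletableFuture<T> future, Configuration conf)
            throws IOException {
          long timeoutMs = conf.getLong(REGISTRY_CALL_TIMEOUT_KEY, DEFAULT_REGISTRY_CALL_TIMEOUT_MS);
          try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
          } catch (TimeoutException e) {
            // Fail the call (and let the caller release any lock it holds) rather than
            // parking forever inside CompletableFuture.get().
            throw new IOException("ConnectionRegistry call timed out after " + timeoutMs + " ms", e);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting on ConnectionRegistry call", e);
          } catch (ExecutionException e) {
            throw new IOException(e.getCause());
          }
        }
      }

      With a helper along these lines, callers such as getKeepAliveMasterService() and retrieveClusterId() would surface an exception after the configured timeout instead of holding masterLock or blocking connection creation indefinitely.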

            People

              Assignee: divneet18 Divneet Kaur
              Reporter: vjasani Viraj Jasani