[HBASE-19163] "Maximum lock count exceeded" from region server's batch processing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-1, 2.0.0-alpha-3, 1.2.7
Fix Version/s: 1.4.1, 1.3.3, 2.0.0
Component/s: regionserver
Labels:
None

Release Note:

When a batch contains many mutations against the same row, each mutation acquires a shared row lock, and the total can exceed the maximum shared lock count that Java's ReentrantReadWriteLock supports (64k, i.e. 65535). Along with other optimizations, the batch is now divided into multiple minibatches where needed. A new config limits the maximum number of mutations in a minibatch.

   <property>
    <name>hbase.regionserver.minibatch.size</name>
    <value>20000</value>
   </property>
The default value is 20000.
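
To illustrate how such a cap behaves, here is a minimal sketch in plain Java that splits a batch into minibatches of at most a given size. The class and method names are hypothetical and this is not the HRegion implementation; it only shows the arithmetic of the limit, using the 20000 default from the note above and the 262836 operation count from the log below purely as an example size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not HBase code.
class MiniBatchSplitSketch {

    // Default taken from the release note for hbase.regionserver.minibatch.size.
    static final int DEFAULT_MINIBATCH_SIZE = 20000;

    // Splits a batch into consecutive minibatches holding at most
    // maxMiniBatchSize elements each. The sublists are views of the input.
    static <T> List<List<T>> split(List<T> batch, int maxMiniBatchSize) {
        if (maxMiniBatchSize <= 0) {
            throw new IllegalArgumentException("minibatch size must be positive");
        }
        List<List<T>> miniBatches = new ArrayList<>();
        for (int start = 0; start < batch.size(); start += maxMiniBatchSize) {
            int end = Math.min(start + maxMiniBatchSize, batch.size());
            miniBatches.add(batch.subList(start, end));
        }
        return miniBatches;
    }

    public static void main(String[] args) {
        List<Integer> batch = java.util.Collections.nCopies(262836, 0);
        List<List<Integer>> parts = split(batch, DEFAULT_MINIBATCH_SIZE);
        System.out.println("minibatches: " + parts.size());
    }
}
```

With these inputs the split yields 14 minibatches: 13 of 20000 mutations and a final one of 2836, so no minibatch can approach the 65535 shared lock limit even if every mutation targets the same row.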


Description

In one of our use cases, we found the following exception, and replication was stuck.

2017-10-25 19:41:17,199 WARN  [hconnection-0x28db294f-shared--pool4-t936] client.AsyncProcess: #3, table=foo, attempt=5/5 failed=262836ops, last exception: java.io.IOException: java.io.IOException: Maximum lock count exceeded
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2215)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
Caused by: java.lang.Error: Maximum lock count exceeded
        at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:528)
        at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:488)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1327)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:5163)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3018)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2877)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2819)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:753)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:715)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2148)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
        ... 3 more

While we are still examining the data pattern, it is clear that the batch contains too many mutations against the same row. This exceeds the maximum shared lock count of 64k, so an error is thrown and the whole batch fails.
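
The limit itself can be reproduced outside HBase. The following is a minimal sketch with a hypothetical class name that takes the read (shared) lock of a single `ReentrantReadWriteLock` repeatedly on one thread until the JDK throws. It uses `lock()` rather than the timed `tryLock` seen in the stack trace; both paths enforce the same maximum.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of HBase.
class SharedLockLimitDemo {

    // Acquires the read lock over and over on the calling thread and returns
    // how many acquisitions succeeded before the JDK threw an Error.
    static int countAcquisitionsUntilError() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        int acquired = 0;
        try {
            while (true) {
                lock.readLock().lock();
                acquired++;
            }
        } catch (Error e) {
            // Thrown as java.lang.Error("Maximum lock count exceeded").
            return acquired;
        } finally {
            // Release every hold that was actually taken.
            for (int i = 0; i < acquired; i++) {
                lock.readLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("read locks held before Error: "
                + countAcquisitionsUntilError());
    }
}
```

The shared hold count is kept in 16 bits of the lock state, so 65535 acquisitions succeed and the next one fails. Note that the failure is a `java.lang.Error`, not an exception, which is why it surfaces wrapped in an `IOException` in the log above.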

There are two approaches to solve this issue.

1). If the batch contains several mutations against the same row, acquire the lock once for that row instead of once per mutation.
2). Catch the error, process whatever mutations have acquired their locks so far, and loop back for the rest.
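
Approach 1 can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: the class, its per-row lock map, and the use of `String` row keys are all hypothetical stand-ins for the row lock handling in HRegion.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of approach 1; not the HRegion implementation.
class RowLockOncePerRowSketch {

    // Assumed per-row lock table, keyed by row.
    private final Map<String, ReentrantReadWriteLock> rowLocks = new HashMap<>();

    // Takes the shared lock once for each distinct row in the batch and
    // returns the held locks keyed by row, in first-seen order.
    Map<String, Lock> acquireOncePerRow(List<String> mutationRows) {
        Map<String, Lock> held = new LinkedHashMap<>();
        for (String row : mutationRows) {
            if (held.containsKey(row)) {
                continue; // an earlier mutation in this batch already locked the row
            }
            Lock readLock = rowLocks
                    .computeIfAbsent(row, r -> new ReentrantReadWriteLock())
                    .readLock();
            readLock.lock();
            held.put(row, readLock);
        }
        return held;
    }

    // Releases every lock returned by acquireOncePerRow.
    static void releaseAll(Map<String, Lock> held) {
        for (Lock lock : held.values()) {
            lock.unlock();
        }
    }

    // Number of read holds currently outstanding on a row's lock.
    int readLockCount(String row) {
        ReentrantReadWriteLock lock = rowLocks.get(row);
        return lock == null ? 0 : lock.getReadLockCount();
    }
}
```

With this scheme a batch of 100000 mutations against one row holds a single shared lock, so the hold count depends on the number of distinct rows rather than the number of mutations.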

With HBASE-17924, approach 1 seems easy to implement now.
I created this JIRA and will post updates and a patch as the investigation moves forward.

Attachments

- HBASE-19163.master.001.patch, 16/Nov/17 01:39, 5 kB, Hua Xiang
- HBASE-19163.master.002.patch, 17/Nov/17 21:15, 5 kB, Hua Xiang
- HBASE-19163.master.004.patch, 17/Nov/17 22:30, 5 kB, Hua Xiang
- HBASE-19163.master.005.patch, 20/Nov/17 08:12, 13 kB, Hua Xiang
- HBASE-19163.master.006.patch, 20/Nov/17 22:13, 17 kB, Hua Xiang
- HBASE-19163.master.007.patch, 27/Nov/17 19:14, 17 kB, Hua Xiang
- HBASE-19163.master.008.patch, 28/Nov/17 17:09, 18 kB, Hua Xiang
- HBASE-19163.master.009.patch, 30/Nov/17 00:02, 19 kB, Hua Xiang
- HBASE-19163.master.009.patch, 29/Nov/17 02:08, 19 kB, Hua Xiang
- HBASE-19163.master.010.patch, 05/Dec/17 21:19, 13 kB, Hua Xiang
- HBASE-19163-branch-1-v001.patch, 05/Jan/18 22:01, 8 kB, Hua Xiang
- HBASE-19163-branch-1-v001.patch, 05/Jan/18 01:26, 8 kB, Hua Xiang
- HBASE-19163-master-v001.patch, 15/Nov/17 23:56, 5 kB, Hua Xiang
- unittest-case.diff, 15/Nov/17 21:41, 1 kB, Hua Xiang

Issue Links

links to

Review Board (master)
