Details
Description
On a heavily loaded server mostly doing reads/scan, I saw that 90+% of handlers were BLOCKED in this fashion in thread dumps:
"RpcServer.default.FPBQ.Fifo.handler=117,queue=17,port=16020" #161 daemon prio=5 os_prio=0 tid=0x00007f748757f000 nid=0x73e9 waiting for monitor entry [0x00007f74783e0000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1674) - waiting to lock <0x00007f7647e3cc38> (a java.util.concurrent.ConcurrentHashMap$Node) at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.getOrCreateTableMeter(MetricsTableQueryMeterImpl.java:80) at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.updateTableReadQueryMeter(MetricsTableQueryMeterImpl.java:90) at org.apache.hadoop.hbase.regionserver.RegionServerTableMetrics.updateTableReadQueryMeter(RegionServerTableMetrics.java:89) at org.apache.hadoop.hbase.regionserver.MetricsRegionServer.updateReadQueryMeter(MetricsRegionServer.java:274) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6742) at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3319) - locked <0x00007f896c0165a0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3566) at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44858) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:393) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
This condition persisted for long stretches of time.
I saw it to a lesser extent on other servers with less load.
These RegionServers had 400+ Regions, a good few of which were serving scan reads; the server was doing ~1M hits a second. In this scenario, I saw the above bottleneck.
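The blocking point in the dump above is ConcurrentHashMap.computeIfAbsent, which on JDK 8 can take the bin lock even when the key is already present, so every per-row meter lookup for the same table can serialize on the same bin. Purely as illustration, here is a minimal, hypothetical sketch (MeterLookup.getOrCreate is not HBase code) of the get-before-computeIfAbsent pattern that keeps the common already-present case lock-free:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical helper showing the get-before-computeIfAbsent pattern.
// On JDK 8, ConcurrentHashMap.computeIfAbsent can lock the bin even when the
// key already exists, so a plain get() on the hot path avoids that contention.
public final class MeterLookup {
  private MeterLookup() {}

  static <K, V> V getOrCreate(ConcurrentMap<K, V> map, K key, Supplier<V> factory) {
    // Fast path: the meter almost always exists already, so a lock-free read suffices.
    V existing = map.get(key);
    if (existing != null) {
      return existing;
    }
    // Slow path: only newly seen tables pay the computeIfAbsent locking cost.
    return map.computeIfAbsent(key, k -> factory.get());
  }

  public static void main(String[] args) {
    ConcurrentMap<String, long[]> meters = new ConcurrentHashMap<>();
    long[] meter = getOrCreate(meters, "ns:table", () -> new long[1]);
    meter[0]++; // stand-in for Meter.mark()
    System.out.println("reads=" + meter[0]);
  }
}

Even with that pattern, though, the shared meter is still touched once per row, which is what the change below avoids.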
Looking at it, this bottleneck came in when the parent issue's feature was added. That feature introduced these read counts alongside the existing write counts, and the write counts are mostly batch-based. Let me do the same thing here for the reads: update the central server+table count after the scan is done rather than per invocation of #nextRaw (see the sketch below).
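As a rough illustration of that batching idea, the hypothetical sketch below counts rows locally in the scan loop and touches the shared per-table meter only once per scan RPC. RowScanner and the LongAdder meter are stand-ins, not the actual HBase types or the committed fix:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: accumulate the read count locally while scanning and
// update the shared per-table read meter once at the end of the scan, rather
// than on every nextRaw() call.
public class BatchedReadMeterSketch {

  interface RowScanner {                                // stand-in for RegionScannerImpl
    boolean nextRaw(List<String> out);
  }

  static final LongAdder TABLE_READ_METER = new LongAdder();   // shared, contended meter

  static void scan(RowScanner scanner, int maxRows) {
    List<String> results = new ArrayList<>();
    long rowsRead = 0;                                  // cheap per-request accumulator
    boolean moreRows = true;
    try {
      while (moreRows && rowsRead < maxRows) {
        moreRows = scanner.nextRaw(results);
        rowsRead++;                                     // no shared-map lookup per row
      }
    } finally {
      if (rowsRead > 0) {
        TABLE_READ_METER.add(rowsRead);                 // single shared update per scan RPC
      }
    }
  }

  public static void main(String[] args) {
    RowScanner fake = out -> { out.add("row"); return out.size() < 5; };
    scan(fake, 100);
    System.out.println("meter=" + TABLE_READ_METER.sum());
  }
}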
Attachments
Issue Links
- breaks: HBASE-26013 Get operations readRows metrics becomes zero after HBASE-25677 (Resolved)
- links to