HBase / HBASE-6309

[MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Abandoned
    • Affects Version/s: 0.92.1, 0.94.0, 0.95.2
    • Fix Version/s: None
    • Component/s: MTTR
    • Labels: None

    Description

      We found this issue during the leap second cataclysm which prompted a distributed splitting of all our logs.

      After some time I saw that none of the RSs were splitting, while the master showed it wasn't even 30% done. jstack'ing the master, I saw this:

      "main-EventThread" daemon prio=10 tid=0x00007f6ce46d8800 nid=0x5376 in
      Object.wait() [0x00007f6ce2ecb000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:485)
              at org.apache.hadoop.ipc.Client.call(Client.java:1093)
              - locked <0x00000005fdd661a0> (a org.apache.hadoop.ipc.Client$Call)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
              at $Proxy9.rename(Unknown Source)
              at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
              at $Proxy9.rename(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:759)
              at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:253)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:553)
              at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:519)
              at org.apache.hadoop.hbase.master.SplitLogManager$1.finish(SplitLogManager.java:138)
              at org.apache.hadoop.hbase.master.SplitLogManager.getDataSetWatchSuccess(SplitLogManager.java:431)
              at org.apache.hadoop.hbase.master.SplitLogManager.access$1200(SplitLogManager.java:95)
              at org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback.processResult(SplitLogManager.java:1011)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:571)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
      

      We are effectively bottlenecked on doing NN operations, and whatever else happens in GetDataAsyncCallback, on a single thread. It was so bad that on our 100-node offline cluster it took the master a few hours to process all the incoming ZK events, while the actual splitting took a fraction of that time.

      I'm marking this as critical and against 0.96, but depending on how involved the fix is we may want to backport it.
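      The stack trace shows why everything serializes: ZooKeeper delivers all async callbacks on a single EventThread, so a blocking NN rename inside GetDataAsyncCallback.processResult stalls every other pending event. The direction the issue title suggests is to hand the blocking filesystem work off to a separate executor so the callback returns immediately. A minimal sketch of that pattern, assuming illustrative names (this is not HBase's actual SplitLogManager code):

      ```java
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      public class CallbackOffload {
          // Blocking work queued here instead of running on the single ZK EventThread.
          private final ExecutorService splitLogPool = Executors.newFixedThreadPool(4);

          // Invoked on the ZK EventThread (stand-in for GetDataAsyncCallback.processResult).
          // Returns immediately; the slow NN operation runs on the pool.
          public Future<?> processResult(String taskPath) {
              return splitLogPool.submit(() -> finishTask(taskPath));
          }

          // Stand-in for the finish() work that calls fs.rename() against the NameNode
          // (the moveRecoveredEditsFromTemp step in the trace above).
          private void finishTask(String taskPath) {
              try {
                  Thread.sleep(10); // simulate a slow NameNode RPC
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }

          public void shutdown() {
              splitLogPool.shutdown();
          }
      }
      ```

      With this shape the EventThread only enqueues work, so ZK events keep flowing even while renames are in flight; ordering guarantees per task would still need care in a real fix.
      
      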


      People

        Assignee: Unassigned
        Reporter: Jean-Daniel Cryans (jdcryans)
        Votes: 0
        Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: