[HBASE-25984] FSHLog WAL lockup with sync future reuse [RS deadlock] - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
Fix Version/s: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
Component/s: regionserver, wal
Labels:
- deadlock
- hang

Release Note:

Fixes a WAL lockup caused by premature reuse of sync futures by the WAL consumers. The lockup causes the WAL system to hang, blocking appends and syncs and preventing the RPC handlers from progressing. Without this fix, the only workaround is to force-abort the region server.

Description

We use FSHLog as the WAL implementation (branch-1 based). Under heavy load, we noticed that the WAL system locks up because of a subtle race in the code that reuses sync futures. The bug applies to all FSHLog implementations across branches.

Symptoms:

On heavily loaded clusters with a large write load, we noticed region servers hanging abruptly with full handler queues and stuck MVCC, indicating that appends and syncs were not making any progress.

 WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
 WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}

All the handlers are stuck waiting for the sync futures and timing out.

 java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
    .....

Log rolling is stuck because it was unable to attain a safe point.

 java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)

Meanwhile, the ring buffer consumer thinks that there are still some outstanding syncs that need to finish.

  org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)

On the other hand, the SyncRunner threads are idle and just waiting for work, implying that there are no pending SyncFutures that need to be run.

   sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
    java.lang.Thread.run(Thread.java:748)

Overall, the WAL system is deadlocked and can make no progress until the region server is aborted. I got to the bottom of this issue and have a patch that fixes it (more details are in the comments because of the word limit on the description).
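
The combination of symptoms above (a consumer that still sees outstanding syncs while the SyncRunner threads have nothing to do) is consistent with a reusable future being reset while another thread still holds a reference to it. The following is a minimal, single-threaded sketch of that hazard only. It is not the FSHLog code: `ReusableFuture`, `hasOutstanding`, and `consumerSeesOutstandingAfterReuse` are hypothetical names invented for illustration, and the exact interleaving in the real bug is an assumption here, since the details are in the issue comments.

```java
import java.util.ArrayList;
import java.util.List;

public class SyncFutureReuseSketch {

    /** Hypothetical reusable future; NOT the real SyncFuture class. */
    static final class ReusableFuture {
        private long txid;
        private boolean done;

        /** Prepares the same object for a new sync, clearing the done flag. */
        synchronized ReusableFuture reset(long newTxid) {
            this.txid = newTxid;
            this.done = false;
            return this;
        }

        synchronized void complete() {
            this.done = true;
            notifyAll();
        }

        synchronized boolean isDone() {
            return done;
        }
    }

    /** The kind of check a consumer might run before attaining a safe point (assumed). */
    static boolean hasOutstanding(List<ReusableFuture> tracked) {
        for (ReusableFuture f : tracked) {
            if (!f.isDone()) {
                return true;
            }
        }
        return false;
    }

    static boolean consumerSeesOutstandingAfterReuse() {
        ReusableFuture cached = new ReusableFuture();       // one cached object per handler
        List<ReusableFuture> consumerBatch = new ArrayList<>();

        // The handler publishes sync #1; the consumer keeps a reference to the future.
        consumerBatch.add(cached.reset(1L));

        // Sync #1 completes, so nothing tracked by the consumer is outstanding.
        cached.complete();
        boolean before = hasOutstanding(consumerBatch);

        // The handler reuses the same object for sync #2 while the consumer
        // still holds the old reference: the finished sync now looks pending.
        cached.reset(2L);
        boolean after = hasOutstanding(consumerBatch);

        return !before && after;
    }

    public static void main(String[] args) {
        System.out.println("stale reference looks outstanding: "
            + consumerSeesOutstandingAfterReuse());
    }
}
```

In this sketch nothing ever completes the stale reference on the consumer's behalf, so a consumer that waits for `hasOutstanding` to become false would wait forever, even though no worker has a pending sync to run.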