Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
Description
We use FSHLog as the WAL implementation (branch-1 based), and under heavy load we noticed that the WAL system locks up because of a subtle race in the code that reuses sync futures. The bug applies to all FSHLog implementations across branches.
Symptoms:
On heavily loaded clusters with a large write load, we noticed that region servers hang abruptly with full handler queues and stuck MVCC, indicating that appends and syncs are not making any progress.
WARN [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
WARN [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}
All the handlers are stuck waiting for the sync futures and timing out.
java.lang.Object.wait(Native Method)
org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
.....
Log rolling is stuck because it was unable to attain a safe point:
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
The ring buffer consumer, meanwhile, thinks that there are still outstanding syncs that need to finish:
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
On the other hand, the SyncRunner threads are idle and just waiting for work, implying that there are no pending SyncFutures that need to be run:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
java.lang.Thread.run(Thread.java:748)
Overall, the WAL system is deadlocked and can make no progress until the region server is aborted. I got to the bottom of this issue and have a patch that fixes it (more details are in the comments because of the word limit on the description).
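To make the symptom pattern above more concrete, here is a minimal, deterministic sketch of how reusing a future object can leave one party believing a sync is outstanding while nothing is actually pending. This is an illustration only, not the HBase code and not necessarily the exact interleaving behind this bug (that analysis is in the comments). The class `ReusableFuture`, its methods, and the step ordering in `simulate()` are all hypothetical names and assumptions made for this example.

```java
import java.util.concurrent.TimeoutException;

final class SyncFutureReuseSketch {

    /** Hypothetical reusable future; NOT the real HBase SyncFuture. */
    static final class ReusableFuture {
        private long txid;
        private boolean done;

        /** Re-arms the same object for a new sync request. */
        synchronized ReusableFuture reset(long newTxid) {
            txid = newTxid;
            done = false;
            return this;
        }

        /** Called by the sync side once everything up to syncedTxid is durable. */
        synchronized void markDone(long syncedTxid) {
            if (syncedTxid >= txid) {
                done = true;
                notifyAll();
            }
        }

        synchronized boolean isDone() {
            return done;
        }

        /** Blocks until done, or throws once timeoutMs has elapsed. */
        synchronized void get(long timeoutMs) throws InterruptedException, TimeoutException {
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (!done) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    throw new TimeoutException("sync timed out for txid " + txid);
                }
                wait(remainingMs);
            }
        }
    }

    /** Replays one assumed interleaving on a single thread; returns the consumer's view. */
    static boolean simulate() {
        ReusableFuture cached = new ReusableFuture();

        // 1. A handler asks for a sync of txid 1; the consumer keeps a reference.
        ReusableFuture heldByConsumer = cached.reset(1);

        // 2. The handler gives up waiting (the sync is slow under load).
        try {
            cached.get(10);
        } catch (TimeoutException expected) {
            // timed out, as the handlers in the report do
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // 3. The handler reuses the SAME object for its next request, txid 2.
        cached.reset(2);

        // 4. The sync for txid 1 completes late, but the object now expects txid 2.
        heldByConsumer.markDone(1);

        // 5. The consumer still sees "not done", although no sync work is queued.
        return !heldByConsumer.isDone();
    }

    public static void main(String[] args) {
        System.out.println("consumer still sees an outstanding sync: " + simulate());
    }
}
```

In this sketch the stale reference never becomes done, which mirrors the reported state: one side waits on outstanding syncs while the side that would complete them has nothing left to run. Handing out a fresh future per request, or never resetting one that another thread may still hold, would avoid this particular interleaving.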
Attachments
Issue Links
- is related to: HBASE-21228 Memory leak since AbstractFSWAL caches Thread object and never clean later (Resolved)
- links to