HBASE-25984: FSHLog WAL lockup with sync future reuse [RS deadlock]

Details

      Fixes a WAL lockup issue caused by premature reuse of sync futures by the WAL consumers. The lockup makes the WAL system hang, blocking appends and syncs and thus preventing the RPC handlers from making progress. The only workaround without this fix is to force-abort the region server.

    Description

      We use FSHLog as the WAL implementation (branch-1 based), and under heavy load we noticed that the WAL system gets locked up due to a subtle bug involving racy reuse of sync futures. This bug applies to FSHLog implementations across all branches.
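
      To make the race concrete, below is a minimal hypothetical sketch of the reuse pattern (invented class name, not the actual FSHLog/SyncFuture code): each handler keeps a reusable future, re-arms it per sync, hands it to the WAL consumer, and then blocks on it. If the future is reset for a new transaction while a consumer still holds the reference from the previous one, the completion can be lost or mispaired and the waiter never wakes up.

        // Hypothetical, simplified stand-in for a reusable per-handler sync future
        // (not the real HBase SyncFuture): reset() re-arms it for a new txid,
        // done() completes it, get() blocks the caller until completion.
        final class ReusableSyncFuture {
          private long txid;      // transaction this future currently tracks
          private long doneTxid;  // transaction it completed for; 0 = still pending

          synchronized ReusableSyncFuture reset(long txid) {
            this.txid = txid;
            this.doneTxid = 0;    // racy if a consumer still holds the old reference
            return this;
          }

          synchronized long getTxid() {
            return txid;          // consumers use this to decide when to complete us
          }

          synchronized void done(long doneTxid) {
            this.doneTxid = doneTxid;
            notifyAll();          // wake the handler blocked in get()
          }

          synchronized long get() throws InterruptedException {
            while (doneTxid == 0) {
              wait();             // a lost or mispaired done() parks the handler forever
            }
            return doneTxid;
          }
        }

      A single lost completion on such a reused future is enough to leave its handler parked indefinitely, which is exactly what the symptoms below show.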

      Symptoms:

      On heavily loaded clusters with a large write load, we noticed that region servers hang abruptly with filled-up handler queues and stuck MVCC, indicating that appends/syncs are not making any progress.

       WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
       WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}
      

      All the handlers are stuck waiting for the sync futures and timing out.

       java.lang.Object.wait(Native Method)
          org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
          .....
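
      The handler-side pattern behind this trace looks roughly like the following (hypothetical names, building on the ReusableSyncFuture sketch above; the real path goes through FSHLog.append/blockOnSync): every handler publishes its edit and then blocks on its own cached, reused future, so a single future that is never completed shows up as a handler parked in Object.wait().

        // Hypothetical handler-side write path (not the actual FSHLog code):
        // each handler thread caches one ReusableSyncFuture and reuses it for
        // every sync it issues.
        final class HandlerSideSketch {
          private final ThreadLocal<ReusableSyncFuture> cachedFuture =
              ThreadLocal.withInitial(ReusableSyncFuture::new);

          long appendAndSync(long txid) throws InterruptedException {
            // Re-arm this handler's cached future for the new transaction. If the
            // WAL consumer still holds it from the previous sync, this reset races
            // with that use -- the premature reuse this issue is about.
            ReusableSyncFuture future = cachedFuture.get().reset(txid);
            // ... publish the append and the future to the ring buffer here ...
            return future.get();  // compare FSHLog.blockOnSync / SyncFuture.get above
          }
        }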
      

      Log rolling is stuck because it is unable to attain a safe point:

          java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
      

      and the ring buffer consumer thinks that there are still outstanding syncs that need to finish:

        org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
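
      The roller's trace and the consumer's trace are the two halves of the deadlock: the roller is waiting for the consumer to acknowledge the safe point, and the consumer refuses to acknowledge it while it believes sync futures are still outstanding. A hypothetical sketch of that handshake (invented names, not the actual SafePointZigZagLatch code):

        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.atomic.AtomicInteger;

        final class SafePointSketch {
          private final CountDownLatch safePointAttained = new CountDownLatch(1);
          private final AtomicInteger outstandingSyncs = new AtomicInteger();

          // Consumer bookkeeping: incremented when a sync future is taken off the
          // ring buffer, decremented when a sync runner completes it.
          void onSyncQueued()    { outstandingSyncs.incrementAndGet(); }
          void onSyncCompleted() { outstandingSyncs.decrementAndGet(); }

          // Log-roller side (compare FSHLog.replaceWriter -> waitSafePoint): blocks
          // until the ring buffer consumer acknowledges the safe point.
          void waitSafePoint() throws InterruptedException {
            safePointAttained.await();  // the roller is parked here in the trace above
          }

          // Ring-buffer-consumer side (compare RingBufferEventHandler.attainSafePoint):
          // acknowledges only once it believes no syncs are outstanding. If the count
          // includes futures that were prematurely reused and will never be completed,
          // it never reaches zero, the latch is never counted down, and the roller and
          // the consumer wait on each other indefinitely.
          void maybeAttainSafePoint() {
            if (outstandingSyncs.get() == 0) {
              safePointAttained.countDown();
            }
            // otherwise keep consuming ring buffer events and re-check later
          }
        }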
      

      On the other hand, the SyncRunner threads are idle and just waiting for work, implying that there are no pending SyncFutures that need to be run:

         sun.misc.Unsafe.park(Native Method)
          java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
          java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
          org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
          java.lang.Thread.run(Thread.java:748)
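
      A rough sketch of why idle SyncRunners are the smoking gun (hypothetical names, reusing the ReusableSyncFuture type from the sketch above, not the actual FSHLog.SyncRunner): a runner only completes the futures it actually receives on its queue, so any future the consumer is still counting but never handed over is never completed, and the handlers above wait on it forever.

        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;

        final class SyncRunnerSketch implements Runnable {
          private final BlockingQueue<ReusableSyncFuture> queue = new LinkedBlockingQueue<>();

          void offer(ReusableSyncFuture future) {
            queue.add(future);  // the ring buffer consumer hands sync work to the runner
          }

          @Override
          public void run() {
            try {
              while (!Thread.currentThread().isInterrupted()) {
                // Parks in take() when the queue is empty -- the idle trace above.
                ReusableSyncFuture future = queue.take();
                // ... issue the actual filesystem sync here, then ...
                future.done(future.getTxid());  // wake the handler blocked in get()
              }
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }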
      

      Overall, the WAL system is deadlocked and could make no progress until it was aborted. I got to the bottom of this issue and have a patch that fixes it (more details in the comments due to the word limit on the description).

      Attachments

        1. HBASE-25984-unit-test.patch
          6 kB
          Bharath Vissapragada


            People

              bharathv Bharath Vissapragada
              bharathv Bharath Vissapragada
              Votes:
              0 Vote for this issue
              Watchers:
              18 Start watching this issue
