Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-13971

Flushes stuck since 6 hours on a regionserver.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 1.3.0
    • 1.2.0, 1.3.0, 2.0.0
    • regionserver
    • None
    • Caused while running IntegrationTestLoadAndVerify for 20 M rows on cluster with 32 region servers each with max heap size of 24GBs.

    • Reviewed

    Description

      One region server stuck while flushing(possible deadlock). Its trying to flush two regions since last 6 hours (see the screenshot).
      Caused while running IntegrationTestLoadAndVerify for 20 M rows with 600 mapper jobs and 100 back references. ~37 Million writes on each regionserver till now but no writes happening on any regionserver from past 6 hours and their memstore size is zero(I dont know if this is related). But this particular regionserver has memstore size of 9GBs from past 6 hours.

      Relevant snaps from debug dump:
      Tasks:
      ===========================================================
      Task: Flushing IntegrationTestLoadAndVerify,R\x9B\x1B\xBF\xAE\x08\xD1\xA2,1435179555993.8e2d075f94ce7699f416ec4ced9873cd.
      Status: RUNNING:Preparing to flush by snapshotting stores in 8e2d075f94ce7699f416ec4ced9873cd
      Running for 22034s

      Task: Flushing IntegrationTestLoadAndVerify,\x93\xA385\x81Z\x11\xE6,1435179555993.9f8d0e01a40405b835bf6e5a22a86390.
      Status: RUNNING:Preparing to flush by snapshotting stores in 9f8d0e01a40405b835bf6e5a22a86390
      Running for 22033s

      Executors:
      ===========================================================
      ...
      Thread 139 (MemStoreFlusher.1):
      State: WAITING
      Blocked count: 139711
      Waited count: 239212
      Waiting on java.util.concurrent.CountDownLatch$Sync@b9c094a
      Stack:
      sun.misc.Unsafe.park(Native Method)
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
      java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
      org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
      org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
      org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
      org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
      org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
      java.lang.Thread.run(Thread.java:745)
      Thread 137 (MemStoreFlusher.0):
      State: WAITING
      Blocked count: 138931
      Waited count: 237448
      Waiting on java.util.concurrent.CountDownLatch$Sync@53f41f76
      Stack:
      sun.misc.Unsafe.park(Native Method)
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
      java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
      java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
      org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
      org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
      org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
      org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
      org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
      org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
      org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
      java.lang.Thread.run(Thread.java:745)

      Attachments

        1. screenshot-1.png
          119 kB
          Abhilash
        2. rsDebugDump.txt
          249 kB
          Abhilash
        3. jstack.4
          146 kB
          Elliott Neil Clark
        4. jstack.5
          146 kB
          Elliott Neil Clark
        5. jstack.1
          146 kB
          Elliott Neil Clark
        6. jstack.2
          146 kB
          Elliott Neil Clark
        7. jstack.3
          146 kB
          Elliott Neil Clark
        8. 13971-v1.txt
          4 kB
          Ted Yu
        9. 13971-v1.txt
          4 kB
          Ted Yu
        10. 13971-v1.txt
          4 kB
          Ted Yu

        Issue Links

          Activity

            People

              yuzhihong@gmail.com Ted Yu
              abhilak Abhilash
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: