HBase
  1. HBase
  2. HBASE-3622

Deadlock in HBaseServer (JVM bug?)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.90.1
    • Fix Version/s: 0.92.0
    • Component/s: None
    • Labels:
      None

      Description

      On Dmitriy's cluster:

      
      "IPC Reader 0 on port 60020" prio=10 tid=0x00002aacb4a82800 nid=0x3a72 waiting on condition [0x00000000429ba000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00002aaabf5fa6d0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
              at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
              at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
              at java.util.concurrent.LinkedBlockingQueue.signalNotEmpty(LinkedBlockingQueue.java:103)
              at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:267)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:985)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:946)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
              - locked <0x00002aaabf580fb0> (a org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      ...
      "IPC Server handler 29 on 60020" daemon prio=10 tid=0x00002aacbc163800 nid=0x3acc waiting on condition [0x00000000462f3000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00002aaabf5e3800> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
              at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1025)
      "IPC Server handler 28 on 60020" daemon prio=10 tid=0x00002aacbc161800 nid=0x3acb waiting on condition [0x00000000461f2000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00002aaabf5e3800> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
              at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
              at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1025
      ...
      

      This region server stayed in this state for hours. The reader is waiting to put and the handlers are waiting to take, and they wait on different lock ids. It reminds me of the UseMembar thing about the JVM sometime missing to notify waiters. In any case, that RS needed to be closed in order to get out of that state.

      1. HBASE-3622.patch
        0.7 kB
        Andrew Purtell

        Activity

        Hide
        Andrew Purtell added a comment -

        We run with -XX:+UseMembar. It is part of our internal recommended production configuration. We've benchmarked with and without using YCSB any difference is in the noise.

        Show
        Andrew Purtell added a comment - We run with -XX:+UseMembar. It is part of our internal recommended production configuration. We've benchmarked with and without using YCSB any difference is in the noise.
        Hide
        stack added a comment -

        I don't know about this. Maybe it makes sense. I think the fb folks run w/ this flag. Sounds like it should make it all slower but you say it doesn't Andy.

        Show
        stack added a comment - I don't know about this. Maybe it makes sense. I think the fb folks run w/ this flag. Sounds like it should make it all slower but you say it doesn't Andy.
        Hide
        Jonathan Gray added a comment -

        We run with +UseMembar at FB. I ran experiments on CPU-bound workloads and there was no significant difference in performance either way.

        Show
        Jonathan Gray added a comment - We run with +UseMembar at FB. I ran experiments on CPU-bound workloads and there was no significant difference in performance either way.
        Hide
        stack added a comment -

        @Andrew So sounds like it doesn't drag perf but how we know it addresses the issue? Should we do your patch commit against another issue altogether?

        Show
        stack added a comment - @Andrew So sounds like it doesn't drag perf but how we know it addresses the issue? Should we do your patch commit against another issue altogether?
        Hide
        Ted Yu added a comment -

        I noticed this:

          protected BlockingQueue<Call> callQueue; // queued calls
          protected BlockingQueue<Call> priorityCallQueue;
        

        I wonder if what values the following parameters carried in his cluster:
        hbase.regionserver.handler.count
        hbase.regionserver.metahandler.count

        Show
        Ted Yu added a comment - I noticed this: protected BlockingQueue<Call> callQueue; // queued calls protected BlockingQueue<Call> priorityCallQueue; I wonder if what values the following parameters carried in his cluster: hbase.regionserver.handler.count hbase.regionserver.metahandler.count
        Hide
        stack added a comment -

        What you thinking Ted?

        Show
        stack added a comment - What you thinking Ted?
        Hide
        Andrew Purtell added a comment -

        @Stack: Makes sense to make the UseMembar change on a separate issue, sure.

        Show
        Andrew Purtell added a comment - @Stack: Makes sense to make the UseMembar change on a separate issue, sure.
        Hide
        stack added a comment -

        @Dmitry sent me more thread dumps. This time we're in a different location. He sent me logs an hour apart. Its stuck in same place in both cases. We're here:

        1455 "IPC Server handler 2 on 60020" daemon prio=10 tid=0x0000000048d0d800 nid=0x7550 waiting on condition [0x0000000044b6e000]
        1456    java.lang.Thread.State: WAITING (parking)
        1457     at sun.misc.Unsafe.park(Native Method)
        1458     - parking to wait for  <0x00002aaac04a9d08> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        1459     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        1460     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
        1461     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
        1462     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
        1463     at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
        1464     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
        1465     at java.util.concurrent.DelayQueue.offer(DelayQueue.java:83)
        1466     at java.util.concurrent.DelayQueue.add(DelayQueue.java:71)
        1467     at org.apache.hadoop.hbase.regionserver.Leases.renewLease(Leases.java:194)
        

        A few threads are 'parking to wait for' 0x00002aaac04a9d08. This one is too:

         599 "regionserver60020.leaseChecker" daemon prio=10 tid=0x0000000048db0800 nid=0x7548 waiting on condition [0x0000000044366000]
         600    java.lang.Thread.State: WAITING (parking)
         601     at sun.misc.Unsafe.park(Native Method)
         602     - parking to wait for  <0x00002aaac04a9d08> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
         603     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
         604     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
         605     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
         606     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1971)
         607     at java.util.concurrent.DelayQueue.poll(DelayQueue.java:209)
         608     at org.apache.hadoop.hbase.regionserver.Leases.run(Leases.java:82)
        

        These are the only mentions of 0x00002aaac04a9d08 in all thread dumps.

        It sounds like this, http://bugs.sun.com/view_bug.do?bug_id=6822370. Though the original report is on sun hw, later folks chime in w/ the issue on various linux. Setting membar seems to be a workaround, or running with u18/u21 where bug is purportedly fixed.

        Show
        stack added a comment - @Dmitry sent me more thread dumps. This time we're in a different location. He sent me logs an hour apart. Its stuck in same place in both cases. We're here: 1455 "IPC Server handler 2 on 60020" daemon prio=10 tid=0x0000000048d0d800 nid=0x7550 waiting on condition [0x0000000044b6e000] 1456 java.lang. Thread .State: WAITING (parking) 1457 at sun.misc.Unsafe.park(Native Method) 1458 - parking to wait for <0x00002aaac04a9d08> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) 1459 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) 1460 at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747) 1461 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778) 1462 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114) 1463 at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186) 1464 at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262) 1465 at java.util.concurrent.DelayQueue.offer(DelayQueue.java:83) 1466 at java.util.concurrent.DelayQueue.add(DelayQueue.java:71) 1467 at org.apache.hadoop.hbase.regionserver.Leases.renewLease(Leases.java:194) A few threads are 'parking to wait for' 0x00002aaac04a9d08. This one is too: 599 "regionserver60020.leaseChecker" daemon prio=10 tid=0x0000000048db0800 nid=0x7548 waiting on condition [0x0000000044366000] 600 java.lang. Thread .State: WAITING (parking) 601 at sun.misc.Unsafe.park(Native Method) 602 - parking to wait for <0x00002aaac04a9d08> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) 603 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) 604 at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747) 605 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778) 606 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1971) 607 at java.util.concurrent.DelayQueue.poll(DelayQueue.java:209) 608 at org.apache.hadoop.hbase.regionserver.Leases.run(Leases.java:82) These are the only mentions of 0x00002aaac04a9d08 in all thread dumps. It sounds like this, http://bugs.sun.com/view_bug.do?bug_id=6822370 . Though the original report is on sun hw, later folks chime in w/ the issue on various linux. Setting membar seems to be a workaround, or running with u18/u21 where bug is purportedly fixed.
        Hide
        Dmitriy V. Ryaboy added a comment -

        We are deploying membar change, and adding monitoring for this condition; will update if it shows up again.

        Show
        Dmitriy V. Ryaboy added a comment - We are deploying membar change, and adding monitoring for this condition; will update if it shows up again.
        Hide
        stack added a comment -

        I pinged Mr D. Alls well since he added -XX:+UseMembar.

        Show
        stack added a comment - I pinged Mr D. Alls well since he added -XX:+UseMembar.
        Hide
        stack added a comment -

        Resolving as won't fix. Added FAQ to the book about hanging regionserver for no explicable reason.

        Show
        stack added a comment - Resolving as won't fix. Added FAQ to the book about hanging regionserver for no explicable reason.
        Hide
        Hudson added a comment -

        Integrated in HBase-TRUNK #1814 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/)

        Show
        Hudson added a comment - Integrated in HBase-TRUNK #1814 (See https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/ )

          People

          • Assignee:
            Unassigned
            Reporter:
            Jean-Daniel Cryans
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development