ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-846

zookeeper client doesn't shut down cleanly on the close call

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.2
    • Fix Version/s: 3.3.2, 3.4.0
    • Component/s: java client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Using HBase 0.20.6 (with HBASE-2473) we encountered a situation where Regionserver
      process was shutting down and seemed to hang.

      Here is the bottom of region server log:
      http://pastebin.com/YYawJ4jA

      zookeeper-3.2.2 is used.

      Here is relevant portion from jstack - I attempted to attach jstack twice in my email to dev@hbase.apache.org but failed:

      "DestroyJavaVM" prio=10 tid=0x00002aabb849c800 nid=0x6c60 waiting on condition [0x0000000000000000]
      java.lang.Thread.State: RUNNABLE

      "regionserver/10.32.42.245:60020" prio=10 tid=0x00002aabb84ce000 nid=0x6c81 in Object.wait() [0x0000000043755000]
      java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)

      • waiting on <0x00002aaab76633c0> (a org.apache.zookeeper.ClientCnxn$Packet)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1099)
      • locked <0x00002aaab76633c0> (a org.apache.zookeeper.ClientCnxn$Packet)
        at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1077)
        at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:505)
      • locked <0x00002aaabf5e0c30> (a org.apache.zookeeper.ZooKeeper)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.close(ZooKeeperWrapper.java:681)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:654)
        at java.lang.Thread.run(Thread.java:619)

      "main-EventThread" daemon prio=10 tid=0x0000000043474000 nid=0x6c80 waiting on condition [0x00000000413f3000]
      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x00002aaabf6e9150> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:414)
      1. ZOOKEEPER-846.patch
        1 kB
        Patrick Hunt
      2. rs-13.stack
        10 kB
        Ted Yu

        Issue Links

          Activity

          Hide
          Ted Yu added a comment -

          jstack for Region Server

          Show
          Ted Yu added a comment - jstack for Region Server
          Hide
          Mahadev konar added a comment -

          We should fix this in 3.3.2 if this still exists.

          Show
          Mahadev konar added a comment - We should fix this in 3.3.2 if this still exists.
          Hide
          Patrick Hunt added a comment -

          might be related to HBASE-2966, upping priority

          Show
          Patrick Hunt added a comment - might be related to HBASE-2966 , upping priority
          Hide
          Patrick Hunt added a comment -

          I believe I found the issue. Both this specific jira and HBASE-2966.

          During session close we set the "closing" flag, then later set the ZK state to CLOSED. zk client operations track outstanding
          requests to the server on the "outstandingqueue". There's a timing window where we can clear the outstanding
          queue (caused by the "closing" flag being set AND an error occurs on the socket) and exit the sendthread. Subsequent
          to this if an operation runs before the ZK state is set to CLOSED it will queue additional packets to the outstandingqueue.
          These packets will never be processed, and as a result the original operation will never complete.

          in queuepacket we need to check if closing is set before queueing a packet (we currently check the state only)

          I'm working on a test/fix.

          Show
          Patrick Hunt added a comment - I believe I found the issue. Both this specific jira and HBASE-2966 . During session close we set the "closing" flag, then later set the ZK state to CLOSED. zk client operations track outstanding requests to the server on the "outstandingqueue". There's a timing window where we can clear the outstanding queue (caused by the "closing" flag being set AND an error occurs on the socket) and exit the sendthread. Subsequent to this if an operation runs before the ZK state is set to CLOSED it will queue additional packets to the outstandingqueue. These packets will never be processed, and as a result the original operation will never complete. in queuepacket we need to check if closing is set before queueing a packet (we currently check the state only) I'm working on a test/fix.
          Hide
          Thomas Koch added a comment -

          ZOOKEEPER-126 and ZOOKEEPER-846 both have the same cause.

          Show
          Thomas Koch added a comment - ZOOKEEPER-126 and ZOOKEEPER-846 both have the same cause.
          Hide
          Patrick Hunt added a comment -

          This patch fixes the problem by adding a check for close in queuepacket.

          Note that this also moves the close=true into queuepacket (otw the closeSession would never be queued during a close.

          I spend > 1 day trying to craft a test for this, unfortunately due to the small size of the timing window I wasn't able to get a test that failed with this case. However by inspection it seems to address a clear bug. All tests are passing.

          Show
          Patrick Hunt added a comment - This patch fixes the problem by adding a check for close in queuepacket. Note that this also moves the close=true into queuepacket (otw the closeSession would never be queued during a close. I spend > 1 day trying to craft a test for this, unfortunately due to the small size of the timing window I wasn't able to get a test that failed with this case. However by inspection it seems to address a clear bug. All tests are passing.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12454491/ZOOKEEPER-846.patch
          against trunk revision 997192.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12454491/ZOOKEEPER-846.patch against trunk revision 997192. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/111/console This message is automatically generated.
          Hide
          Benjamin Reed added a comment -

          +1 looks good pat! it's nice that the checking and setting of closing is in the same routine. i agreed about skipping the test case.

          Show
          Benjamin Reed added a comment - +1 looks good pat! it's nice that the checking and setting of closing is in the same routine. i agreed about skipping the test case.
          Hide
          Patrick Hunt added a comment -

          committed to trunk and branch33

          Show
          Patrick Hunt added a comment - committed to trunk and branch33
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #944 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/944/)
          ZOOKEEPER-846. zookeeper client doesn't shut down cleanly on the close call

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #944 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/944/ ) ZOOKEEPER-846 . zookeeper client doesn't shut down cleanly on the close call

            People

            • Assignee:
              Patrick Hunt
              Reporter:
              Ted Yu
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development