ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-126

zookeeper client close operation may block indefinitely

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.6.0
    • Component/s: java client
    • Labels:
      None

      Description

      Moving the hang issue from ZOOKEEPER-63 to here. See 63 for background and potential patch (patch_ZOOKEEPER-63.patch).

      specifically (from James):

      "I'm thinking the close() method should not wait() forever on the disconnect packet, just a closeTimeout length - say a few seconds. Afterall blocking and forcing a reconnect just to redeliver the disconnect packet seems a bit silly - when the server has to deal with clients which just have their sockets fail anyway"

      1. ZOOKEEPER-126.patch
        2 kB
        Patrick Hunt
      2. ZOOKEEPER-126.patch
        10 kB
        Matthias Spycher

        Issue Links

          Activity

          Patrick Hunt created issue -
          Patrick Hunt made changes -
          Field Original Value New Value
          Link This issue is related to ZOOKEEPER-63 [ ZOOKEEPER-63 ]
          Hide
          Patrick Hunt added a comment -

          Moved this patch from ZOOKEEPER-63 to here.

          Show
          Patrick Hunt added a comment - Moved this patch from ZOOKEEPER-63 to here.
          Patrick Hunt made changes -
          Attachment ZOOKEEPER-126.patch [ 12388954 ]
          Hide
          Mahadev konar added a comment -

          was this problem fixed with fixes to the client code in 3.0?

          Show
          Mahadev konar added a comment - was this problem fixed with fixes to the client code in 3.0?
          Mahadev konar made changes -
          Fix Version/s 3.3.0 [ 12313976 ]
          Hide
          Patrick Hunt added a comment -

          Not a blocker, pushing to 3.4.0

          Show
          Patrick Hunt added a comment - Not a blocker, pushing to 3.4.0
          Patrick Hunt made changes -
          Fix Version/s 3.4.0 [ 12314469 ]
          Fix Version/s 3.3.0 [ 12313976 ]
          Thomas Koch made changes -
          Link This issue is related to ZOOKEEPER-846 [ ZOOKEEPER-846 ]
          Mahadev konar made changes -
          Fix Version/s 3.5.0 [ 12316644 ]
          Fix Version/s 3.4.0 [ 12314469 ]
          Hide
          Matthias Spycher added a comment -

          The latest attachment works for 3.4. It uses a latch and timeout to prevent blocking – not sure if we require ZK to work on JREs without concurrency utils.

          It includes a test that introduces an awaitTermination() method to join send and event threads. This functionality is not exposed in the public API though users could subclass ZooKeeper and use it as this patch does for testing.

          Show
          Matthias Spycher added a comment - The latest attachment works for 3.4. It uses a latch and timeout to prevent blocking – not sure if we require ZK to work on JREs without concurrency utils. It includes a test that introduces an awaitTermination() method to join send and event threads. This functionality is not exposed in the public API though users could subclass ZooKeeper and use it as this patch does for testing.
          Matthias Spycher made changes -
          Attachment ZOOKEEPER-126.patch [ 12494819 ]
          Hide
          Daniel Lord added a comment -

          Hey guys, I just ran in to a cascading failure that ended up settling on this. The issue was my SendThread was killed by an NPE when trying to log some unknown exception. The line that caused the SendThread to die was:

          at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1161)

          This is using zookeeper 3.3.2. The NPE from the logging ends up killing the SendThread by an uncaught exception. Eventually the lack of a send thread causes the client to disconnect/expire. At this point the zookeeper client is close()'d. The close call will hang forever because the final packet is never sent so it is never ACK'd or notify()'d and we stall out forever.

          Should this NPE be filed as a new issue or are you all content that this timed close will solve the problem?

          Show
          Daniel Lord added a comment - Hey guys, I just ran in to a cascading failure that ended up settling on this. The issue was my SendThread was killed by an NPE when trying to log some unknown exception. The line that caused the SendThread to die was: at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1161) This is using zookeeper 3.3.2. The NPE from the logging ends up killing the SendThread by an uncaught exception. Eventually the lack of a send thread causes the client to disconnect/expire. At this point the zookeeper client is close()'d. The close call will hang forever because the final packet is never sent so it is never ACK'd or notify()'d and we stall out forever. Should this NPE be filed as a new issue or are you all content that this timed close will solve the problem?
          Hide
          Oliver Wulff added a comment -

          Why is this issue pushed to 3.5.0?

          I think I saw the same issue here with 3.3.3:
          java.lang.Object.wait(Native Method)
          java.lang.Object.wait(Object.java:485)
          org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1317)
          org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1295)
          org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:531)

          • locked org.apache.zookeeper.ZooKeeper@26827e

          The thread is blocked which impacts other threads then.

          My guess is that the network here is not very reliable and therefore we run into this blocking issue.

          IMHO, it makes perfect sense to not wait for unlimited time if closing the connection blocks.

          Show
          Oliver Wulff added a comment - Why is this issue pushed to 3.5.0? I think I saw the same issue here with 3.3.3: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:485) org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1317) org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1295) org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:531) locked org.apache.zookeeper.ZooKeeper@26827e The thread is blocked which impacts other threads then. My guess is that the network here is not very reliable and therefore we run into this blocking issue. IMHO, it makes perfect sense to not wait for unlimited time if closing the connection blocks.
          Matthias Spycher made changes -
          Link This issue blocks ZOOKEEPER-1202 [ ZOOKEEPER-1202 ]
          Hide
          Matthias Spycher added a comment -

          Incorporates the same fix and more

          Show
          Matthias Spycher added a comment - Incorporates the same fix and more
          Matthias Spycher made changes -
          Link This issue blocks ZOOKEEPER-1202 [ ZOOKEEPER-1202 ]
          Matthias Spycher made changes -
          Link This issue is part of ZOOKEEPER-1215 [ ZOOKEEPER-1215 ]
          Matthias Spycher made changes -
          Link This issue is part of ZOOKEEPER-1215 [ ZOOKEEPER-1215 ]
          Matthias Spycher made changes -
          Link This issue is part of ZOOKEEPER-1202 [ ZOOKEEPER-1202 ]
          Hide
          Matthias Spycher added a comment -

          1202 incorporates the fix for 126 and more

          Show
          Matthias Spycher added a comment - 1202 incorporates the fix for 126 and more
          Hide
          Patrick Hunt added a comment -

          @oliver I think the issue is that limiting the wait time would change the semantics of the close call. Not something we typically do in a minor release.

          However I could see us adding a close(timeout) method which provides this capability. (probably would have been better to start with that, hindsight is always 20/20)

          @daniel I think you should open a jira on that NPE issue you mentioned (if you haven't already - although I don't see it "linked" to this jira)

          Show
          Patrick Hunt added a comment - @oliver I think the issue is that limiting the wait time would change the semantics of the close call. Not something we typically do in a minor release. However I could see us adding a close(timeout) method which provides this capability. (probably would have been better to start with that, hindsight is always 20/20) @daniel I think you should open a jira on that NPE issue you mentioned (if you haven't already - although I don't see it "linked" to this jira)
          Hide
          Oliver Wulff added a comment -

          I understand your concern. Any chance to get this into 3.3.5?

          It would be great if the closeTimeout could be configured in Zookeeper config file (ex. zoo.cfg).

          Show
          Oliver Wulff added a comment - I understand your concern. Any chance to get this into 3.3.5? It would be great if the closeTimeout could be configured in Zookeeper config file (ex. zoo.cfg).
          Hide
          Patrick Hunt added a comment -

          Oliver - a system property would be less invasive. However there is then the issue of supporting backward compatibility. (3.3.5 to 3.4 to 3.5/...)

          It seems to me that we need to address the underlying cause of the problem otherwise unexpected/bad behavior might happen that we don't handle. Say the SendThread exited case, if the EventThread is still running it's likely that we'll start leaking threads. Shouldn't we fix the root issue rather than papering over it?

          Show
          Patrick Hunt added a comment - Oliver - a system property would be less invasive. However there is then the issue of supporting backward compatibility. (3.3.5 to 3.4 to 3.5/...) It seems to me that we need to address the underlying cause of the problem otherwise unexpected/bad behavior might happen that we don't handle. Say the SendThread exited case, if the EventThread is still running it's likely that we'll start leaking threads. Shouldn't we fix the root issue rather than papering over it?
          Michi Mutsuzaki made changes -
          Fix Version/s 3.6.0 [ 12326518 ]
          Fix Version/s 3.5.0 [ 12316644 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Patrick Hunt
            • Votes:
              3 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development