ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1100

Killed (or missing) SendThread will cause hanging threads

    Details

    • Hadoop Flags:
      Incompatible change

      Description

      After investigating an issues with hanging threads I noticed that any java.lang.Error might silently kill the SendThread. Without a SendThread any thread that wants to send something will hang forever.

      Currently nobody will recognize a SendThread that died. I think at least a state should be flipped (or flag should be set) that causes all further send attempts to fail or to re-spin the connection loop.

      1. ZOOKEEPER-1100.patch
        1 kB
        Rakesh R
      2. ZOOKEEPER-1100.patch
        3 kB
        Camille Fournier

        Issue Links

          Activity

          Hide
          Rakesh R added a comment -

          Hi,

          Approach: Added the 'finally block' for the SendThread, this will do the cleanup activities and do graceful shutdown. So all the successive operations will get ConectionLossException and will not be hanged forever.

          I have attached the patch, please review.

          Thanks in advance,
          Rakesh

          Show
          Rakesh R added a comment - Hi, Approach: Added the 'finally block' for the SendThread, this will do the cleanup activities and do graceful shutdown. So all the successive operations will get ConectionLossException and will not be hanged forever. I have attached the patch, please review. Thanks in advance, Rakesh
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12496627/ZOOKEEPER-1100.patch
          against trunk revision 1176159.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12496627/ZOOKEEPER-1100.patch against trunk revision 1176159. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/591//console This message is automatically generated.
          Hide
          Rakesh R added a comment -

          I failed to write a test case for this Error scenario. Manually verified the patch and causes no test failures. Any help would be appreciated.

          Thanks,
          Rakesh

          Show
          Rakesh R added a comment - I failed to write a test case for this Error scenario. Manually verified the patch and causes no test failures. Any help would be appreciated. Thanks, Rakesh
          Hide
          Camille Fournier added a comment -

          I'm reviewing this issue. Can I get some clarity? Is the issue that you get a runtime exception outside of the try block after while (state.isAlive()) so the thread dies and hangs? Why put the try block there instead of around the entire method?

          Show
          Camille Fournier added a comment - I'm reviewing this issue. Can I get some clarity? Is the issue that you get a runtime exception outside of the try block after while (state.isAlive()) so the thread dies and hangs? Why put the try block there instead of around the entire method?
          Hide
          Camille Fournier added a comment -

          More to the point, are you expecting just a watcher event for this? As it stands, if your send thread dies you will still have send requests hang even with a cleanup call because the state doesn't change to anything but CONNECTING. If just getting a watch event and notification on pending send requests is fine, then I think we can work with this.

          Show
          Camille Fournier added a comment - More to the point, are you expecting just a watcher event for this? As it stands, if your send thread dies you will still have send requests hang even with a cleanup call because the state doesn't change to anything but CONNECTING. If just getting a watch event and notification on pending send requests is fine, then I think we can work with this.
          Hide
          Camille Fournier added a comment -

          example test using mockito. Minus of course the mockito jar that would maek this work and the fixes it may or may not be testing.

          Show
          Camille Fournier added a comment - example test using mockito. Minus of course the mockito jar that would maek this work and the fixes it may or may not be testing.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12501834/ZOOKEEPER-1100.patch
          against trunk revision 1196025.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12501834/ZOOKEEPER-1100.patch against trunk revision 1196025. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The patch appears to cause tar ant target to fail. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/757//console This message is automatically generated.
          Hide
          Gunnar Wagenknecht added a comment -

          The issue we faced was that an OutOfMemoryError was thrown in the send thread which silently killed it. After that all other threads interacting with ZooKeeper just hung because there was no send thread and the ZooKeeper connection was still considered fine.

          Show
          Gunnar Wagenknecht added a comment - The issue we faced was that an OutOfMemoryError was thrown in the send thread which silently killed it. After that all other threads interacting with ZooKeeper just hung because there was no send thread and the ZooKeeper connection was still considered fine.
          Hide
          Mahadev konar added a comment -

          Camille,
          I dont think we have a dependency on mockito yet. I am adding one in ZOOKEEPER-1271.

          Show
          Mahadev konar added a comment - Camille, I dont think we have a dependency on mockito yet. I am adding one in ZOOKEEPER-1271 .
          Hide
          Mahadev konar added a comment -

          Rakesh,
          Can you please take a look at the failing test and also look at adding a test case?

          Show
          Mahadev konar added a comment - Rakesh, Can you please take a look at the failing test and also look at adding a test case?
          Hide
          Rakesh R added a comment -

          Mahadev/Camille/Gunnar,

          I have checked the trunk versions, currently ClientCnxn.SendThread.run() has been modified as follows by catching 'Throwable'. So the clients will get Disconnected events and the problem mentioned in the issue will never be occured.

          Earlier it was 'Exception' when the defect was raised. I think the patch neednot be required. Please give your opinion.

          void run() {
            while (zooKeeper.state.isAlive()) {
              try {
                ...
              } catch (Throwable e) {
                // handled all exceptions and sending Disconnected events. retrying
              }
            }
            ...
          }
          

          Thanks,
          Rakesh

          Show
          Rakesh R added a comment - Mahadev/Camille/Gunnar, I have checked the trunk versions, currently ClientCnxn.SendThread.run() has been modified as follows by catching 'Throwable'. So the clients will get Disconnected events and the problem mentioned in the issue will never be occured. Earlier it was 'Exception' when the defect was raised. I think the patch neednot be required. Please give your opinion. void run() { while (zooKeeper.state.isAlive()) { try { ... } catch (Throwable e) { // handled all exceptions and sending Disconnected events. retrying } } ... } Thanks, Rakesh
          Hide
          Gunnar Wagenknecht added a comment -

          FWIW, in our case it was an OutOfMemoryError that killed the send thread. Catching Throwable might be ok but I wonder if a OutOfMemoryError should be handled differently, i.e. any subsequent memory allocation might fail because of lack of resources. Should an error be re-thrown?

          Show
          Gunnar Wagenknecht added a comment - FWIW, in our case it was an OutOfMemoryError that killed the send thread. Catching Throwable might be ok but I wonder if a OutOfMemoryError should be handled differently, i.e. any subsequent memory allocation might fail because of lack of resources. Should an error be re-thrown?
          Hide
          Rakesh R added a comment -

          Hi Gunnar,

          In case of OutOfMemoryError also:
          ZKClient will get the notification of 'Disconnected' event, here the other threads which are interacting with the ZooKeeper can listen for this event before any datatransfer. So the client applications can be able to gracefully handle the scenario of interacting threads.

          IMO, if required we can introduce 'timed reconnect' only for new clients, but it makes the client code more complex.

          Show
          Rakesh R added a comment - Hi Gunnar, In case of OutOfMemoryError also: ZKClient will get the notification of 'Disconnected' event, here the other threads which are interacting with the ZooKeeper can listen for this event before any datatransfer. So the client applications can be able to gracefully handle the scenario of interacting threads. IMO, if required we can introduce 'timed reconnect' only for new clients, but it makes the client code more complex.
          Hide
          Gunnar Wagenknecht added a comment -

          Previously the OOME escaped and the send thread died silently. If that's no longer the case then I think the issue can be marked as fixed (or works-for-me).

          Show
          Gunnar Wagenknecht added a comment - Previously the OOME escaped and the send thread died silently. If that's no longer the case then I think the issue can be marked as fixed (or works-for-me).
          Hide
          Rakesh R added a comment -

          Hi Gunnar, Thanks a lot for your comments!

          Just to sum up my current understanding, any exception/error conditions 'SendThread' would dispatch the 'Disconnected' event and retrying.

          I think it works. If no one else objects I'll mark this as fixed.

          Show
          Rakesh R added a comment - Hi Gunnar, Thanks a lot for your comments! Just to sum up my current understanding, any exception/error conditions 'SendThread' would dispatch the 'Disconnected' event and retrying. I think it works. If no one else objects I'll mark this as fixed.
          Hide
          Rakesh R added a comment -

          Hi Gunnar,

          Un-assigning myself. If you agree the comments, please mark as 'Resolved' and close it.
          I don't want to appear to be blocking it or such. Will keep watching to see if there is some way I can help out

          Thanks,
          Rakesh

          Show
          Rakesh R added a comment - Hi Gunnar, Un-assigning myself. If you agree the comments, please mark as 'Resolved' and close it. I don't want to appear to be blocking it or such. Will keep watching to see if there is some way I can help out Thanks, Rakesh
          Hide
          Gunnar Wagenknecht added a comment -

          As a user an non-commiter I'm reluctant marking this issue resolved. Is that the common processes used at ZK?

          Show
          Gunnar Wagenknecht added a comment - As a user an non-commiter I'm reluctant marking this issue resolved. Is that the common processes used at ZK?
          Hide
          Camille Fournier added a comment -

          Seems like you all think this is a non-issue, so I will mark it as resolved. Please do feel free to re-open if you see the issue again.

          Show
          Camille Fournier added a comment - Seems like you all think this is a non-issue, so I will mark it as resolved. Please do feel free to re-open if you see the issue again.
          Hide
          Camille Fournier added a comment -

          Fixed by other refactorings and changes to client cnxn.

          Show
          Camille Fournier added a comment - Fixed by other refactorings and changes to client cnxn.
          Hide
          Jeremy Stribling added a comment -

          Fixed by other refactorings and changes to client cnxn.

          Did these refactorings happen in any 3.3.X or 3.4.X release? What JIRA are they related to? I think I am hitting this, and want to patch my system without upgrading to trunk. Do you all think I can just apply the attached patch to fix it? Or should I instead apply the patches Camille refers to above that are in trunk?

          Show
          Jeremy Stribling added a comment - Fixed by other refactorings and changes to client cnxn. Did these refactorings happen in any 3.3.X or 3.4.X release? What JIRA are they related to? I think I am hitting this, and want to patch my system without upgrading to trunk. Do you all think I can just apply the attached patch to fix it? Or should I instead apply the patches Camille refers to above that are in trunk?
          Hide
          Camille Fournier added a comment -

          3.4.X and trunk, I believe. Are you seeing it in 3.4.X? We did a big refactor between 3.3.X and 3.4... I can look for a jira if you're interested.

          Show
          Camille Fournier added a comment - 3.4.X and trunk, I believe. Are you seeing it in 3.4.X? We did a big refactor between 3.3.X and 3.4... I can look for a jira if you're interested.
          Hide
          Jeremy Stribling added a comment -

          No, this was on 3.3.3 (sorry, I should have mentioned that before). That's great to know, thanks!

          Show
          Jeremy Stribling added a comment - No, this was on 3.3.3 (sorry, I should have mentioned that before). That's great to know, thanks!
          Hide
          Rakesh R added a comment -

          Hi Jeremy, I have seen 'ClientCnxn.java' code has been refactored as part of ZOOKEEPER-1174 and has modified to Throwable instead of Exception block. Later when reviewing the code, I feel there is a minor chance of OOME due to object creation. Please refer ZOOKEEPER-1375 for more info.

          Show
          Rakesh R added a comment - Hi Jeremy, I have seen 'ClientCnxn.java' code has been refactored as part of ZOOKEEPER-1174 and has modified to Throwable instead of Exception block. Later when reviewing the code, I feel there is a minor chance of OOME due to object creation. Please refer ZOOKEEPER-1375 for more info.

            People

            • Assignee:
              Unassigned
              Reporter:
              Gunnar Wagenknecht
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development