ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1382

Zookeeper server holds onto dead/expired session ids in the watch data structures

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.5
    • Fix Version/s: 3.4.6, 3.5.0
    • Component/s: server
    • Labels:
      None

      Description

      I've observed that zookeeper server holds onto expired session ids in the watcher data structures. The result is the wchp command reports session ids that cannot be found through cons/dump and those expired session ids sit there maybe until the server is restarted. Here are snippets from the client and the server logs that lead to this state, for one particular session id 0x134485fd7bcb26f -

      There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 and I'm using ZkClient to connect to the cluster

      From the application log -

      application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application Session establishment complete on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, negotiated timeout = 6000
      application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] [main-SendThread(223.prod:12913)] [application] Client session timed out, have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing socket connection and attempting reconnect
      application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] [main-SendThread(226.prod:12913)] [application] Unable to reconnect to ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket connection

      On the leader zk, 225 -

      zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, timeout of 6000ms exceeded
      zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination for sessionid: 0x134485fd7bcb26f

      On the server, the client was initially connected to, 223 -

      zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO [CommitProcessor:1:NIOServerCnxn@1580] - Established session 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
      zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f

      Here are the log snippets from 226, which is the server, the client reconnected to, before getting session expired event -

      2012-01-27 09:52:38,190 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
      2012-01-27 09:52:38,191 - INFO [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
      2012-01-27 09:52:38,191 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.17.136.82:49367 which had sessionid 0x134485fd7bcb26f

      wchp output from 226, taken on 01/30 -

      nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 226.*wchp | wc -l
      3

      wchp output from 223, taken on 01/30 -

      nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 223.*wchp | wc -l
      0

      cons output from 223 and 226, taken on 01/30 -

      nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 226.*cons | wc -l
      0

      nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 223.*cons | wc -l
      0

      So, what seems to have happened is that the client was able to re-register the watches on the new server (226), after it got disconnected from 223, inspite of having an expired session id.

      In NIOServerCnxn, I saw that after suspecting that a session is expired, a server removes the cnxn and its watches from its internal data structures. But before that it allows more requests to be processed even if the session is expired -

      // Now that the session is ready we can start receiving packets
      synchronized (this.factory)

      { sk.selector().wakeup(); enableRecv(); }

      } catch (Exception e)

      { LOG.warn("Exception while establishing session, closing", e); close(); }

      I wonder if the client somehow sneaked in the set watches, right after the server removed the connection through removeCnxn() API ?

      1. ZOOKEEPER-1382_3.3.4.patch
        44 kB
        Neha Narkhede
      2. ZOOKEEPER-1382-branch-3.4.patch
        14 kB
        Michael Morello
      3. ZOOKEEPER-1382.patch
        19 kB
        Michael Morello
      4. ZOOKEEPER-1382-branch-3.4.patch
        15 kB
        Germán Blanco
      5. ZOOKEEPER-1382.patch
        21 kB
        Germán Blanco
      6. ZOOKEEPER-1382-branch-3.4.patch
        16 kB
        Germán Blanco
      7. ZOOKEEPER-1382.patch
        20 kB
        Germán Blanco
      8. ZOOKEEPER-1382.patch
        20 kB
        Germán Blanco
      9. ZOOKEEPER-1382-branch-3.4.patch
        16 kB
        Germán Blanco
      10. ZOOKEEPER-1382.patch
        20 kB
        Germán Blanco

        Activity

        Neha Narkhede created issue -
        Neha Narkhede made changes -
        Field Original Value New Value
        Assignee Neha Narkhede [ nehanarkhede ]
        Neha Narkhede made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Hide
        Neha Narkhede added a comment -

        Found this bug on 3.3.4 and hence, am uploading a draft patch with a possible fix for 3.3.4 first. If it looks good, I'll try doing the same on trunk.

        Here is what I think is causing this bug. Please correct me if I'm wrong -

        Say a client connects to a > 1 node ZK cluster and registers watch on some path. After that, if the client goes into a GC pause, the leader expires the session and proposes closing the session to the followers. Meanwhile, the client wakes up and tries to renew its session with one of the followers. Now, when the follower detects that the session is invalid, it sends a close session response to the client, and enables receive on the selector key for that connection. So it receives the setWatches request during the renew session. There is a race condition since the request processors on the follower merely "pass-through" the non-transactional setWatches request while the NIOServerCnxn has sent the close session request to the client. Due to the above, sometimes, the watches are set right after the NIOServerCnxn is removed from the ZooKeeperServer. Hence, the watches are left on the servers, for expired sessions, effectively causing a memory leak.

        The fix is to enable reads on the connection only if the session is found to be valid.

        For testing, I wrote a test and added it as part of the DisconnectedWatcherTest. However, it uses the in-memory ZK ensemble class from curator, with a few modifications to test watches. In addition to this, have added a timeoutSession() API to TestableZooKeeper to simulate the client side GC pause. However, it involves suspending and resuming the event thread.

        I have 2 concerns about tests for this patch -
        1. Method used to simulate a client side GC pause
        2. Longer test duration, since the test runs multiple iterations to reproduce this corner case race condition

        For 1, this is hacky, works, but it will be good to know other alternatives for writing this test.
        For 2, since this is more of a system test, wonder what is the best way to include it in zookeeper. Also, the new helper classes are from curator.

        Lastly, I left the DEBUG/TRACE logging, that I found useful to understand the server code, in the patch. Maybe, not all of it follows logging conventions.

        Show
        Neha Narkhede added a comment - Found this bug on 3.3.4 and hence, am uploading a draft patch with a possible fix for 3.3.4 first. If it looks good, I'll try doing the same on trunk. Here is what I think is causing this bug. Please correct me if I'm wrong - Say a client connects to a > 1 node ZK cluster and registers watch on some path. After that, if the client goes into a GC pause, the leader expires the session and proposes closing the session to the followers. Meanwhile, the client wakes up and tries to renew its session with one of the followers. Now, when the follower detects that the session is invalid, it sends a close session response to the client, and enables receive on the selector key for that connection. So it receives the setWatches request during the renew session. There is a race condition since the request processors on the follower merely "pass-through" the non-transactional setWatches request while the NIOServerCnxn has sent the close session request to the client. Due to the above, sometimes, the watches are set right after the NIOServerCnxn is removed from the ZooKeeperServer. Hence, the watches are left on the servers, for expired sessions, effectively causing a memory leak. The fix is to enable reads on the connection only if the session is found to be valid. For testing, I wrote a test and added it as part of the DisconnectedWatcherTest. However, it uses the in-memory ZK ensemble class from curator, with a few modifications to test watches. In addition to this, have added a timeoutSession() API to TestableZooKeeper to simulate the client side GC pause. However, it involves suspending and resuming the event thread. I have 2 concerns about tests for this patch - 1. Method used to simulate a client side GC pause 2. Longer test duration, since the test runs multiple iterations to reproduce this corner case race condition For 1, this is hacky, works, but it will be good to know other alternatives for writing this test. For 2, since this is more of a system test, wonder what is the best way to include it in zookeeper. Also, the new helper classes are from curator. Lastly, I left the DEBUG/TRACE logging, that I found useful to understand the server code, in the patch. Maybe, not all of it follows logging conventions.
        Neha Narkhede made changes -
        Attachment ZOOKEEPER-1382_3.3.4.patch [ 12514699 ]
        Neha Narkhede made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12514699/ZOOKEEPER-1382_3.3.4.patch
        against trunk revision 1242982.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 16 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/956//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12514699/ZOOKEEPER-1382_3.3.4.patch against trunk revision 1242982. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 16 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/956//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        This is a lot of change for a fix that seems to be really small. Can you put this into reviewboard for more careful review? I'm not sure we will want all the logging changes so you might want to go through and trim that stuff up before putting it up there. Thanks!

        Show
        Camille Fournier added a comment - This is a lot of change for a fix that seems to be really small. Can you put this into reviewboard for more careful review? I'm not sure we will want all the logging changes so you might want to go through and trim that stuff up before putting it up there. Thanks!
        Hide
        Patrick Hunt added a comment -

        this is a good one to get in, see the feedback provided so far though. (also, likely we'll need a patch for post 3.3).

        Show
        Patrick Hunt added a comment - this is a good one to get in, see the feedback provided so far though. (also, likely we'll need a patch for post 3.3).
        Patrick Hunt made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Patrick Hunt made changes -
        Fix Version/s 3.3.6 [ 12320172 ]
        Fix Version/s 3.4.4 [ 12319841 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Priority Major [ 3 ] Critical [ 2 ]
        Hide
        Mahadev konar added a comment -

        Moving it out, Neha can you please provide a patch for trunk and 3.4 as well? Good to see the tests that you added.

        Show
        Mahadev konar added a comment - Moving it out, Neha can you please provide a patch for trunk and 3.4 as well? Good to see the tests that you added.
        Mahadev konar made changes -
        Fix Version/s 3.3.7 [ 12321882 ]
        Fix Version/s 3.3.6 [ 12320172 ]
        Hide
        Mahadev konar added a comment -

        Moving it out again, Neha any chance you will have some bandwidth to follow up on this?

        Show
        Mahadev konar added a comment - Moving it out again, Neha any chance you will have some bandwidth to follow up on this?
        Mahadev konar made changes -
        Fix Version/s 3.4.5 [ 12321883 ]
        Fix Version/s 3.4.4 [ 12319841 ]
        Mahadev konar made changes -
        Fix Version/s 3.4.6 [ 12323310 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Fix Version/s 3.3.7 [ 12321882 ]
        Fix Version/s 3.4.5 [ 12321883 ]
        Hide
        Michael Morello added a comment -

        Since we monitor sessions and watches we note that we encounter this problem from time to time.
        This is not a big issue but it causes a memory leak.
        I'll try to see if I can work on it but it does seem hard to reproduce in a unit test.

        Show
        Michael Morello added a comment - Since we monitor sessions and watches we note that we encounter this problem from time to time. This is not a big issue but it causes a memory leak. I'll try to see if I can work on it but it does seem hard to reproduce in a unit test.
        Michael Morello made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 3.4.5 [ 12321883 ]
        Affects Version/s 3.3.4 [ 12316276 ]
        Hide
        Michael Morello added a comment -

        I managed to create a JUnit test case that demonstrate the "watches leak", unfortunately i had to promote a few methods to the public scope.

        As Neha pointed out the unit test shows that if a client send a connection request with an expired session the NIOServerCnxn class still processes upcoming messages like OpCode.setWatches.

        Actually the fix simply consists to call cnxn.enableRecv() only if the session has been validated by the leader.

        Show
        Michael Morello added a comment - I managed to create a JUnit test case that demonstrate the "watches leak", unfortunately i had to promote a few methods to the public scope. As Neha pointed out the unit test shows that if a client send a connection request with an expired session the NIOServerCnxn class still processes upcoming messages like OpCode.setWatches. Actually the fix simply consists to call cnxn.enableRecv() only if the session has been validated by the leader.
        Michael Morello made changes -
        Attachment ZOOKEEPER-1382.patch [ 12571815 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12571815/ZOOKEEPER-1382.patch
        against trunk revision 1448007.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1419//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571815/ZOOKEEPER-1382.patch against trunk revision 1448007. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1419//console This message is automatically generated.
        Michael Morello made changes -
        Attachment ZOOKEEPER-1382.patch [ 12571815 ]
        Hide
        Michael Morello added a comment -

        Current patch is for branch 3.4
        Sorry

        Show
        Michael Morello added a comment - Current patch is for branch 3.4 Sorry
        Michael Morello made changes -
        Attachment ZOOKEEPER-1382-branch-3.4.patch [ 12571817 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12571817/ZOOKEEPER-1382-branch-3.4.patch
        against trunk revision 1448007.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1420//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571817/ZOOKEEPER-1382-branch-3.4.patch against trunk revision 1448007. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1420//console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        Michael,
        Would you be able to upload a patch for trunk as well?

        Show
        Mahadev konar added a comment - Michael, Would you be able to upload a patch for trunk as well?
        Hide
        Michael Morello added a comment -

        It seems that the behavior of disableRecv() / enableRecv() has changed a bit in 3.5 : there is a new thread (SelectorThread) in charge of interest op updates and updates are processed in a non blocking fashion (updates are queued to be handled later by the SelectorThread, if i understand correctly it is more a throttle mechanism than something else). The unit test is no more relevant.

        If we want to fix the two branches (3.4 and 3.5) it is imho no more a good idea to rely on enableRecv().
        May be it would be cleaner to introduce a new function ServerCnxn.setValidSession(boolean valid) that informs the implementation (NIOServerCnxn or NettyServerCnxn) that the session has been (re)validated and then allow subsequent messages to be processed.

        If you don't mind i will try to provide an other patch and a unit test for both branches.

        Show
        Michael Morello added a comment - It seems that the behavior of disableRecv() / enableRecv() has changed a bit in 3.5 : there is a new thread (SelectorThread) in charge of interest op updates and updates are processed in a non blocking fashion (updates are queued to be handled later by the SelectorThread, if i understand correctly it is more a throttle mechanism than something else). The unit test is no more relevant. If we want to fix the two branches (3.4 and 3.5) it is imho no more a good idea to rely on enableRecv(). May be it would be cleaner to introduce a new function ServerCnxn.setValidSession(boolean valid) that informs the implementation (NIOServerCnxn or NettyServerCnxn) that the session has been (re)validated and then allow subsequent messages to be processed. If you don't mind i will try to provide an other patch and a unit test for both branches.
        Hide
        Thawan Kooburat added a comment -

        Can you explain a bit more why we cannot rely on disableRecv() / enableRecv() in 3.5 to fix this issue?

        From my understanding, although the request is now process in non-blocking fashion, we are calling disableRecv() while processing a ConnectRequest packet, so the SelectorThread won't try to select this socket again for reading after we finished processing the packet.

        Show
        Thawan Kooburat added a comment - Can you explain a bit more why we cannot rely on disableRecv() / enableRecv() in 3.5 to fix this issue? From my understanding, although the request is now process in non-blocking fashion, we are calling disableRecv() while processing a ConnectRequest packet, so the SelectorThread won't try to select this socket again for reading after we finished processing the packet.
        Hide
        Michael Morello added a comment -

        My bad, I have not spend enough time on the new branch, you are right. The same fix could be applied.
        Nevertheless the unit test is no more relevant.
        I'll try to work on that.

        Show
        Michael Morello added a comment - My bad, I have not spend enough time on the new branch, you are right. The same fix could be applied. Nevertheless the unit test is no more relevant. I'll try to work on that.
        Michael Morello made changes -
        Attachment ZOOKEEPER-1382.patch [ 12572967 ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12572967/ZOOKEEPER-1382.patch
        against trunk revision 1453693.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12572967/ZOOKEEPER-1382.patch against trunk revision 1453693. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1427//console This message is automatically generated.
        Hide
        Michael Morello added a comment -

        testTransactionLogCorruption fails, not sure if is related to the patch or if it's because of ZOOKEEPER-1629

        Show
        Michael Morello added a comment - testTransactionLogCorruption fails, not sure if is related to the patch or if it's because of ZOOKEEPER-1629
        Hide
        Camille Fournier added a comment -

        I can't even get the TruncateCorruptionTest to run to completion or failure at all with or without this patch. I think we need to open a separate bug report on it. Besides that, this fix looks good to me.

        Show
        Camille Fournier added a comment - I can't even get the TruncateCorruptionTest to run to completion or failure at all with or without this patch. I think we need to open a separate bug report on it. Besides that, this fix looks good to me.
        Hide
        Alexander Shraer added a comment - - edited

        I've opened a bug for this some time ago: ZOOKEEPER-1629 perhaps we should disable this test for now ? for example, if one of the Java tests fails, the C tests don't run at all.

        Show
        Alexander Shraer added a comment - - edited I've opened a bug for this some time ago: ZOOKEEPER-1629 perhaps we should disable this test for now ? for example, if one of the Java tests fails, the C tests don't run at all.
        Hide
        Flavio Junqueira added a comment -

        The fix looks good to me overall, I just have a couple of comments:

        1. Some classes have been made public for testing. Could we use mock classes instead (classes that extend the target ZK classes)?
        2. We don't really name the test classes after the jiras that originated them. Could we use a more expressive name here?
        Show
        Flavio Junqueira added a comment - The fix looks good to me overall, I just have a couple of comments: Some classes have been made public for testing. Could we use mock classes instead (classes that extend the target ZK classes)? We don't really name the test classes after the jiras that originated them. Could we use a more expressive name here?
        Flavio Junqueira made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Germán Blanco added a comment -

        I would like to propose to skip the tests cases for this JIRA.
        The modification required is minimal and it seems clearly correct. However, trying to build a test that works and that doesn't require future maintenance seems to be very complex.
        I have done the mock classes as suggested by Flavio above, but the tests don't work any more (at least in trunk). My guess is that it is because of latest changes in the code. I don't see an easy fix (at least easy for me), and what is worse, it seems to me that these test cases will require review very often, since they depend very deeply on how things are implemented.

        Show
        Germán Blanco added a comment - I would like to propose to skip the tests cases for this JIRA. The modification required is minimal and it seems clearly correct. However, trying to build a test that works and that doesn't require future maintenance seems to be very complex. I have done the mock classes as suggested by Flavio above, but the tests don't work any more (at least in trunk). My guess is that it is because of latest changes in the code. I don't see an easy fix (at least easy for me), and what is worse, it seems to me that these test cases will require review very often, since they depend very deeply on how things are implemented.
        Hide
        Antonio added a comment -

        Hi, is ZooKeeper 3.4.3 affected by this bug?

        Show
        Antonio added a comment - Hi, is ZooKeeper 3.4.3 affected by this bug?
        Hide
        Germán Blanco added a comment -

        Yes

        Show
        Germán Blanco added a comment - Yes
        Germán Blanco made changes -
        Assignee Neha Narkhede [ nehanarkhede ] Germán Blanco [ abranzyck ]
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382-branch-3.4.patch [ 12609579 ]
        Hide
        Germán Blanco added a comment -

        No luck with the previous lazy proposal, so I think this is it.

        Show
        Germán Blanco added a comment - No luck with the previous lazy proposal, so I think this is it.
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382.patch [ 12609581 ]
        Germán Blanco made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 3.5.0 [ 12316644 ]
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12609581/ZOOKEEPER-1382.patch
        against trunk revision 1534390.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12609581/ZOOKEEPER-1382.patch against trunk revision 1534390. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1714//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Given that the Mock* classes are only used in WatchLeakTest, I was wondering if it wouldn't be best to have them defined inside the WatchLeakTest class.

        Also, there is a test case not running because @Test is commented out.

        Show
        Flavio Junqueira added a comment - Given that the Mock* classes are only used in WatchLeakTest, I was wondering if it wouldn't be best to have them defined inside the WatchLeakTest class. Also, there is a test case not running because @Test is commented out.
        Flavio Junqueira made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382-branch-3.4.patch [ 12610261 ]
        Hide
        Germán Blanco added a comment -

        At some point I commented that out. I thought that the test case didn't make much sense since it was passing both with and without the change. It is also not there in the 3.4 version. But this was hiding an error in the test cases.
        Now it is fixed.
        I had to make a small modification in NIOServerCnxn.java in order to allow the test to work. SocketChannel.isOpen is a final method and can't be mocked by Mockito.
        The Mock classes can't be moved to a different package. Otherwise they don't have access to the protected elements in that package any more.

        Show
        Germán Blanco added a comment - At some point I commented that out. I thought that the test case didn't make much sense since it was passing both with and without the change. It is also not there in the 3.4 version. But this was hiding an error in the test cases. Now it is fixed. I had to make a small modification in NIOServerCnxn.java in order to allow the test to work. SocketChannel.isOpen is a final method and can't be mocked by Mockito. The Mock classes can't be moved to a different package. Otherwise they don't have access to the protected elements in that package any more.
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382.patch [ 12610262 ]
        Germán Blanco made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12610262/ZOOKEEPER-1382.patch
        against trunk revision 1535491.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12610262/ZOOKEEPER-1382.patch against trunk revision 1535491. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1722//console This message is automatically generated.
        Germán Blanco made changes -
        Link This issue relates to ZOOKEEPER-866 [ ZOOKEEPER-866 ]
        Hide
        Germán Blanco added a comment -

        I wouldn't recommend to run zookeeper using ramdisk without having an additional method that ensures consistency. Ramdisk means data is not persistent after a reboot, and this JIRA shows that losing data in one node when the quorum was in minority may lead to permanent inconsistencies in the quorum.

        Show
        Germán Blanco added a comment - I wouldn't recommend to run zookeeper using ramdisk without having an additional method that ensures consistency. Ramdisk means data is not persistent after a reboot, and this JIRA shows that losing data in one node when the quorum was in minority may lead to permanent inconsistencies in the quorum.
        Germán Blanco made changes -
        Link This issue relates to ZOOKEEPER-866 [ ZOOKEEPER-866 ]
        Hide
        Germán Blanco added a comment -

        Added a review request for this https://reviews.apache.org/r/15259/

        Show
        Germán Blanco added a comment - Added a review request for this https://reviews.apache.org/r/15259/
        Hide
        Germán Blanco added a comment -

        spaces and style changes

        Show
        Germán Blanco added a comment - spaces and style changes
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382.patch [ 12612572 ]
        Germán Blanco made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Germán Blanco made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12612572/ZOOKEEPER-1382.patch
        against trunk revision 1539529.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12612572/ZOOKEEPER-1382.patch against trunk revision 1539529. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1747//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        I forgot to thank Raul Gutierrez Segales for reviewing the patch. ... dicen que más vale tarde que nunca.

        Show
        Germán Blanco added a comment - I forgot to thank Raul Gutierrez Segales for reviewing the patch. ... dicen que más vale tarde que nunca.
        Hide
        Germán Blanco added a comment -

        Flavio Junqueira, could you please take a look at this and say if it is ok now?

        Show
        Germán Blanco added a comment - Flavio Junqueira , could you please take a look at this and say if it is ok now?
        Hide
        Rakesh R added a comment -

        Hi German,
        Patch looks very nice, especially test cases are really good

        Add few suggestions:

        • Please add message while asserting, this will give more clarity, like
          assertEquals("There must be NO watches!", 0, watchCount);
          
        • Formatting is not proper in MockSelectorThread.java. Normally, between methods there will be a newline.
        • This is a suggestion and FYI. I would like to use @RunWith(Parameterized.class), when there are situations to run the same test case with different input values. In the patch, we have testcases with sessiontimeout and without sessiontimeout flags. Please refer MultiTransactionTest to understand more on this pattern. This way it will again make the test cases very simple.
        Show
        Rakesh R added a comment - Hi German, Patch looks very nice, especially test cases are really good Add few suggestions: Please add message while asserting, this will give more clarity, like assertEquals( "There must be NO watches!" , 0, watchCount); Formatting is not proper in MockSelectorThread.java. Normally, between methods there will be a newline. This is a suggestion and FYI. I would like to use @RunWith(Parameterized.class), when there are situations to run the same test case with different input values. In the patch, we have testcases with sessiontimeout and without sessiontimeout flags. Please refer MultiTransactionTest to understand more on this pattern. This way it will again make the test cases very simple.
        Hide
        Germán Blanco added a comment -

        Thank you Rakesh R, but the test cases are originally from someone else, Michael Morello I believe. And I agree, it seems very difficult to come up with a way to test this and he has done a good job.
        And thank you for the suggestions, I will try to take care of them in a few days and propose a new patch.

        Show
        Germán Blanco added a comment - Thank you Rakesh R , but the test cases are originally from someone else, Michael Morello I believe. And I agree, it seems very difficult to come up with a way to test this and he has done a good job. And thank you for the suggestions, I will try to take care of them in a few days and propose a new patch.
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382-branch-3.4.patch [ 12617335 ]
        Hide
        Germán Blanco added a comment -

        Updated with last comments.

        Show
        Germán Blanco added a comment - Updated with last comments.
        Germán Blanco made changes -
        Attachment ZOOKEEPER-1382.patch [ 12617336 ]
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12617336/ZOOKEEPER-1382.patch
        against trunk revision 1547702.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 12 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12617336/ZOOKEEPER-1382.patch against trunk revision 1547702. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1818//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        Pretty sweet tests. I think this looks good on trunk, will check it in.

        Show
        Camille Fournier added a comment - Pretty sweet tests. I think this looks good on trunk, will check it in.
        Hide
        Camille Fournier added a comment -

        Looks good on 3.4 as well, will close this as soon as builds complete successfully. Thanks to all who helped with this!

        Show
        Camille Fournier added a comment - Looks good on 3.4 as well, will close this as soon as builds complete successfully. Thanks to all who helped with this!
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in ZooKeeper-trunk #2150 (See https://builds.apache.org/job/ZooKeeper-trunk/2150/)
        ZOOKEEPER-1382. Zookeeper server holds onto dead/expired session ids in the watch data structures
        (Germán Blanco and Michael Morello via camille) (camille: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1550213)

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/MockPacket.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/MockNIOServerCnxn.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/MockSelectorThread.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/WatchLeakTest.java
        Show
        Hudson added a comment - SUCCESS: Integrated in ZooKeeper-trunk #2150 (See https://builds.apache.org/job/ZooKeeper-trunk/2150/ ) ZOOKEEPER-1382 . Zookeeper server holds onto dead/expired session ids in the watch data structures (Germán Blanco and Michael Morello via camille) (camille: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1550213 ) /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/MockPacket.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/MockNIOServerCnxn.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/MockSelectorThread.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/WatchLeakTest.java
        Camille Fournier made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Germán Blanco added a comment -

        Thank you Camille Fournier!

        Show
        Germán Blanco added a comment - Thank you Camille Fournier !
        Hide
        Flavio Junqueira added a comment -

        Closing issues after releasing 3.4.6.

        Show
        Flavio Junqueira added a comment - Closing issues after releasing 3.4.6.
        Flavio Junqueira made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Germán Blanco
            Reporter:
            Neha Narkhede
          • Votes:
            2 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development