ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1361

Leader.lead iterates over 'learners' set without proper synchronisation

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.2
    • Fix Version/s: 3.4.4, 3.5.0
    • Component/s: None
    • Labels:
      None

      Description

      This block:

      HashSet<Long> followerSet = new HashSet<Long>();
      for(LearnerHandler f : learners)
          followerSet.add(f.getSid());
      

      is executed without holding the lock on learners, so if there were ever a condition where a new learner was added during the initial sync phase, I'm pretty sure we'd see a concurrent modification exception. Certainly other parts of the code are very careful to lock on learners when iterating.

      It would be nice to use a ConcurrentHashMap to hold the learners instead, but I can't convince myself that this wouldn't introduce some correctness bugs. For example the following:

      Learners contains A, B, C, D
      Thread 1 iterates over learners, and gets as far as B.
      Thread 2 removes A, and adds E.
      Thread 1 continues iterating and sees a learner view of A, B, C, D, E

      This may be a bug if Thread 1 is counting the number of synced followers for a quorum count, since at no point was A, B, C, D, E a correct view of the quorum.

      In practice, I think this is actually ok, because I don't think ZK makes any strong ordering guarantees on learners joining or leaving (so we don't need a strong serialisability guarantee on learners) but I don't think I'll make that change for this patch. Instead I want to clean up the locking protocols on the follower / learner sets - to avoid another easy deadlock like the one we saw in ZOOKEEPER-1294 - and to do less with the lock held; i.e. to copy and then iterate over the copy rather than iterate over a locked set.

      1. zk-memory-leak-fix.patch
        11 kB
        Ross Cohen
      2. ZOOKEEPER-1361.patch
        32 kB
        Henry Robinson
      3. ZOOKEEPER-1361-3.4.patch
        11 kB
        Mahadev konar
      4. ZOOKEEPER-1361-3.4.patch
        10 kB
        Camille Fournier
      5. ZOOKEEPER-1361-no-whitespace.patch
        10 kB
        Henry Robinson

        Activity

        Hide
        Henry Robinson added a comment -

        This patch cleans up some of the synchronisation around Leader's member variables:

        • Member variables are now private to Leader, and are accessed with getters
        • Synchronisation (mostly) happens only in those getters, so locks are held for less long
        • As a result, all accesses are now correctly synchronised

        Note that the behaviour of the code has changed slightly: where previously some methods would hold onto e.g. the Leader.learners lock for the entire loop, now the same method will copy the list of learners and then iterate over the copy without holding the lock.

        Show
        Henry Robinson added a comment - This patch cleans up some of the synchronisation around Leader's member variables: Member variables are now private to Leader, and are accessed with getters Synchronisation (mostly) happens only in those getters, so locks are held for less long As a result, all accesses are now correctly synchronised Note that the behaviour of the code has changed slightly: where previously some methods would hold onto e.g. the Leader.learners lock for the entire loop, now the same method will copy the list of learners and then iterate over the copy without holding the lock.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12510631/ZOOKEEPER-1361.patch
        against trunk revision 1231389.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510631/ZOOKEEPER-1361.patch against trunk revision 1231389. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/906//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        If we're going to do all these whitespace changes, it's going to make changes to both 3.4 and trunk difficult. I am really not fond of changing all the whitespace in a file for a simple checkin. Can we get this patch generated without the whitespace changes?

        Show
        Camille Fournier added a comment - If we're going to do all these whitespace changes, it's going to make changes to both 3.4 and trunk difficult. I am really not fond of changing all the whitespace in a file for a simple checkin. Can we get this patch generated without the whitespace changes?
        Hide
        Henry Robinson added a comment -

        Try this one, should have the whitespace stripped.

        Show
        Henry Robinson added a comment - Try this one, should have the whitespace stripped.
        Hide
        Camille Fournier added a comment -

        I can't get it to apply to either 3.4 or trunk...

        Show
        Camille Fournier added a comment - I can't get it to apply to either 3.4 or trunk...
        Hide
        Henry Robinson added a comment -

        It applies to trunk for me (I don't know that this has to into 3.4, since we've not seen bug reports on this issue).

        My trunk is at ZOOKEEPER-1386, and I can apply the patch cleanly with

        patch -p0 < ZOOKEEPER-1361-no-whitespace.patch

        Am I out of date?

        Show
        Henry Robinson added a comment - It applies to trunk for me (I don't know that this has to into 3.4, since we've not seen bug reports on this issue). My trunk is at ZOOKEEPER-1386 , and I can apply the patch cleanly with patch -p0 < ZOOKEEPER-1361 -no-whitespace.patch Am I out of date?
        Hide
        Camille Fournier added a comment -

        No sorry that was my mistake. Ok this is looking good I will check it in.

        Show
        Camille Fournier added a comment - No sorry that was my mistake. Ok this is looking good I will check it in.
        Hide
        Henry Robinson added a comment -

        Awesome, thanks Camille.

        Show
        Henry Robinson added a comment - Awesome, thanks Camille.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12516129/ZOOKEEPER-1361-no-whitespace.patch
        against trunk revision 1293982.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12516129/ZOOKEEPER-1361-no-whitespace.patch against trunk revision 1293982. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/961//console This message is automatically generated.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1472 (See https://builds.apache.org/job/ZooKeeper-trunk/1472/)
        ZOOKEEPER-1361. Leader.lead iterates over 'learners' set without proper synchronisation (henryr via camille) (Revision 1294000)

        Result = SUCCESS
        camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294000
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LeaderBean.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1472 (See https://builds.apache.org/job/ZooKeeper-trunk/1472/ ) ZOOKEEPER-1361 . Leader.lead iterates over 'learners' set without proper synchronisation (henryr via camille) (Revision 1294000) Result = SUCCESS camille : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1294000 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LeaderBean.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java
        Hide
        Ross Cohen added a comment -

        I have experienced an issue which this fixes on 3.4. Specifically, there is a memory leak on the leader when an observer disconnects. It is fixed by this hunk:

        --- src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        +++ src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        @@ -119,6 +163,9 @@ public class Leader {
                 synchronized (learners) {
                     learners.remove(peer);
                 }
        +        synchronized (observingLearners) {
        +            observingLearners.remove(peer);
        +        }
             }
         
             boolean isLearnerSynced(LearnerHandler peer){
        

        Can at least this portion be applied to the 3.4 branch?

        Show
        Ross Cohen added a comment - I have experienced an issue which this fixes on 3.4. Specifically, there is a memory leak on the leader when an observer disconnects. It is fixed by this hunk: --- src/java/main/org/apache/zookeeper/server/quorum/Leader.java +++ src/java/main/org/apache/zookeeper/server/quorum/Leader.java @@ -119,6 +163,9 @@ public class Leader { synchronized (learners) { learners.remove(peer); } + synchronized (observingLearners) { + observingLearners.remove(peer); + } } boolean isLearnerSynced(LearnerHandler peer){ Can at least this portion be applied to the 3.4 branch?
        Hide
        Camille Fournier added a comment -

        Reopening this to propose a 3.4.X branch patch

        Show
        Camille Fournier added a comment - Reopening this to propose a 3.4.X branch patch
        Hide
        Camille Fournier added a comment -

        Created a patch for 3.4 based on this fix. Ross, can you apply it to your version and see if it fixes the problem you're seeing? Henry, can you glance at it and approve or propose changes?

        Show
        Camille Fournier added a comment - Created a patch for 3.4 based on this fix. Ross, can you apply it to your version and see if it fixes the problem you're seeing? Henry, can you glance at it and approve or propose changes?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12543890/ZOOKEEPER-1361-3.4.patch
        against trunk revision 1380931.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1177//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12543890/ZOOKEEPER-1361-3.4.patch against trunk revision 1380931. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1177//console This message is automatically generated.
        Hide
        Ross Cohen added a comment -

        Attaching the version I tested. Memory leak did not appear with it. Looks like it's the same as what you posted except some lines are reordered and the CHANGES.txt bit.

        Show
        Ross Cohen added a comment - Attaching the version I tested. Memory leak did not appear with it. Looks like it's the same as what you posted except some lines are reordered and the CHANGES.txt bit.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12543947/zk-memory-leak-fix.patch
        against trunk revision 1380931.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1178//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12543947/zk-memory-leak-fix.patch against trunk revision 1380931. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1178//console This message is automatically generated.
        Hide
        Camille Fournier added a comment -

        These patches are identical except that I added the synch around sendPacket to make it iterate over getForwardingFollowers, which was not in the original change and I'm not sure if it was an oversight or not. Can someone else take a look and say which should be there?

        Show
        Camille Fournier added a comment - These patches are identical except that I added the synch around sendPacket to make it iterate over getForwardingFollowers, which was not in the original change and I'm not sure if it was an oversight or not. Can someone else take a look and say which should be there?
        Hide
        Henry Robinson added a comment -

        Hey - sorry for the delay. I don't think the extra synchronisation in sendPacket is strictly necessary (because note that the forwardingFollowers lock is already held). However I think that the scoped lock around queuePacket is probably not required, should be removed, not the call to getForwardingFollowers. Make sense?

        Show
        Henry Robinson added a comment - Hey - sorry for the delay. I don't think the extra synchronisation in sendPacket is strictly necessary (because note that the forwardingFollowers lock is already held). However I think that the scoped lock around queuePacket is probably not required, should be removed, not the call to getForwardingFollowers. Make sense?
        Hide
        Mahadev konar added a comment -

        Thanks Camille/Ross/Henry,
        I am committing Ross's patch that is a straightforward port from trunk to the 3.4 branch. Attaching a cleaned up version of Ross's patch (removing CHANGES.txt changes).

        Show
        Mahadev konar added a comment - Thanks Camille/Ross/Henry, I am committing Ross's patch that is a straightforward port from trunk to the 3.4 branch. Attaching a cleaned up version of Ross's patch (removing CHANGES.txt changes).
        Hide
        Mahadev konar added a comment -

        Committed to 3.4. Thanks Henry/Camille/Ross!

        Show
        Mahadev konar added a comment - Committed to 3.4. Thanks Henry/Camille/Ross!

          People

          • Assignee:
            Henry Robinson
            Reporter:
            Henry Robinson
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development