ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1521

LearnerHandler initLimit/syncLimit problems specifying follower socket timeout limits

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.3, 3.3.5, 3.5.0
    • Fix Version/s: 3.3.6, 3.4.4, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      branch 3.3: The leader is expecting the follower to initialize in syncLimit time rather than initLimit. In LearnerHandler run line 395 (branch33) we look for the ack from the follower with a timeout of syncLimit.

      branch 3.4+: seems like ZOOKEEPER-1136 introduced a regression while attempting to fix the problem. It sets the timeout as initLimit however it never sets the timeout to syncLimit once the ack is received.

      1. ZOOKEEPER-1521_br33.patch
        2 kB
        Patrick Hunt
      2. ZOOKEEPER-1521_br34.patch
        2 kB
        Patrick Hunt
      3. ZOOKEEPER-1521.patch
        2 kB
        Patrick Hunt

        Activity

        Hide
        Henry Robinson added a comment -

        Good catch. Here's what I notice in branch-3.3:

        • Leader.java:240 - the initial read timeout is set to syncLimit ticks, but the first thing we wait for is an ACK from the follower saying that it has got up-to-date, which should be subject to initLimit
        • I also saw that in Learner.java:220, the connection should be established with initLimit as the connection timeout (this is not fixed in branch-3.4). However, because there's a retry loop, there's no guarantee that we will connect in less than initLimit or syncLimit. So initLimit is not a hard limit at all; but it isn't already for other reasons.

        In branch-3.4:

        • LearnerHandler.java:336 sets the initial timeout to initLimit, but never sets it back again after the ACK. And it should just be setting the timeout in Leader.java:254 anyhow.
        Show
        Henry Robinson added a comment - Good catch. Here's what I notice in branch-3.3: Leader.java:240 - the initial read timeout is set to syncLimit ticks, but the first thing we wait for is an ACK from the follower saying that it has got up-to-date, which should be subject to initLimit I also saw that in Learner.java:220 , the connection should be established with initLimit as the connection timeout (this is not fixed in branch-3.4). However, because there's a retry loop, there's no guarantee that we will connect in less than initLimit or syncLimit. So initLimit is not a hard limit at all; but it isn't already for other reasons. In branch-3.4: LearnerHandler.java:336 sets the initial timeout to initLimit , but never sets it back again after the ACK. And it should just be setting the timeout in Leader.java:254 anyhow.
        Hide
        Patrick Hunt added a comment -

        These patches fix the problem. However I don't see a reasonable way to add a unit test for this with our current infrastructure. I will verify manually.

        Show
        Patrick Hunt added a comment - These patches fix the problem. However I don't see a reasonable way to add a unit test for this with our current infrastructure. I will verify manually.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12538025/ZOOKEEPER-1521.patch
        against trunk revision 1362660.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12538025/ZOOKEEPER-1521.patch against trunk revision 1362660. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1144//console This message is automatically generated.
        Hide
        Benjamin Reed added a comment -

        +1 looks good. it is reasonable to do a test for this?

        Show
        Benjamin Reed added a comment - +1 looks good. it is reasonable to do a test for this?
        Hide
        Patrick Hunt added a comment -

        I can't see a way to do it with our current infrastructure. I'm going to verify manually. I'm also talking to my QA team about adding this to their system test regime.

        Show
        Patrick Hunt added a comment - I can't see a way to do it with our current infrastructure. I'm going to verify manually. I'm also talking to my QA team about adding this to their system test regime.
        Hide
        Patrick Hunt added a comment -

        I verified this fix by running a very large snapshot file with the unfixed code and initlimit=200 and synclimit=2 (ticktime = 2000 in all cases), which failed to allow sufficient time during the init phase. I then re-ran this using the fixed code and the quorum came up successfully.

        Show
        Patrick Hunt added a comment - I verified this fix by running a very large snapshot file with the unfixed code and initlimit=200 and synclimit=2 (ticktime = 2000 in all cases), which failed to allow sufficient time during the init phase. I then re-ran this using the fixed code and the quorum came up successfully.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1628 (See https://builds.apache.org/job/ZooKeeper-trunk/1628/)
        ZOOKEEPER-1521. LearnerHandler initLimit/syncLimit problems specifying follower socket timeout limits (phunt) (Revision 1366781)

        Result = SUCCESS
        phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1366781
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1628 (See https://builds.apache.org/job/ZooKeeper-trunk/1628/ ) ZOOKEEPER-1521 . LearnerHandler initLimit/syncLimit problems specifying follower socket timeout limits (phunt) (Revision 1366781) Result = SUCCESS phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1366781 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development