ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1343

getEpochToPropose should check if lastAcceptedEpoch is greater or equal than epoch

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.3, 3.5.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following block in Leader.getEpochToPropose:

      if (lastAcceptedEpoch > epoch) {
      	epoch = lastAcceptedEpoch+1;
      }
      

      needs to be fixed, since it doesn't increment the epoch variable in the case epoch != -1 (initial value) and lastAcceptedEpoch is equal. The fix trivial and corresponds to changing > with >=.

      1. ZOOKEEPER-1343.patch
        5 kB
        Flavio Junqueira
      2. ZOOKEEPER-1343.patch
        6 kB
        Flavio Junqueira
      3. ZOOKEEPER-1343.patch
        7 kB
        Flavio Junqueira
      4. ZOOKEEPER-1343-3.4.patch
        7 kB
        Flavio Junqueira

        Activity

        Hide
        Flavio Junqueira added a comment -

        Hi Alex, I'd like to suggest that we create a new jira for the issue you're raising, especially if it is happening in your branch, and not in trunk. We may want to look more carefully into it before rushing into a fix. Otherwise, your observations make sense.

        Show
        Flavio Junqueira added a comment - Hi Alex, I'd like to suggest that we create a new jira for the issue you're raising, especially if it is happening in your branch, and not in trunk. We may want to look more carefully into it before rushing into a fix. Otherwise, your observations make sense.
        Hide
        Alexander Shraer added a comment -

        Flavio, I think you should add

        if (leadThread != null)

        { leadThread.interrupt(); leadThread.join(); }

        at the end of your new test. I have the succeeding test periodically failing because the 33556 port used by both tests for the leader quorum port is still in use. Adding the code above would wait until the leader shuts down in your test.

        Actually, in addition to the above fix, I think we should change all the manually assigned ports in Zab1_0Test to be assigned using Portassignment.unique() like in other tests (I can do that if people agree that this should be done). If I understand correctly the static counter used in unique() to assign ports is initialized once per test file, so it would also prevent the problem I'm seeing here of two tests in the same file trying to use the same port.

        Unfortunately I'm only seeing this in my branch and not in trunk.

        Show
        Alexander Shraer added a comment - Flavio, I think you should add if (leadThread != null) { leadThread.interrupt(); leadThread.join(); } at the end of your new test. I have the succeeding test periodically failing because the 33556 port used by both tests for the leader quorum port is still in use. Adding the code above would wait until the leader shuts down in your test. Actually, in addition to the above fix, I think we should change all the manually assigned ports in Zab1_0Test to be assigned using Portassignment.unique() like in other tests (I can do that if people agree that this should be done). If I understand correctly the static counter used in unique() to assign ports is initialized once per test file, so it would also prevent the problem I'm seeing here of two tests in the same file trying to use the same port. Unfortunately I'm only seeing this in my branch and not in trunk.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1417 (See https://builds.apache.org/job/ZooKeeper-trunk/1417/)
        ZOOKEEPER-1343. getEpochToPropose should check if lastAcceptedEpoch is greater or equal than epoch

        breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1227000
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1417 (See https://builds.apache.org/job/ZooKeeper-trunk/1417/ ) ZOOKEEPER-1343 . getEpochToPropose should check if lastAcceptedEpoch is greater or equal than epoch breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1227000 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        Hide
        Benjamin Reed added a comment -

        +1 looks great. the tabs seem to have gotten worse. i'll fix on checkin.

        Show
        Benjamin Reed added a comment - +1 looks great. the tabs seem to have gotten worse. i'll fix on checkin.
        Hide
        Flavio Junqueira added a comment -

        The last patch I uploaded was for 3.4, not trunk. That's why QA failed.

        Show
        Flavio Junqueira added a comment - The last patch I uploaded was for 3.4, not trunk. That's why QA failed.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508836/ZOOKEEPER-1343-3.4.patch
        against trunk revision 1225352.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/867//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508836/ZOOKEEPER-1343-3.4.patch against trunk revision 1225352. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/867//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Uploading a patch for the 3.4 branch.

        Show
        Flavio Junqueira added a comment - Uploading a patch for the 3.4 branch.
        Hide
        Alexander Shraer added a comment -

        you're right, sorry

        Show
        Alexander Shraer added a comment - you're right, sorry
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508608/ZOOKEEPER-1343.patch
        against trunk revision 1222816.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508608/ZOOKEEPER-1343.patch against trunk revision 1222816. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/857//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Forgot to remove some redundant code.

        Show
        Flavio Junqueira added a comment - Forgot to remove some redundant code.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508606/ZOOKEEPER-1343.patch
        against trunk revision 1222816.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508606/ZOOKEEPER-1343.patch against trunk revision 1222816. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/856//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Get QA to run on the new uploaded patch.

        Show
        Flavio Junqueira added a comment - Get QA to run on the new uploaded patch.
        Hide
        Flavio Junqueira added a comment -

        I have removed the extra method by adding some more code to the test. Seems cleaner to me.

        Show
        Flavio Junqueira added a comment - I have removed the extra method by adding some more code to the test. Seems cleaner to me.
        Hide
        Flavio Junqueira added a comment -

        I'm not sure I'm following you. Here is how I see it:

        1. The call to peer.setAcceptedEpoch(5) sets the value of QuorumPeer.acceptedEpoch to 5. The value 5 was arbitrarily chosen, of course.
        2. The call to leader.start() will invoke Leader.lead(), which will invoke getEpochToPropose() and will set the value of Leader.epoch to 5 (initial value of Leader.epoch is -1). The leader thread will block in getEpochToPropose() until a quorum runs getEpochToPropose().
        3. The mock follower executes getEpochToPropose(1,6). Since the leader + this follower form a quorum, both are able to continue and complete the execution of getEpochToPropose(). Without the fix, the value of Leader.epoch won't be 7 as it should be after the call of getEpochToPropose() from the mock follower.

        This is the flow I believe we have with the current test, and I think that multiple consecutive calls of getEpochToPropose() from the leader are not possible because the leader will block in the first call until it gets a quorum.

        Show
        Flavio Junqueira added a comment - I'm not sure I'm following you. Here is how I see it: The call to peer.setAcceptedEpoch(5) sets the value of QuorumPeer.acceptedEpoch to 5. The value 5 was arbitrarily chosen, of course. The call to leader.start() will invoke Leader.lead(), which will invoke getEpochToPropose() and will set the value of Leader.epoch to 5 (initial value of Leader.epoch is -1). The leader thread will block in getEpochToPropose() until a quorum runs getEpochToPropose(). The mock follower executes getEpochToPropose(1,6). Since the leader + this follower form a quorum, both are able to continue and complete the execution of getEpochToPropose(). Without the fix, the value of Leader.epoch won't be 7 as it should be after the call of getEpochToPropose() from the mock follower. This is the flow I believe we have with the current test, and I think that multiple consecutive calls of getEpochToPropose() from the leader are not possible because the leader will block in the first call until it gets a quorum.
        Hide
        Alexander Shraer added a comment -

        yeah, but when the leader invokes getEpochToPropose it passes self.getAcceptedEpoch() and you're invoking setAcceptingEpoch(5) in the test before starting the leader thread. So it should be invoked twice with 5.

        Show
        Alexander Shraer added a comment - yeah, but when the leader invokes getEpochToPropose it passes self.getAcceptedEpoch() and you're invoking setAcceptingEpoch(5) in the test before starting the leader thread. So it should be invoked twice with 5.
        Hide
        Flavio Junqueira added a comment -

        Leader initializes epoch to -1, so the first call always changes the value of epoch. The subsequent calls are the ones that can cause the problem.

        Show
        Flavio Junqueira added a comment - Leader initializes epoch to -1, so the first call always changes the value of epoch. The subsequent calls are the ones that can cause the problem.
        Hide
        Alexander Shraer added a comment -

        Flavio, just a suggestion - it might work if you don't do the wait you mention and instead do
        long epoch = leader.getEpochToPropose(1, 5);
        Assert.assertEquals("New proposed epoch is wrong", 6, epoch);

        Intuitively without the fix it should return 5 and with the fix should return 6.

        Show
        Alexander Shraer added a comment - Flavio, just a suggestion - it might work if you don't do the wait you mention and instead do long epoch = leader.getEpochToPropose(1, 5); Assert.assertEquals("New proposed epoch is wrong", 6, epoch); Intuitively without the fix it should return 5 and with the fix should return 6.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12508561/ZOOKEEPER-1343.patch
        against trunk revision 1221868.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508561/ZOOKEEPER-1343.patch against trunk revision 1221868. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/855//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        I have fixed the issues and added a test. In the new test, I need to wait until the mock leader executes getEpochToPropose before calling the same method mocking a follower. Otherwise I get a race and it doesn't fail consistently without the fix.

        To make it fail consistently without the fix, I have introduced a while loop that waits until the value of epoch changes to 6 (the one of the leader + 1) and to check the value of epoch I had to introduce a new public call. It is possibly not the cleanest way, but if anyone has a alternative suggestion, I'm willing to take it.

        Show
        Flavio Junqueira added a comment - I have fixed the issues and added a test. In the new test, I need to wait until the mock leader executes getEpochToPropose before calling the same method mocking a follower. Otherwise I get a race and it doesn't fail consistently without the fix. To make it fail consistently without the fix, I have introduced a while loop that waits until the value of epoch changes to 6 (the one of the leader + 1) and to check the value of epoch I had to introduce a new public call. It is possibly not the cleanest way, but if anyone has a alternative suggestion, I'm willing to take it.
        Hide
        Benjamin Reed added a comment -

        please also fix the tabs for spaces.

        Show
        Benjamin Reed added a comment - please also fix the tabs for spaces.
        Hide
        Flavio Junqueira added a comment -

        The problem described occurs when the last accepted epoch e of a peer p is one more than the last accepted epoch e' of another peer p' (e' = e + 1). If getEpochToPropose is called for p before being called for p', then the value of epoch will be e' instead of e' + 1.

        I can try to take a cut at a test. I have already checked that the modification doesn't break anything.

        Show
        Flavio Junqueira added a comment - The problem described occurs when the last accepted epoch e of a peer p is one more than the last accepted epoch e' of another peer p' (e' = e + 1). If getEpochToPropose is called for p before being called for p', then the value of epoch will be e' instead of e' + 1. I can try to take a cut at a test. I have already checked that the modification doesn't break anything.

          People

          • Assignee:
            Flavio Junqueira
            Reporter:
            Flavio Junqueira
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development