ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-87

Follower does not shut itself down if its too far behind the leader.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 3.4.5, 3.5.0
    • Fix Version/s: 3.4.6, 3.5.0
    • Component/s: quorum
    • Labels:
    • Release Note:
      Hide
      The value of syncLimit should be reviewed before this change is introduced in the cluster. Since it now detects followers that are lagging behind, clusters in which this was happening will now see followers being disconnected from time to time. This will normally be solved by increasing the syncLimit value.
      Show
      The value of syncLimit should be reviewed before this change is introduced in the cluster. Since it now detects followers that are lagging behind, clusters in which this was happening will now see followers being disconnected from time to time. This will normally be solved by increasing the syncLimit value.

      Description

      Currently, the follower if lagging behind keeps sending pings to the leader it will stay alive and will keep getting further and further behind the leader. The follower should shut itself down if it is not able to keep up to the leader within some limit so that gurantee of updates can be made to the clients connected to different servers.

      1. ZOOKEEPER-87.patch
        10 kB
        Germán Blanco
      2. ZOOKEEPER-87_3.4.patch
        11 kB
        Germán Blanco
      3. ZOOKEEPER-87.patch
        10 kB
        Germán Blanco
      4. ZOOKEEPER-87_3.4.patch
        11 kB
        Germán Blanco

        Activity

        Hide
        Flavio Junqueira added a comment -

        Mahadev, I was thinking that we we could set a maximum capacity to queuedPackets on FollowerHandler, and check the remaining capacity on FollowerHandler.queuePacket() before accepting a new packet. We could then invoke FollowerHandler.shutdown() in the case we reach the capacity. What do you think? If you think this is a good idea, I can create a patch.

        Show
        Flavio Junqueira added a comment - Mahadev, I was thinking that we we could set a maximum capacity to queuedPackets on FollowerHandler, and check the remaining capacity on FollowerHandler.queuePacket() before accepting a new packet. We could then invoke FollowerHandler.shutdown() in the case we reach the capacity. What do you think? If you think this is a good idea, I can create a patch.
        Hide
        Mahadev konar added a comment -

        i was thinking abt the same lines but not on maximum capacity on follower handlers –

        ben and I had a discussion regarding this and we decided on doing htis —

        1) we send a Leader.PING from the leader to the follower — we will include the ticktime at the leader and send it to the follower

        2) the follower just returns back the tick time with the ping response.

        3) we update the ticktime that we saw last from the follower and then kill the follower handler if its lagging behind.

        – this approach is less error prone since we are getting response from the follower and know whats the last ticktime it processed. The problem with your suggestion is that you might be sending it out fast to the follower but the follower might be buffering all of it and not doing anything with it...

        though the above suggestion has a problem right now since the Leader.PING does not go through the pipeline at the follower. so a response to the ping does not necessarily mean that the follower is in line with processing the requests. We have to get an end to end response from the follower to see how much it is behind..

        Show
        Mahadev konar added a comment - i was thinking abt the same lines but not on maximum capacity on follower handlers – ben and I had a discussion regarding this and we decided on doing htis — 1) we send a Leader.PING from the leader to the follower — we will include the ticktime at the leader and send it to the follower 2) the follower just returns back the tick time with the ping response. 3) we update the ticktime that we saw last from the follower and then kill the follower handler if its lagging behind. – this approach is less error prone since we are getting response from the follower and know whats the last ticktime it processed. The problem with your suggestion is that you might be sending it out fast to the follower but the follower might be buffering all of it and not doing anything with it... though the above suggestion has a problem right now since the Leader.PING does not go through the pipeline at the follower. so a response to the ping does not necessarily mean that the follower is in line with processing the requests. We have to get an end to end response from the follower to see how much it is behind..
        Hide
        Flavio Junqueira added a comment -

        If you throttle at the follower, then the queue on the leader side will build up, and you won't have the problem you mention. The approach I describe seems to be safer as you don't have to wait for the next Leader.PING period to make a decision. It is true that the leader can send a ping at any time it wants, but if the leader is clueless as you say, then it will wait for the next ping period and it won't send any exceptional pings.

        Show
        Flavio Junqueira added a comment - If you throttle at the follower, then the queue on the leader side will build up, and you won't have the problem you mention. The approach I describe seems to be safer as you don't have to wait for the next Leader.PING period to make a decision. It is true that the leader can send a ping at any time it wants, but if the leader is clueless as you say, then it will wait for the next ping period and it won't send any exceptional pings.
        Hide
        Mahadev konar added a comment -

        throttling at the follower is a little tricky. do you mean the follower will stop reading requests ?

        since you do not know what is on the tcp stack —

        when you do a send it might be buffered on either of the tcp stacks on the leader and the follower. so you do not know how many txn's are buffered up the tcp stack. So throttling is a tricky thing with the follower and leader. also the rsponse to leader.PING are critical with session management.

        Show
        Mahadev konar added a comment - throttling at the follower is a little tricky. do you mean the follower will stop reading requests ? since you do not know what is on the tcp stack — when you do a send it might be buffered on either of the tcp stacks on the leader and the follower. so you do not know how many txn's are buffered up the tcp stack. So throttling is a tricky thing with the follower and leader. also the rsponse to leader.PING are critical with session management.
        Hide
        Vishal Kher added a comment -

        Hi Mahadev, Flavio,

        I am trying to understand how shutting down the follower will help in this case.

        Mahadev can you please explain what you man by : "The follower should shut itself down if it is not able to keep up to the leader within some limit so that guarantee of updates can be made to the clients connected to different servers."

        Under what conditions do you expect a shutdown of follower handler to eliminate the problem of follower lagging behind. Are you referring to clients connected to this server that will fail-over their session to another server?

        Thanks.
        -Vishal

        Show
        Vishal Kher added a comment - Hi Mahadev, Flavio, I am trying to understand how shutting down the follower will help in this case. Mahadev can you please explain what you man by : "The follower should shut itself down if it is not able to keep up to the leader within some limit so that guarantee of updates can be made to the clients connected to different servers." Under what conditions do you expect a shutdown of follower handler to eliminate the problem of follower lagging behind. Are you referring to clients connected to this server that will fail-over their session to another server? Thanks. -Vishal
        Hide
        Mahadev konar added a comment -

        exactly vishal. The problem is to avoid clients being connected to a server which can lag behind the leader.

        Show
        Mahadev konar added a comment - exactly vishal. The problem is to avoid clients being connected to a server which can lag behind the leader.
        Hide
        Mahadev konar added a comment -

        not a blocker. Moving it out of 3.4 release.

        Show
        Mahadev konar added a comment - not a blocker. Moving it out of 3.4 release.
        Hide
        Germán Blanco added a comment -

        Hello,
        I am rescueing this old isssue, since it seems to be still present.
        After a few tests, it seems to me that the issue of "lagging behind" affects transactions even more seriously. I haven't been able to find any timeout for proposals waiting for acknowledgement, is that really the case?
        I really think I must be wrong there. Otherwise, since ping handling seems to be done in a different thread, a follower could be locked at some point that has to do with the transaction but reply to pings and it would not be accepting any proposals. Besides, there is no warranty on the time that a transaction takes to be updated in all servers.
        Could anybody please tell me if I have read things wrongly?
        Thanks a lot,
        Germán.

        Show
        Germán Blanco added a comment - Hello, I am rescueing this old isssue, since it seems to be still present. After a few tests, it seems to me that the issue of "lagging behind" affects transactions even more seriously. I haven't been able to find any timeout for proposals waiting for acknowledgement, is that really the case? I really think I must be wrong there. Otherwise, since ping handling seems to be done in a different thread, a follower could be locked at some point that has to do with the transaction but reply to pings and it would not be accepting any proposals. Besides, there is no warranty on the time that a transaction takes to be updated in all servers. Could anybody please tell me if I have read things wrongly? Thanks a lot, Germán.
        Hide
        Germán Blanco added a comment -

        I would like to propose this fix for the problem. I keeps a list of the proposals sent to each Learner and shuts it down if any of the proposals reaches the timeout.
        It also skips sending pings if there are any proposals waiting for acknowledgement.
        According to my understanding the new behaviour fits better with the current description of syncLimit.

        Show
        Germán Blanco added a comment - I would like to propose this fix for the problem. I keeps a list of the proposals sent to each Learner and shuts it down if any of the proposals reaches the timeout. It also skips sending pings if there are any proposals waiting for acknowledgement. According to my understanding the new behaviour fits better with the current description of syncLimit.
        Hide
        Germán Blanco added a comment -

        This patch keeps track of all proposals sent to each Follower and shuts the handler down if one of them is not responded in syncLimit*tickTime.
        It also skips sending the ping if there is any proposal waiting for acknowledgement.

        Show
        Germán Blanco added a comment - This patch keeps track of all proposals sent to each Follower and shuts the handler down if one of them is not responded in syncLimit*tickTime. It also skips sending the ping if there is any proposal waiting for acknowledgement.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581196/ZOOKEEPER-87.patch
        against trunk revision 1463329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581196/ZOOKEEPER-87.patch against trunk revision 1463329. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1466//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        Removing the complain from FindBugs.

        Show
        Germán Blanco added a comment - Removing the complain from FindBugs.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12581563/ZOOKEEPER-87.patch
        against trunk revision 1463329.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581563/ZOOKEEPER-87.patch against trunk revision 1463329. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1469//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        I have updated the patch to the current trunk version.
        Any comments on the proposal?

        Show
        Germán Blanco added a comment - I have updated the patch to the current trunk version. Any comments on the proposal?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12597127/ZOOKEEPER-87_trunk.patch
        against trunk revision 1503101.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597127/ZOOKEEPER-87_trunk.patch against trunk revision 1503101. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1531//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12597270/ZOOKEEPER-87_trunk.patch
        against trunk revision 1503101.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12597270/ZOOKEEPER-87_trunk.patch against trunk revision 1503101. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1533//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        Could anyone please comment on the proposal in the patch?
        In my view it offers the following advantages against the previous proposals in the thread:

        • There is no need to modify the network protocol (adding information to PING requests and responses).
        • It solves the problem based on time difference, and not based on the number of proposals. A limit on the number of pending proposals could still allow for a follower to be behind in terms of time.
        • It seems to map the behavior described already for syncLimit: "Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped."
        Show
        Germán Blanco added a comment - Could anyone please comment on the proposal in the patch? In my view it offers the following advantages against the previous proposals in the thread: There is no need to modify the network protocol (adding information to PING requests and responses). It solves the problem based on time difference, and not based on the number of proposals. A limit on the number of pending proposals could still allow for a follower to be behind in terms of time. It seems to map the behavior described already for syncLimit: "Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped."
        Hide
        Germán Blanco added a comment -

        This is actually the same patch. The last test seemed to failed due to problems in the machine?

        Show
        Germán Blanco added a comment - This is actually the same patch. The last test seemed to failed due to problems in the machine?
        Hide
        Germán Blanco added a comment -

        I changed the "Asignee" because it was the only way I found to submit the patch. I hope that it is not a problem, at least it seems very easy to revert .

        Show
        Germán Blanco added a comment - I changed the "Asignee" because it was the only way I found to submit the patch. I hope that it is not a problem, at least it seems very easy to revert .
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12603795/ZOOKEEPER-87_trunk.patch
        against trunk revision 1524275.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603795/ZOOKEEPER-87_trunk.patch against trunk revision 1524275. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1585//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        Yet a different approach.
        This patch doesn't keep a list of all pending transactions for a Learner. It just keeps track of one transaction and makes sure that this transaction makes it on time. When the ACK for this transaction arrives, it picks anotherone. In the worst case there could be a delay of 2*syncLimit in one transaction, but it is a very cheap (in terms of processing time) way to ensure that the lagging doesn't get out of hand.

        Show
        Germán Blanco added a comment - Yet a different approach. This patch doesn't keep a list of all pending transactions for a Learner. It just keeps track of one transaction and makes sure that this transaction makes it on time. When the ACK for this transaction arrives, it picks anotherone. In the worst case there could be a delay of 2*syncLimit in one transaction, but it is a very cheap (in terms of processing time) way to ensure that the lagging doesn't get out of hand.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12603855/ZOOKEEPER-87.patch
        against trunk revision 1524398.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12603855/ZOOKEEPER-87.patch against trunk revision 1524398. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 4 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1588//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        This patch is cleaner. The test case applies better to the modification. And it leaves some margin in the syncLimit for the test case that failed in the last try with Jenkins.

        Show
        Germán Blanco added a comment - This patch is cleaner. The test case applies better to the modification. And it leaves some margin in the syncLimit for the test case that failed in the last try with Jenkins.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12604034/ZOOKEEPER-87.patch
        against trunk revision 1524398.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 10 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12604034/ZOOKEEPER-87.patch against trunk revision 1524398. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1591//console This message is automatically generated.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12604193/ZOOKEEPER-87.patch
        against trunk revision 1524398.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 10 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12604193/ZOOKEEPER-87.patch against trunk revision 1524398. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1592//console This message is automatically generated.
        Hide
        Germán Blanco added a comment -

        Eighth try, and it finally passes all tests!
        The ZooKeeper regression is a great tool. Not only there are a lot of use cases covered, but some of the test cases are very sensitive to problems that are not related with the main purpose of the test case .

        Show
        Germán Blanco added a comment - Eighth try, and it finally passes all tests! The ZooKeeper regression is a great tool. Not only there are a lot of use cases covered, but some of the test cases are very sensitive to problems that are not related with the main purpose of the test case .
        Hide
        Germán Blanco added a comment -

        By the way, before anybody complains, I did pass the regression locally before submitting the patch and it did work. But timing is an issue and my machine is way faster than the Jenkins run so ...

        Show
        Germán Blanco added a comment - By the way, before anybody complains, I did pass the regression locally before submitting the patch and it did work. But timing is an issue and my machine is way faster than the Jenkins run so ...
        Hide
        Flavio Junqueira added a comment -

        Thanks for picking this one up, German. Given how currentTime is updated, I was wondering if the leader will disconnect from followers if it is idle for some time. It looks like will force the LearnerHandler to shut down if there is a period of inactivity, but perhaps I'm not reading it right.

        Also, please don't delete the patch files you upload, even if you're uploading new ones.

        Show
        Flavio Junqueira added a comment - Thanks for picking this one up, German. Given how currentTime is updated, I was wondering if the leader will disconnect from followers if it is idle for some time. It looks like will force the LearnerHandler to shut down if there is a period of inactivity, but perhaps I'm not reading it right. Also, please don't delete the patch files you upload, even if you're uploading new ones.
        Hide
        Flavio Junqueira added a comment -

        I forgot to mention the following. This is not a blocker for 3.4.6, so if we can't converge on a patch for this one, I'd like to push it to 3.5.0. Is it ok?

        Show
        Flavio Junqueira added a comment - I forgot to mention the following. This is not a blocker for 3.4.6, so if we can't converge on a patch for this one, I'd like to push it to 3.5.0. Is it ok?
        Hide
        Germán Blanco added a comment -

        Thanks to you for taking a look, Flavio!
        If it is idle, then it doesn't do anything. The implementation is a class that has 4 methods:

        • start() : Invoked from LearnerHandler thread.
        • updateProposal() : Invoked from the thread that sends packets. If process has started, and there is no proposal, it starts monitoring that proposal. If there is a proposal already, it updates the "next proposal" to monitor.
        • updateAck() : Invoked from the thread that receives packets. If the ACK corresponds to the current proposal, then it removes the current proposal and if there is a "next proposal" it starts using that one.
        • check() : Invoked from the Leader thread. It returns true if everything is ok, false if there is a timeout.

        It has the following states:

        • NOT_STARTED: It doesn't do anything. Check will always return true. It can only transition to WAITING_FOR_PROPOSAL.
        • WAITING_FOR_PROPOSAL: It expects a new proposal. Check will always return true. As soon as one proposal arrives it will start monitoring it and transition to MONITORING_PROPOSAL.
        • MONITORING_PROPOSAL: It waits for the ACK of the proposal under monitoring. Check will return true unless there is a timeout for the current proposal. It continuously update the "next proposal" with incoming proposals. If the ACK for the current proposal arrives, it will either update the proposal with the "next proposal" or transition to WAITING_FOR_PROPOSAL if there is no "next proposal".

        The monitoring is started when the synchronization phase ends (and the timeout in the socket is updated to syncLimit*tickTime). The processing required in each of the calls to the four functions is minimal, so there should be no disturbance of performance at all.
        Since proposals are executed in order, with this checking the longest time between a proposal and its ACK will be if there is a proposal that comes immediately after the one that is being monitored, and its ACK arrives immediately before the ACK for the "next proposal" after that. That would mean (2 * syncLimit) * tickTime, and then since we check the timeout every 1/2 tickTime, that needs to be added so it is (2'5 * syncLimit) * tickTime. In usual operation, ACKs will take more or less the same for every proposal and the timeout will jump more or less as soon as any of the ACKs takes longer than (syncLimit * tickTime).

        Or at least that is how I intend it to work .

        ... and I don't mind to turn this upside down for anything else that covers the requirement, so I hope that we won't have a problem in converging on a patch ... we'll see.

        Sorry about deleting the patches. It made me unconfortable to leave there something that didn't work. But I will keep in mind that way of working from now on.

        Show
        Germán Blanco added a comment - Thanks to you for taking a look, Flavio! If it is idle, then it doesn't do anything. The implementation is a class that has 4 methods: start() : Invoked from LearnerHandler thread. updateProposal() : Invoked from the thread that sends packets. If process has started, and there is no proposal, it starts monitoring that proposal. If there is a proposal already, it updates the "next proposal" to monitor. updateAck() : Invoked from the thread that receives packets. If the ACK corresponds to the current proposal, then it removes the current proposal and if there is a "next proposal" it starts using that one. check() : Invoked from the Leader thread. It returns true if everything is ok, false if there is a timeout. It has the following states: NOT_STARTED: It doesn't do anything. Check will always return true. It can only transition to WAITING_FOR_PROPOSAL. WAITING_FOR_PROPOSAL: It expects a new proposal. Check will always return true. As soon as one proposal arrives it will start monitoring it and transition to MONITORING_PROPOSAL. MONITORING_PROPOSAL: It waits for the ACK of the proposal under monitoring. Check will return true unless there is a timeout for the current proposal. It continuously update the "next proposal" with incoming proposals. If the ACK for the current proposal arrives, it will either update the proposal with the "next proposal" or transition to WAITING_FOR_PROPOSAL if there is no "next proposal". The monitoring is started when the synchronization phase ends (and the timeout in the socket is updated to syncLimit*tickTime). The processing required in each of the calls to the four functions is minimal, so there should be no disturbance of performance at all. Since proposals are executed in order, with this checking the longest time between a proposal and its ACK will be if there is a proposal that comes immediately after the one that is being monitored, and its ACK arrives immediately before the ACK for the "next proposal" after that. That would mean (2 * syncLimit) * tickTime, and then since we check the timeout every 1/2 tickTime, that needs to be added so it is (2'5 * syncLimit) * tickTime. In usual operation, ACKs will take more or less the same for every proposal and the timeout will jump more or less as soon as any of the ACKs takes longer than (syncLimit * tickTime). Or at least that is how I intend it to work . ... and I don't mind to turn this upside down for anything else that covers the requirement, so I hope that we won't have a problem in converging on a patch ... we'll see. Sorry about deleting the patches. It made me unconfortable to leave there something that didn't work. But I will keep in mind that way of working from now on.
        Hide
        Germán Blanco added a comment -

        I made (at least) two mistakes above:

        • If there are no proposals (idle ZooKeeper), then this thing is not idle ("doing nothing" as in NOT_STARTED) it is in WAITING_FOR_PROPOSAL, returning always true for check and ready to be updated with a new proposal.
        • "since we check the timeout every 1/2 tickTime", means half a ticktime needs to be added, not half a "syncLimit * tickTime". So actually the "longest time between a proposal and its ACK " is (((2 * syncLimit) + 1/2) * tickTime)
          ... isn't there a way to edit comments after one posts them?
        Show
        Germán Blanco added a comment - I made (at least) two mistakes above: If there are no proposals (idle ZooKeeper), then this thing is not idle ("doing nothing" as in NOT_STARTED) it is in WAITING_FOR_PROPOSAL, returning always true for check and ready to be updated with a new proposal. "since we check the timeout every 1/2 tickTime", means half a ticktime needs to be added, not half a "syncLimit * tickTime". So actually the "longest time between a proposal and its ACK " is (((2 * syncLimit) + 1/2) * tickTime) ... isn't there a way to edit comments after one posts them?
        Hide
        Germán Blanco added a comment -

        Any more questions on this one?

        Show
        Germán Blanco added a comment - Any more questions on this one?
        Hide
        Flavio Junqueira added a comment -

        I haven't had a chance to process this one yet, I'm sorry.

        Show
        Flavio Junqueira added a comment - I haven't had a chance to process this one yet, I'm sorry.
        Hide
        Flavio Junqueira added a comment -

        Here are a few comments, German:

        1. Please make the initialization of SyncLimitCheck fields explicit.
        2. Please add blank lines between method definitions (http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-141388.html#487)
        3. The test case does not fail for me when I only apply the test case changes. I haven't looked too closely into it to determine the reason, but I was wondering if it fails for you.
        Show
        Flavio Junqueira added a comment - Here are a few comments, German: Please make the initialization of SyncLimitCheck fields explicit. Please add blank lines between method definitions ( http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-141388.html#487 ) The test case does not fail for me when I only apply the test case changes. I haven't looked too closely into it to determine the reason, but I was wondering if it fails for you.
        Hide
        Germán Blanco added a comment -

        Thanks again for your comments Flavio,
        it did fail, but I hadn't tried it enough. Sometimes it didn't fail.
        There was a problem in the test, it didn't reply to pings from the leader.
        So if the LearnerHandler decided that there was a ping timeout before it decided that the proposal had timeout, then the test case passed (it finishes before it had a chance to notice the other timeout). And the timeout period for both events is the same.
        I will upload the corrected patches with your comments updated in a moment.

        Show
        Germán Blanco added a comment - Thanks again for your comments Flavio, it did fail, but I hadn't tried it enough. Sometimes it didn't fail. There was a problem in the test, it didn't reply to pings from the leader. So if the LearnerHandler decided that there was a ping timeout before it decided that the proposal had timeout, then the test case passed (it finishes before it had a chance to notice the other timeout). And the timeout period for both events is the same. I will upload the corrected patches with your comments updated in a moment.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12605207/ZOOKEEPER-87.patch
        against trunk revision 1526337.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 10 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12605207/ZOOKEEPER-87.patch against trunk revision 1526337. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1600//console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        It looks good to me, but I have one remaining question if you don't mind. Why are you synchronizing the methods in SyncLimitCheck? There is one instance per LearnerHandler so I'm not sure it is necessary.

        Show
        Flavio Junqueira added a comment - It looks good to me, but I have one remaining question if you don't mind. Why are you synchronizing the methods in SyncLimitCheck? There is one instance per LearnerHandler so I'm not sure it is necessary.
        Hide
        Germán Blanco added a comment -

        No problem at all
        If I am not wrong, the methods are called from different threads. It seemed so also in the logs.

        • "ping" is called from the "lead" method in Leader.java that runs within the QuorumPeer thread.
        • "updateProposal" is called from "sendPackets" which has an independent thread created in "startSendingPackets".
        • "start" and "updateAck" are called from "run" in the LearnerHandler thread.
        Show
        Germán Blanco added a comment - No problem at all If I am not wrong, the methods are called from different threads. It seemed so also in the logs. "ping" is called from the "lead" method in Leader.java that runs within the QuorumPeer thread. "updateProposal" is called from "sendPackets" which has an independent thread created in "startSendingPackets". "start" and "updateAck" are called from "run" in the LearnerHandler thread.
        Hide
        Flavio Junqueira added a comment -

        Thanks, German. +1.

        Trunk Committed revision 1526454.
        B3.4 Committed revision 1526461.

        Show
        Flavio Junqueira added a comment - Thanks, German. +1. Trunk Committed revision 1526454. B3.4 Committed revision 1526461.
        Hide
        Hudson added a comment -

        SUCCESS: Integrated in ZooKeeper-trunk #2069 (See https://builds.apache.org/job/ZooKeeper-trunk/2069/)
        ZOOKEEPER-87. Follower does not shut itself down if its too far behind the leader. (German Blanco via fpj) (fpj: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1526454)

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java
        • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumUtil.java
        Show
        Hudson added a comment - SUCCESS: Integrated in ZooKeeper-trunk #2069 (See https://builds.apache.org/job/ZooKeeper-trunk/2069/ ) ZOOKEEPER-87 . Follower does not shut itself down if its too far behind the leader. (German Blanco via fpj) (fpj: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1526454 ) /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumUtil.java
        Hide
        Flavio Junqueira added a comment -

        Closing issues after releasing 3.4.6.

        Show
        Flavio Junqueira added a comment - Closing issues after releasing 3.4.6.

          People

          • Assignee:
            Germán Blanco
            Reporter:
            Mahadev konar
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development