Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.6
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Thanks for the reviews and all the help, guys. Committed revision 1544858.

      Description

      See umbrella jira.

      1. logs.tar.gz
        5 kB
        Germán Blanco
      2. logs2.tar.gz
        6 kB
        Germán Blanco
      3. ZOOKEEPER-1817.patch
        26 kB
        Flavio Junqueira
      4. ZOOKEEPER-1817.patch
        26 kB
        Flavio Junqueira
      5. ZOOKEEPER-1817.patch
        26 kB
        Flavio Junqueira
      6. ZOOKEEPER-1817.patch
        28 kB
        Flavio Junqueira
      7. ZOOKEEPER-1817.patch
        28 kB
        Flavio Junqueira
      8. ZOOKEEPER-1817.patch
        18 kB
        Flavio Junqueira
      9. ZOOKEEPER-1817.patch
        16 kB
        Flavio Junqueira
      10. ZOOKEEPER-1817.patch
        11 kB
        Flavio Junqueira
      11. ZOOKEEPER-1817.patch
        11 kB
        Flavio Junqueira
      12. ZOOKEEPER-1817.patch
        11 kB
        Flavio Junqueira
      13. ZOOKEEPER-1817.patch
        11 kB
        Flavio Junqueira

        Issue Links

          Activity

          Hide
          Flavio Junqueira added a comment -

          Closing issues after releasing 3.4.6.

          Show
          Flavio Junqueira added a comment - Closing issues after releasing 3.4.6.
          Hide
          Raul Gutierrez Segales added a comment -

          Even though I didn't actually test this it looks correct to me (and nit-less ) - so +1. Will do proper testing for https://issues.apache.org/jira/browse/ZOOKEEPER-1818 once https://issues.apache.org/jira/browse/ZOOKEEPER-1810 lands.

          Show
          Raul Gutierrez Segales added a comment - Even though I didn't actually test this it looks correct to me (and nit-less ) - so +1. Will do proper testing for https://issues.apache.org/jira/browse/ZOOKEEPER-1818 once https://issues.apache.org/jira/browse/ZOOKEEPER-1810 lands.
          Hide
          Germán Blanco added a comment -

          +1, you are hearing right

          Show
          Germán Blanco added a comment - +1, you are hearing right
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12615351/ZOOKEEPER-1817.patch
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1801//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12615351/ZOOKEEPER-1817.patch against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1801//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Removed spaces. Germán Blanco, am I hearing a +1?

          Show
          Flavio Junqueira added a comment - Removed spaces. Germán Blanco , am I hearing a +1?
          Hide
          Germán Blanco added a comment -

          Responding to your proposal of removing backwards compatibility code, I agree.
          In my opinion, as long as there is an upgrade path from previous releases to new releases, introducing some mandatory steps (i.e. through 3.4.6 to reach 3.5.X) shouldn't be a big problem.

          Show
          Germán Blanco added a comment - Responding to your proposal of removing backwards compatibility code, I agree. In my opinion, as long as there is an upgrade path from previous releases to new releases, introducing some mandatory steps (i.e. through 3.4.6 to reach 3.5.X) shouldn't be a big problem.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12615327/ZOOKEEPER-1817.patch
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1800//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12615327/ZOOKEEPER-1817.patch against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1800//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Changed the way we compare versions as suggested.

          Show
          Flavio Junqueira added a comment - Changed the way we compare versions as suggested.
          Hide
          Flavio Junqueira added a comment -

          Thanks for bringing this up, you're right. The current predicate works for versions 0x0 and 0x1, but it doesn't seem to do it right moving forward. There is a caveat to it, however. I was thinking of removing the backward compatibility code we have introduce here for 3.5.0 or even 3.4.7 if we have it. The transition path for upgrading from say 3.4.5 to 3.5.0 would have to be through 3.4.6.

          I can make the change you suggest, and it is probably a better idea to make it more general just in case, but my thinking is that this code for backward compatibility should be gone soon. My sense is that we have quite a number of bugs because we have to keep introducing complicated logic to satisfy backward compatibility, so I'd like to try to remove it for future releases. Does it make sense to you?

          Show
          Flavio Junqueira added a comment - Thanks for bringing this up, you're right. The current predicate works for versions 0x0 and 0x1, but it doesn't seem to do it right moving forward. There is a caveat to it, however. I was thinking of removing the backward compatibility code we have introduce here for 3.5.0 or even 3.4.7 if we have it. The transition path for upgrading from say 3.4.5 to 3.5.0 would have to be through 3.4.6. I can make the change you suggest, and it is probably a better idea to make it more general just in case, but my thinking is that this code for backward compatibility should be gone soon. My sense is that we have quite a number of bugs because we have to keep introducing complicated logic to satisfy backward compatibility, so I'd like to try to remove it for future releases. Does it make sense to you?
          Hide
          Germán Blanco added a comment -

          This works very well and it looks excellent overall.
          However, I am starting to have doubts about this comparison in Vote.java:

          +            if (version == other.version) {
          +                return (id == other.id
          +                        && peerEpoch == other.peerEpoch);
          +            } else {
          +                return id == other.id;
          +            }
          

          We know 3.4.5 peerEpoch has a different value than in the rest. But the difference is there only for 3.4.5. I think this comparison should be a bit different, and more in line with the use of backwards compatibility vote in FastLeaderElection.java ...

                                              ToSend notmsg;
                                              if(n.version > 0x0) {
                                                  notmsg = new ToSend(
          

          that would be "if one of the versions is 0x0 and the other is not, then don't use peerEpoch. Otherwise use it" ...

          +            if ((version > 0x0) ^ (other.version > 0x0)) {
          +                return id == other.id;
          +            } else {
          +                return (id == other.id
          +                        && peerEpoch == other.peerEpoch);
          +            }
          
          Show
          Germán Blanco added a comment - This works very well and it looks excellent overall. However, I am starting to have doubts about this comparison in Vote.java: + if (version == other.version) { + return (id == other.id + && peerEpoch == other.peerEpoch); + } else { + return id == other.id; + } We know 3.4.5 peerEpoch has a different value than in the rest. But the difference is there only for 3.4.5. I think this comparison should be a bit different, and more in line with the use of backwards compatibility vote in FastLeaderElection.java ... ToSend notmsg; if(n.version > 0x0) { notmsg = new ToSend( that would be "if one of the versions is 0x0 and the other is not, then don't use peerEpoch. Otherwise use it" ... + if ((version > 0x0) ^ (other.version > 0x0)) { + return id == other.id; + } else { + return (id == other.id + && peerEpoch == other.peerEpoch); + }
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12615112/ZOOKEEPER-1817.patch
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1794//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12615112/ZOOKEEPER-1817.patch against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1794//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Made some improvements on FLE and the test case, please make sure this is ok.

          Show
          Flavio Junqueira added a comment - Made some improvements on FLE and the test case, please make sure this is ok.
          Hide
          Flavio Junqueira added a comment -

          .bq It should be QuorumMaj.class instead of QuorumVerifier.class

          I actually meant to remove it because there is no logging going on there, I added it for my own debugging.

          this has a "+1" from me

          I'd like to work a bit more on the test case, but I don't expect the overall behavior to change. Once I have it, I'll ask you to give it a last look if you don't mind.

          Thank you for the great work

          Thank you for all the help with reviewing and testing, it has been great.

          Show
          Flavio Junqueira added a comment - .bq It should be QuorumMaj.class instead of QuorumVerifier.class I actually meant to remove it because there is no logging going on there, I added it for my own debugging. this has a "+1" from me I'd like to work a bit more on the test case, but I don't expect the overall behavior to change. Once I have it, I'll ask you to give it a last look if you don't mind. Thank you for the great work Thank you for all the help with reviewing and testing, it has been great.
          Hide
          Germán Blanco added a comment -

          Just one additional comment ...
          I think that in this line:

           public class QuorumMaj implements QuorumVerifier {
          +    private static final Logger LOG = LoggerFactory.getLogger(QuorumVerifier.class);
          +    
          

          It should be QuorumMaj.class instead of QuorumVerifier.class
          In terms of behavior, this has a "+1" from me. I will still repeat the tests again in the last version of the patch, just in case.
          Thank you for the great work on this Flavio Junqueira!

          Show
          Germán Blanco added a comment - Just one additional comment ... I think that in this line: public class QuorumMaj implements QuorumVerifier { + private static final Logger LOG = LoggerFactory.getLogger(QuorumVerifier.class); + It should be QuorumMaj.class instead of QuorumVerifier.class In terms of behavior, this has a "+1" from me. I will still repeat the tests again in the last version of the patch, just in case. Thank you for the great work on this Flavio Junqueira !
          Hide
          Raul Gutierrez Segales added a comment -

          (as said before, i have only been testing the upstream version of this ticket)

          Some nits:

          +        if ((state == ServerState.LOOKING) ||
          +                (other.state == ServerState.LOOKING)) {
          +            return (id == other.id
                               && zxid == other.zxid
                               && electionEpoch == other.electionEpoch
                               && peerEpoch == other.peerEpoch);
          +        } else {
          +            if (version == other.version) {
          +                return (id == other.id
          +                        && peerEpoch == other.peerEpoch);
          +            } else {
          +                return id == other.id;
          +            }
          +        } 
          +    }
          

          could be simplified to:

          +        if ((state == ServerState.LOOKING) ||
          +                (other.state == ServerState.LOOKING)) {
          +            return (id == other.id
                               && zxid == other.zxid
                               && electionEpoch == other.electionEpoch
                               && peerEpoch == other.peerEpoch);
          +        } else if (version == other.version) {
          +            return id == other.id && peerEpoch == other.peerEpoch;
          +        }
          +
          +        return id == other.id;
          +    }
          

          In src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java:

          +//import org.apache.zookeeper.server.quorum.QuorumCnxManager;
          

          just delete that line?

          In src/java/test/org/apache/zookeeper/server/quorum/FLEDontCareTest.java I guess testOutofElection is still work in progress because of the commented code?

          Show
          Raul Gutierrez Segales added a comment - (as said before, i have only been testing the upstream version of this ticket) Some nits: + if ((state == ServerState.LOOKING) || + (other.state == ServerState.LOOKING)) { + return (id == other.id && zxid == other.zxid && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch); + } else { + if (version == other.version) { + return (id == other.id + && peerEpoch == other.peerEpoch); + } else { + return id == other.id; + } + } + } could be simplified to: + if ((state == ServerState.LOOKING) || + (other.state == ServerState.LOOKING)) { + return (id == other.id && zxid == other.zxid && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch); + } else if (version == other.version) { + return id == other.id && peerEpoch == other.peerEpoch; + } + + return id == other.id; + } In src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java: +//import org.apache.zookeeper.server.quorum.QuorumCnxManager; just delete that line? In src/java/test/org/apache/zookeeper/server/quorum/FLEDontCareTest.java I guess testOutofElection is still work in progress because of the commented code?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614685/ZOOKEEPER-1817.patch
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1792//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614685/ZOOKEEPER-1817.patch against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1792//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          I forgot some of the log messages I added for debugging, uploading a new patch.

          Show
          Flavio Junqueira added a comment - I forgot some of the log messages I added for debugging, uploading a new patch.
          Hide
          Raul Gutierrez Segales added a comment -

          Sorry guys I couldn't test this - don't have a 3.4 setup handy. Will do proper testing with trunk though (and of course, the nits in both cases . And thanks for testing Germán Blanco.

          Show
          Raul Gutierrez Segales added a comment - Sorry guys I couldn't test this - don't have a 3.4 setup handy. Will do proper testing with trunk though (and of course, the nits in both cases . And thanks for testing Germán Blanco .
          Hide
          Germán Blanco added a comment -

          As far as I can test, this works.
          There seems to be a continuous output every two seconds in the log of the leader:

           WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:QuorumMaj@60] - Set size and half: 3, 1
          
          Show
          Germán Blanco added a comment - As far as I can test, this works. There seems to be a continuous output every two seconds in the log of the leader: WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:QuorumMaj@60] - Set size and half: 3, 1
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614628/ZOOKEEPER-1817.patch
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1791//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614628/ZOOKEEPER-1817.patch against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1791//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Thanks a lot for sharing the logs, Germán Blanco, it helped to spot a bug. I implemented a test case that reproduces the situation in the logs (but with 5 servers instead if 3). The problem with the test case is that it emulates the code, it doesn't really use the FLE code, which is not great moving forward. We may want to separate that out of election part of lookForLeader so that we can test properly.

          Show
          Flavio Junqueira added a comment - Thanks a lot for sharing the logs, Germán Blanco , it helped to spot a bug. I implemented a test case that reproduces the situation in the logs (but with 5 servers instead if 3). The problem with the test case is that it emulates the code, it doesn't really use the FLE code, which is not great moving forward. We may want to separate that out of election part of lookForLeader so that we can test properly.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614552/logs2.tar.gz
          against trunk revision 1543281.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1790//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614552/logs2.tar.gz against trunk revision 1543281. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1790//console This message is automatically generated.
          Hide
          Germán Blanco added a comment -

          This time it is the 3.4 server that can not join 3.4 + 3.4.5 ensemble.
          The first time it succeeds, but the rest it fails. I think it is because the first time it uses the last vote of the last election, but when it has to use the votes of the ensemble, then it fails.

          Show
          Germán Blanco added a comment - This time it is the 3.4 server that can not join 3.4 + 3.4.5 ensemble. The first time it succeeds, but the rest it fails. I think it is because the first time it uses the last vote of the last election, but when it has to use the votes of the ensemble, then it fails.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614476/ZOOKEEPER-1817.patch
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1789//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614476/ZOOKEEPER-1817.patch against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1789//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Here is another attempt based on the previous patch. I'm essentially keeping the vote prior to the update to send in the case we find messages with older versions. If this direction is good, then I need to produce a test case.

          Show
          Flavio Junqueira added a comment - Here is another attempt based on the previous patch. I'm essentially keeping the vote prior to the update to send in the case we find messages with older versions. If this direction is good, then I need to produce a test case.
          Hide
          Flavio Junqueira added a comment -

          ah, I can see where I got it wrong, thanks for checking.

          Show
          Flavio Junqueira added a comment - ah, I can see where I got it wrong, thanks for checking.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614421/logs.tar.gz
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1784//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614421/logs.tar.gz against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1784//console This message is automatically generated.
          Hide
          Germán Blanco added a comment -

          Sorry but no, it doesn't work.
          This attachment is of the logs of the three servers. First election works, second election fails. They are both of a 3.4.5 server trying to join a 3.4 + 3.4.5 ensemble.
          I believe that in order to work, it will be necessary to return the same values for election epoch and epoch as 3.4.5. That can be done with my previous suggestion, but maybe some other way is better.

          Show
          Germán Blanco added a comment - Sorry but no, it doesn't work. This attachment is of the logs of the three servers. First election works, second election fails. They are both of a 3.4.5 server trying to join a 3.4 + 3.4.5 ensemble. I believe that in order to work, it will be necessary to return the same values for election epoch and epoch as 3.4.5. That can be done with my previous suggestion, but maybe some other way is better.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614396/ZOOKEEPER-1817.patch
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1783//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614396/ZOOKEEPER-1817.patch against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1783//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          This is what I had in mind.

          Show
          Flavio Junqueira added a comment - This is what I had in mind.
          Hide
          Flavio Junqueira added a comment -

          Ok, that's bad news, although I think I know how to fix this one. I'll work on a new patch.

          Show
          Flavio Junqueira added a comment - Ok, that's bad news, although I think I know how to fix this one. I'll work on a new patch.
          Hide
          Germán Blanco added a comment -

          This works in almost all cases.
          3.4 is a server running the updated branch 3.4, patch included.
          3.3 is a server running the latest code in branch 3.3.
          Rolling upgrade from 3.3 to 3.4 works.
          Rolling upgrade from 3.4.5 to 3.4 works, unless there is a leader election in the wrong moment.
          That is because a 3.4.5 server is not always able to join an ensemble of a 3.4.5 and a 3.4. However some of the elections do finish. I found two potential causes:
          1 - election epoch reported by 3.4 follower after election is -1, instead of the round of the last election. This seems to be because of the change here:

                                      } else {
                                          /*
                                           * If this server is not looking, but the one that sent the ack
                                           * is looking, then send back what it believes to be the leader.
                                           */
                                          Vote current = self.getCurrentVote();
                                          if(ackstate == QuorumPeer.ServerState.LOOKING){
                                              if(LOG.isDebugEnabled()){
                                                  LOG.debug("Sending new notification. My id =  " +
                                                          self.getId() + " recipient=" +
                                                          response.sid + " zxid=0x" +
                                                          Long.toHexString(current.getZxid()) +
                                                          " leader=" + current.getId());
                                              }
                                              ToSend notmsg = new ToSend(
                                                      ToSend.mType.notification,
                                                      current.getId(),
                                                      current.getZxid(),
          +                                           current.getElectionEpoch(),
          -                                           logicalclock,
                                                      self.getPeerState(),
                                                      response.sid,
                                                      current.getPeerEpoch());
                                              sendqueue.offer(notmsg);
                                          }
                                      }
                                  }
          

          I am afraid this change was introduced by me in ZOOKEEPER-1732. The only purpose of the change was to be able to update the election epoch from FLETest. My assumption was that current.getElectionEpoch() was always the same as logicalclock when this function was called. I see now that this is not the case, and it causes problems. I suggest to put this back to what it was (logicalclock) and fix the test case if required.
          2 - The value of n.round is different (because of the "newEpoch-1" issue). This could be fixed by removing the call to updateElectionVote in Leader.java, and changing the parameter from newEpoch to newEpoch-1 in Learner.java.
          I have tried these two changes and they seem to enable finishing the election for the 3.4.5 server joining the 3.4+3.4.5 ensemble every time.
          I can upload logs, but given the amount of combinations, sending everything would be a mess. If you are interested in the logs of any of the nodes in any of the rolling upgrade test cases, please let me know and I will send them.

          Show
          Germán Blanco added a comment - This works in almost all cases. 3.4 is a server running the updated branch 3.4, patch included. 3.3 is a server running the latest code in branch 3.3. Rolling upgrade from 3.3 to 3.4 works. Rolling upgrade from 3.4.5 to 3.4 works, unless there is a leader election in the wrong moment. That is because a 3.4.5 server is not always able to join an ensemble of a 3.4.5 and a 3.4. However some of the elections do finish. I found two potential causes: 1 - election epoch reported by 3.4 follower after election is -1, instead of the round of the last election. This seems to be because of the change here: } else { /* * If this server is not looking, but the one that sent the ack * is looking, then send back what it believes to be the leader. */ Vote current = self.getCurrentVote(); if(ackstate == QuorumPeer.ServerState.LOOKING){ if(LOG.isDebugEnabled()){ LOG.debug("Sending new notification. My id = " + self.getId() + " recipient=" + response.sid + " zxid=0x" + Long.toHexString(current.getZxid()) + " leader=" + current.getId()); } ToSend notmsg = new ToSend( ToSend.mType.notification, current.getId(), current.getZxid(), + current.getElectionEpoch(), - logicalclock, self.getPeerState(), response.sid, current.getPeerEpoch()); sendqueue.offer(notmsg); } } } I am afraid this change was introduced by me in ZOOKEEPER-1732 . The only purpose of the change was to be able to update the election epoch from FLETest. My assumption was that current.getElectionEpoch() was always the same as logicalclock when this function was called. I see now that this is not the case, and it causes problems. I suggest to put this back to what it was (logicalclock) and fix the test case if required. 2 - The value of n.round is different (because of the "newEpoch-1" issue). This could be fixed by removing the call to updateElectionVote in Leader.java, and changing the parameter from newEpoch to newEpoch-1 in Learner.java. I have tried these two changes and they seem to enable finishing the election for the 3.4.5 server joining the 3.4+3.4.5 ensemble every time. I can upload logs, but given the amount of combinations, sending everything would be a mess. If you are interested in the logs of any of the nodes in any of the rolling upgrade test cases, please let me know and I will send them.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614356/ZOOKEEPER-1817.patch
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1780//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614356/ZOOKEEPER-1817.patch against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1780//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Thanks for reviewing this second time, Raul Gutierrez Segales. In general, I don't mind reformatting a line or two as part of revising a patch, but if all you want to suggest is to reformat of a line that was already there , then I'd rather do it in a separate jira and as part of the jira revisit all log messages. Is it fair?

          Also, since you reported the rolling upgrade issue before, could you make sure that this patch fixes the issue for you?

          Show
          Flavio Junqueira added a comment - Thanks for reviewing this second time, Raul Gutierrez Segales . In general, I don't mind reformatting a line or two as part of revising a patch, but if all you want to suggest is to reformat of a line that was already there , then I'd rather do it in a separate jira and as part of the jira revisit all log messages. Is it fair? Also, since you reported the rolling upgrade issue before, could you make sure that this patch fixes the issue for you?
          Hide
          Raul Gutierrez Segales added a comment -

          (of course, with the proper line wrap for > 80 chars).

          Show
          Raul Gutierrez Segales added a comment - (of course, with the proper line wrap for > 80 chars).
          Hide
          Raul Gutierrez Segales added a comment -

          One more nit (sorry Flavio Junqueira) in:

          -        return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")";
          +        return "(" + id + ", " 
          +        			+ Long.toHexString(zxid) 
          +        			+ ", " + Long.toHexString(peerEpoch) 
          +        			+ ")";
          

          should we encourage String.format instead of concatenation (as we do in LOG statements with {})? I think this is more readable:

          -        return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")";
          +        return String.format("(%d, %s, %s)", id, Long.toHexString(zxid), Long.toHexString(peerEpoch));
          

          What do you think?

          Show
          Raul Gutierrez Segales added a comment - One more nit (sorry Flavio Junqueira ) in: - return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")"; + return "(" + id + ", " + + Long.toHexString(zxid) + + ", " + Long.toHexString(peerEpoch) + + ")"; should we encourage String.format instead of concatenation (as we do in LOG statements with {})? I think this is more readable: - return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")"; + return String.format("(%d, %s, %s)", id, Long.toHexString(zxid), Long.toHexString(peerEpoch)); What do you think?
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614277/ZOOKEEPER-1817.patch
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1776//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614277/ZOOKEEPER-1817.patch against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1776//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          Fixed according to Raul's comments.

          Show
          Flavio Junqueira added a comment - Fixed according to Raul's comments.
          Hide
          Raul Gutierrez Segales added a comment -

          Ah - the rb is https://reviews.apache.org/r/15625/. Though it's having issues - maybe try reloading? I guess reviewboard applies against the git mirrors and there was a lag in Apache's git-svn sync yesterday (i think).

          Show
          Raul Gutierrez Segales added a comment - Ah - the rb is https://reviews.apache.org/r/15625/ . Though it's having issues - maybe try reloading? I guess reviewboard applies against the git mirrors and there was a lag in Apache's git-svn sync yesterday (i think).
          Hide
          Raul Gutierrez Segales added a comment -

          With the mix of inline and reviewboard reviews I am not sure where we should review this one Is there a reviewboard for this one as well or just inline? If there is mind adding the link here for posterity - thanks Flavio Junqueira.

          Show
          Raul Gutierrez Segales added a comment - With the mix of inline and reviewboard reviews I am not sure where we should review this one Is there a reviewboard for this one as well or just inline? If there is mind adding the link here for posterity - thanks Flavio Junqueira .
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12614222/ZOOKEEPER-1817.patch
          against trunk revision 1542355.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1775//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12614222/ZOOKEEPER-1817.patch against trunk revision 1542355. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1775//console This message is automatically generated.
          Hide
          Flavio Junqueira added a comment -

          QA should fail, it is a 3.4 patch.

          Show
          Flavio Junqueira added a comment - QA should fail, it is a 3.4 patch.
          Hide
          Flavio Junqueira added a comment -

          Germán Blanco, Raul Gutierrez Segales, please review and check that it doesn't break rolling upgrades. Also, waiting for your nits, Raul Gutierrez Segales.

          Show
          Flavio Junqueira added a comment - Germán Blanco , Raul Gutierrez Segales , please review and check that it doesn't break rolling upgrades. Also, waiting for your nits, Raul Gutierrez Segales .

            People

            • Assignee:
              Flavio Junqueira
              Reporter:
              Flavio Junqueira
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development