ZOOKEEPER-498: Unending Leader Elections : WAN configuration

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.1, 3.3.0
    • Component/s: leaderElection
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In a WAN configuration, ZooKeeper is endlessly electing, terminating, and re-electing a ZooKeeper leader. The WAN configuration involves two groups: a central DC group of ZK servers with a voting weight of 1, and a group of servers in remote pods with a voting weight of 0.

      What we expect to see is leaders elected only in the DC, with the pods containing only followers. What we are seeing instead is a continuous cycling of leaders. We have seen this consistently with 3.2.0, 3.2.0 + recommended patches (473, 479, 481, 491), and now release 3.2.1.
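
      For reference, this kind of hierarchical quorum is configured with weight.N and group.N entries in zoo.cfg. A minimal sketch, assuming three DC servers and two pod servers - the hostnames and dataDir path are illustrative, not the actual deployment:

      tickTime=2000
      initLimit=10
      syncLimit=5
      dataDir=/var/zookeeper/data
      clientPort=2181
      electionAlg=3

      server.1=dc1-zook01:2888:3888
      server.2=dc1-zook02:2888:3888
      server.3=dc1-zook03:2888:3888
      server.4=pd1-zook01:2888:3888
      server.5=pd1-zook02:2888:3888

      # DC servers carry all the voting weight; pod servers carry none
      weight.1=1
      weight.2=1
      weight.3=1
      weight.4=0
      weight.5=0

      # note: group assignments use '=', not ':' (a pitfall discussed below)
      group.1=1:2:3
      group.2=4:5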

      Attachments

      1. dc-zook-logs-01.tar.gz (80 kB, Todd Greenwood-Geer)
      2. pod-zook-logs-01.tar.gz (109 kB, Todd Greenwood-Geer)
      3. zk498-test.tar.gz (1.29 MB, Patrick Hunt)
      4. zoo.cfg (1 kB, Todd Greenwood-Geer)
      5. ZOOKEEPER-498.patch (30 kB, Flavio Junqueira)
      6. ZOOKEEPER-498.patch (30 kB, Flavio Junqueira)
      7. ZOOKEEPER-498.patch (10 kB, Flavio Junqueira)
      8. ZOOKEEPER-498.patch (4 kB, Flavio Junqueira)
      9. ZOOKEEPER-498-3.2.patch (30 kB, Flavio Junqueira)

        Activity

        Todd Greenwood-Geer added a comment -

        ZooKeeper logs and configuration files:

        dc1-zook01.log
        dc1-zook02.log
        dc1-zook03.log
        dc1-zook04.log
        dc1-zook05.log
        pd1-zook01.log
        pd1-zook02.log
        pd4-zook01.log
        pd4-zook02.log
        zoo.cfg

        Patrick Hunt added a comment -

        Looks to me like 0 weight is still busted. fle0weighttest is actually failing on my machine; however, it's reported as a success:
        ------------- Standard Error -----------------
        Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
        at junit.framework.Assert.fail(Assert.java:47)
        at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
        ------------- ---------------- ---------------

        This is probably because the test is calling assert in a thread other than the main test thread - which JUnit will not track/know about.

        One problem I see with these tests (the 0weight test is the one I looked at): it doesn't have a client attempt to connect to the various servers as part of declaring success. Really we should only consider a test successful (i.e., assert that) if a client can connect to each server in the cluster and change/see changes. As part of fixing this we really need to do a sanity check by testing the various command lines and checking that a client can connect.

        I'm not even sure FLEnewepochtest/fletest/etc. are passing either. New epoch seems to just thrash...

        Also, I tried 3- and 5-server quorums "by hand from the command line" with 0 weight, and they see similar issues to what Todd is seeing.

        This is happening for me on both the trunk and the 3.2 branch source.
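
        To illustrate the pitfall: JUnit only notices failures thrown on the thread running the test method, so an AssertionFailedError raised in a spawned thread vanishes. A minimal sketch of one common workaround - capturing the failure and rethrowing it on the main thread; the names are illustrative, not the actual FLEZeroWeightTest code:

        import java.util.concurrent.atomic.AtomicReference;

        public class WorkerAssertExample {
            public static void main(String[] args) throws Exception {
                // Collect any failure raised on the worker thread.
                final AtomicReference<Throwable> failure = new AtomicReference<Throwable>();

                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            // ... the election checks under test would run here ...
                            throw new AssertionError("Elected zero-weight server");
                        } catch (Throwable t) {
                            failure.set(t);
                        }
                    }
                });
                worker.start();
                worker.join();

                // Rethrow on the main thread, where the test runner can see it.
                if (failure.get() != null) {
                    throw new AssertionError(failure.get());
                }
            }
        }

        And a sketch of the suggested client sanity check, using the public ZooKeeper client API - the connect strings and znode path are illustrative:

        import java.util.concurrent.CountDownLatch;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.Watcher.Event.KeeperState;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class QuorumSanityCheck {
            public static void main(String[] args) throws Exception {
                // One entry per server in the ensemble (ports are illustrative)
                String[] servers = { "localhost:2181", "localhost:2182", "localhost:2183" };

                for (String server : servers) {
                    final CountDownLatch connected = new CountDownLatch(1);
                    ZooKeeper zk = new ZooKeeper(server, 3000, new Watcher() {
                        public void process(WatchedEvent event) {
                            if (event.getState() == KeeperState.SyncConnected) {
                                connected.countDown();
                            }
                        }
                    });
                    connected.await();

                    // A write through this server proves it is wired into the quorum...
                    String path = zk.create("/sanity-", new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
                    // ...and a read proves the change is visible.
                    zk.getData(path, false, null);
                    zk.close();
                }
            }
        }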

        Patrick Hunt added a comment -

        I attached zk498-test.tar.gz - this is a 5-server config (two zero-weight) that fails to achieve quorum.

        Run start.sh/stop.sh and check out the individual logs for details.

        Patrick Hunt added a comment -

        Please fix the following as well - incorrect logging levels are being used in the quorum code, for example:

        2009-08-05 15:17:02,733 - ERROR [WorkerSender Thread:QuorumCnxManager@341] - There is a connection for server 1
        2009-08-05 15:17:02,753 - ERROR [WorkerSender Thread:QuorumCnxManager@341] - There is a connection for server 2

        This should be logged at INFO, not ERROR.
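
        The fix is a one-line change at the log call sites. A sketch, assuming a log4j-style logger named LOG and a server-id variable named sid (both assumptions, not quotes from the actual QuorumCnxManager code):

        // Before: a benign, expected condition logged at ERROR
        LOG.error("There is a connection for server " + sid);

        // After: the same message at a level that matches its severity
        LOG.info("There is a connection for server " + sid);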

        Patrick Hunt added a comment -

        Todd, I did see an issue with your config. It's not:

        group.1:1:2:3

        rather it's:

        group.1=1:2:3

        (should be = not : )

        Regardless though - even after I fix this, it's still not forming a cluster properly; we're still looking.

        Patrick Hunt added a comment -

        There are docs in the source code that provide good low-level detail on the flex quorum implementation.

        HOWEVER, there are NO docs in the Ops guide detailing user-level flex quorum operation.

        We need to add docs (as part of this fix) to Forrest detailing how to operate/troubleshoot/debug flex quorum.

        Flavio Junqueira added a comment -

        Pat, we have a description of how to configure it in the "Cluster options" section of the Administrator guide. We are missing an example, which is in the source code, as you point out.

        Flavio Junqueira added a comment -

        I have generated a patch for this issue. I verified that I didn't do the correct checks in ZOOKEEPER-491, so I try to fix that in this patch. I have also modified the test to fix the problem with the fail assertion, and I have inspected the logs to see if it is behaving as expected. I can see no problem at this time with this patch.

        If someone else is interested in checking it out, please do it.

        Mahadev konar added a comment -

        Flavio, can you include the example in the Forrest docs? It would be good for folks using it. It gets quite confusing when using flexible quorums; examples/docs should help.

        thanks

        Flavio Junqueira added a comment -

        Have to add documentation and fix the test to verify that a leader in fact exercises its role as a leader.

        Flavio Junqueira added a comment -

        Uploading a patch that works for me. It still doesn't fix the unit test, but others may want to give it a try.

        Flavio Junqueira added a comment -

        This patch includes documentation and a reimplementation of HierarchicalQuorumTest.

        The new implementation of HierarchicalQuorumTest is based on QuorumBase, the main differences being that HQT uses hierarchical quorums and FLE for leader election. It uses testHammerBasic of ClientTest to verify that upon the election of a leader the ensemble works as expected.

        When I initially implemented the test, it was failing to terminate because FLE was failing to shut down properly. I implemented some modifications to FLE to make sure that it shuts down correctly.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415848/ZOOKEEPER-498.patch
        against trunk revision 802108.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/177/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/177/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/177/console

        This message is automatically generated.

        Patrick Hunt added a comment -

        FWIW: I tested this out on trunk with a few different configurations and it seems to have fixed the issue(s) seen before!

        9 servers : OK
        9 servers: weight 1:1:1:1:1:0:0:0:0 : OK
        9 servers: weight 1:1:1:1:1:0:0:0:0 3 groups 1:2:3:4:5 6:7 8:9 : OK
        9 servers: weight 1:1:1:1:1:0:0:0:0 1 group 1:2:3:4:5:6:7:8:9 : OK

        Todd Greenwood-Geer added a comment -

        Patrick,
        Thanks for the update. I'm closely following the dev alias and appreciate the effort the ZK team is putting in. For the time being, I'll stick with 3.1.1 and solve our WAN issues with an ensemble synchronizer.

        I'm in the middle of writing that bit right now.

        BTW - Should I succeed in convincing my company to allow me to open source various components that I've written (on top of zookeeper), what is the process for that?

        -Todd

        Patrick Hunt added a comment -

        I also tried this (Todd's original configuration) and it also works with this patch:

        9 servers: weight 1:1:1:1:1:0:0:0:0 2 groups 1:2:3:4:5 6:7:8:9 : OK

        server 1 config:
        ---------------------
        tickTime=2000
        initLimit=10
        syncLimit=5
        dataDir=./s1/data
        clientPort=2181
        electionAlg=3

        server.1=localhost:3181:4181
        server.2=localhost:3182:4182
        server.3=localhost:3183:4183
        server.4=localhost:3184:4184
        server.5=localhost:3185:4185
        server.6=localhost:3186:4186
        server.7=localhost:3187:4187
        server.8=localhost:3188:4188
        server.9=localhost:3189:4189

        weight.1=1
        weight.2=1
        weight.3=1
        weight.4=1
        weight.5=1
        weight.6=0
        weight.7=0
        weight.8=0
        weight.9=0

        group.1=1:2:3:4:5
        group.2=6:7:8:9

        Patrick Hunt added a comment -

        Todd, we appreciate your help and patience while we straighten this issue out. Thanks!

        Patrick Hunt added a comment -

        Todd, the basic process from our end is that you should enter a JIRA and attach a patch. If you are contributing outside core, then you probably want to add your stuff to src/contrib:

        http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute

        Patrick Hunt added a comment -

        Cancelling patch - I'm still seeing logs at the incorrect level:

        s9/zoo.log:2009-08-07 13:54:16,154 - ERROR [WorkerSender Thread:QuorumCnxManager@332] - There is a connection for server 8
        s9/zoo.log:2009-08-07 13:54:16,244 - ERROR [WorkerSender Thread:QuorumCnxManager@332] - There is a connection for server 6
        s9/zoo.log:2009-08-07 13:54:16,614 - ERROR [WorkerSender Thread:QuorumCnxManager@332] - There is a connection for server 7
        s9/zoo.log:2009-08-07 13:55:47,355 - ERROR [WorkerSender Thread:QuorumCnxManager@332] - There is a connection for server 1

        Flavio Junqueira added a comment -

        Fixing log level.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415916/ZOOKEEPER-498.patch
        against trunk revision 802188.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/180/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/180/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/180/console

        This message is automatically generated.

        Patrick Hunt added a comment -

        Looks good to me. I tested out a number of quorum configs (2/3/4/5/7/9) with various weights/groups (granted, not exhaustive), and the quorum always formed. I was able to connect a client, and I also tried stopping/starting servers to ensure they rejoin the quorum.

        Also - the logging seems better, no longer errors for things that are not errors.

        Patrick Hunt added a comment -

        BTW, my testing was on trunk. I'll give the branch a try too and report if I find any issues (otherwise assume it's ok).

        Mahadev konar added a comment -

        Flavio, can you upload a patch for 3.2 as well?

        Benjamin Reed added a comment -

        +1, looks good. When setting the stop flags, you should really do an interrupt to wake up the wait, but that will cause a message to be printed to stdout. I'll open another JIRA to fix that.
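
        A minimal sketch of the shutdown pattern described here - a stop flag plus an interrupt to wake a blocked wait. The class and field names are illustrative, not the actual FLE code:

        public class StoppableWorker extends Thread {
            // volatile so the worker sees the flag set from another thread
            private volatile boolean stop = false;

            public void shutdown() {
                stop = true;
                // Wake the worker if it is blocked in wait(); the interrupt
                // surfaces as the InterruptedException handled below.
                this.interrupt();
            }

            @Override
            public void run() {
                while (!stop) {
                    try {
                        synchronized (this) {
                            wait(1000); // stand-in for waiting on election notifications
                        }
                    } catch (InterruptedException e) {
                        // fall through; the loop re-checks the stop flag
                    }
                }
            }
        }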

        Flavio Junqueira added a comment -

        Attaching 3.2 version of the patch.

        Mahadev konar added a comment -

        I just committed this. Thanks, Flavio!

        Hudson added a comment -

        Integrated in ZooKeeper-trunk #413 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/413/)
        ZOOKEEPER-498. Unending Leader Elections : WAN configuration (flavio via mahadev)


          People

          • Assignee: Flavio Junqueira
          • Reporter: Todd Greenwood-Geer
          • Votes: 0
          • Watchers: 0
