ZooKeeper / ZOOKEEPER-3940

ZooKeeper restart of leader causes all zk nodes to not serve requests


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.6.2
    • Fix Version/s: None
    • Component/s: quorum, server
    • Labels: None

    Description

      We have configured a 3-node ZooKeeper 3.6.2 cluster in a containerized environment running Docker 1.12.1. The cluster setup corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.

      NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the 3.6 branch.
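
      For context, the values quoted in the scenarios below (leader/follower state, zk_synced_followers, and the "This ZooKeeper instance is not currently serving requests" message) were read from each node's four-letter-word interface. The following is a minimal Python sketch of such a check, assuming client port 2181 and that mntr is allowed via 4lw.commands.whitelist in zoo.cfg; it is illustrative only, not the exact tooling we used.

import socket

NODES = ["zoo1", "zoo2", "zoo3"]  # hostnames from our Docker setup

def four_letter_word(host, cmd, port=2181, timeout=5.0):
    """Send a four-letter-word command (e.g. "mntr") and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

for node in NODES:
    try:
        out = four_letter_word(node, "mntr")
    except OSError as exc:
        print(f"{node}: unreachable ({exc})")
        continue
    if "not currently serving requests" in out:
        print(f"{node}: This ZooKeeper instance is not currently serving requests")
        continue
    # mntr output is tab-separated key/value lines
    stats = dict(line.split("\t", 1) for line in out.splitlines() if "\t" in line)
    print(f"{node}: state={stats.get('zk_server_state')} "
          f"synced_followers={stats.get('zk_synced_followers', 'n/a')}")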

      As part of our testing, we restarted each of the ZooKeeper nodes and observed the following behaviour:

      zoo1, zoo2, and zoo3 healthy (zoo1 is leader)

      We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.

      1. simulate patching zoo2

      • restart zoo2
      • zk_synced_followers 1
      • zoo1 leader
      • zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
      • zoo3 healthy
      • waited 5 minutes with no change (see the polling sketch after this scenario)
      • restart zoo3
      • zoo1 leader
      • zk_synced_followers 1
      • zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
      • zoo3 healthy
      • restart zoo2
      • no changes
      • restart zoo3
      • zoo1 leader
      • zk_synced_followers 2
      • zoo2 healthy
      • zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
      • waited 5 minutes and zoo3 returned to healthy
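
      The "waited 5 minutes" steps above amount to polling the leader's mntr output until zk_synced_followers reaches the expected count. A minimal sketch of that wait, under the same assumptions as the snippet above (port 2181, mntr whitelisted); again illustrative, not our exact tooling.

import socket
import time

def mntr(host, port=2181, timeout=5.0):
    """Return the node's mntr output as a dict, or None if it is unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"mntr")
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
    except OSError:
        return None
    text = data.decode("utf-8", errors="replace")
    return dict(line.split("\t", 1) for line in text.splitlines() if "\t" in line)

def wait_for_synced_followers(leader, expected, deadline_s=300, poll_s=10):
    """Poll the leader until zk_synced_followers reaches `expected` or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        stats = mntr(leader) or {}
        try:
            if float(stats.get("zk_synced_followers", 0)) >= expected:
                return True
        except ValueError:
            pass
        time.sleep(poll_s)
    return False

# After restarting zoo2, a healthy 3-node ensemble should report 2 synced
# followers on the leader well within the 5-minute window used above.
print(wait_for_synced_followers("zoo1", expected=2))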

      2. simulate patching zoo3

      • zoo1 leader
      • restart zoo3
      • zk_synced_followers 2
      • zoo1, zoo2, and zoo3 healthy

      3. simulate patching zoo1

      • zoo1 leader
      • restart zoo1
      • zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
      • waited 5 minutes to see if they would recover (Sep 17 14:39 - Sep 17 14:44 in the logs)
      • tried restarting in the order zoo2, zoo3, zoo1 with no change; all still unhealthy (this step was not captured in the log files)

      The third case in the above scenarios is the critical one, since we can no longer bring any of the zk nodes back to a state where they serve requests.
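
      To make case 3 easy to re-run, the reproduction can be scripted as: find the current leader, restart its container, and poll all three nodes until they serve requests again or a timeout elapses. A hedged sketch follows, assuming the Docker container names match the hostnames zoo1/zoo2/zoo3 and the same mntr whitelist as above; this is illustrative, not the exact script we used.

import socket
import subprocess
import time

NODES = ["zoo1", "zoo2", "zoo3"]

def server_state(host, port=2181, timeout=5.0):
    """Return zk_server_state ("leader"/"follower"), or None if the node is
    unreachable or answers "not currently serving requests"."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"mntr")
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
    except OSError:
        return None
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.startswith("zk_server_state") and "\t" in line:
            return line.split("\t", 1)[1]
    return None

# Find and restart the current leader (case 3: "simulate patching zoo1").
leader = next((n for n in NODES if server_state(n) == "leader"), None)
if leader is None:
    raise SystemExit("no node currently reports itself as leader")
print(f"current leader: {leader}")
subprocess.run(["docker", "restart", leader], check=True)

# Poll for up to 5 minutes; in our runs all three nodes stayed in the
# "not currently serving requests" state instead of recovering.
deadline = time.monotonic() + 300
while time.monotonic() < deadline:
    states = {node: server_state(node) for node in NODES}
    print(states)
    if all(states.values()):
        break
    time.sleep(10)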

       

      Ling Mao, this issue may relate to https://issues.apache.org/jira/browse/ZOOKEEPER-3920, which corresponds to the first and second cases above that I am working on with Brittany Barnes.

      Attachments

        1. zoo.cfg
          1 kB
          Stan Henderson
        2. zk-docker-containers.log.zip
          808 kB
          Stan Henderson
        3. nossl-zoo.cfg
          0.4 kB
          Stan Henderson
        4. zk-docker-containers-nossl.log.zip
          41 kB
          Stan Henderson
        5. zoo1-docker-containers.log
          62 kB
          Stan Henderson
        6. zoo2-docker-containers.log
          95 kB
          Stan Henderson
        7. zoo3-docker-containers.log
          90 kB
          Stan Henderson
        8. zoo1-docker-containers.log
          322 kB
          Stan Henderson
        9. zoo.cfg
          0.6 kB
          Stan Henderson
        10. zoo1-follower.log
          95 kB
          Stan Henderson
        11. zoo2-leader.log
          150 kB
          Stan Henderson
        12. zoo3-follower.log
          142 kB
          Stan Henderson

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Stan Henderson (stanhend)

            Dates

              Created:
              Updated:
