Mesos / MESOS-3532

3 Master HA setup restarts every 3 minutes


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.23.0
    • None
    • None
    • None

    Description

      CentOS 7.1, 3 Node cluster, each host has mesos master/slave and zookeeper setup.

      After I pushed out a bad zoo.cfg (it added 2 extra ZooKeeper hosts that didn't exist), the elected master restarts about every three minutes, and this keeps happening. Even when I have just one of the three masters running, it restarts every 3 minutes.

      I fixed the configs and deleted all the files under /var/log/zookeeper/version-2/ and /var/lib/zookeeper/version-2/. Is there another step I need to take? I feel like ZooKeeper is the issue (it is also where I lack knowledge); this cluster was stable for months until I pushed out the bad zoo.cfg.
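For context on why two nonexistent hosts in zoo.cfg matter, here is a minimal sketch of the majority arithmetic that a ZooKeeper ensemble uses by default. The function names and the scenarios are illustrative assumptions, not output from this cluster, and the sketch is not a confirmed diagnosis of this issue.

```python
def zk_majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def has_quorum(configured: int, reachable: int) -> bool:
    """True if the reachable servers are a majority of the configured ones."""
    return reachable >= zk_majority(configured)


# Hypothetical scenarios: (servers listed in zoo.cfg, servers actually running).
scenarios = {
    "original config, all up": (3, 3),
    "original config, one down": (3, 2),
    "bad config (+2 missing hosts), all real hosts up": (5, 3),
    "bad config (+2 missing hosts), one real host down": (5, 2),
}

for name, (configured, reachable) in scenarios.items():
    status = "quorum" if has_quorum(configured, reachable) else "NO quorum"
    print(f"{name}: need {zk_majority(configured)}, have {reachable} -> {status}")
```

Under these assumptions, listing five servers raises the required majority from 2 to 3, so the three real hosts can no longer tolerate a single ZooKeeper outage.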

      The master logs have this output every second:

      I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
      I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
      I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a replica in EMPTY status
      I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a replica in EMPTY status
      I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a replica in VOTING status
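To see how many replicas report each status, the recover responses can be tallied from the master log. The sketch below is a hypothetical helper (the regular expressions and function name are not part of Mesos); it assumes the glog line layout shown in the excerpt above.

```python
import re
from collections import Counter

# glog layout assumed from the excerpt:
# <level><MMDD> <HH:MM:SS.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)
RECOVER_RESPONSE = re.compile(
    r"Received a recover response from a replica in (\w+) status"
)


def tally_recover_responses(lines):
    """Count recover responses per replica status (e.g. EMPTY, VOTING)."""
    counts = Counter()
    for line in lines:
        parsed = GLOG_LINE.match(line.strip())
        if parsed is None:
            continue  # skip anything that is not a glog line
        response = RECOVER_RESPONSE.search(parsed.group("msg"))
        if response:
            counts[response.group(1)] += 1
    return counts


# The five master log lines quoted in the description.
SAMPLE = """\
I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a replica in VOTING status
"""

counts = tally_recover_responses(SAMPLE.splitlines())
print(dict(counts))
```

On the quoted lines this counts two EMPTY responses and one VOTING response. If the masters run with `--quorum=2` (an assumption; the report does not state the value), a single VOTING replica would be short of that quorum, which may be why recovery keeps retrying.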

      The mesos-slaves don't even register in time:
      I0928 13:55:40.041491 28418 slave.cpp:3087] master@10.251.132.179:5050 exited
      W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
      E0928 13:55:40.250059 28420 socket.hpp:107] Shutdown failed on fd=9: Transport endpoint is not connected [107]
      I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
      I0928 13:55:48.005836 28417 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
      W0928 13:55:48.006597 28417 detector.cpp:444] Leading master master@10.251.132.177:5050 is using a Protobuf binary f...ESOS-2340)
      I0928 13:55:48.006652 28417 detector.cpp:481] A new leading master (UPID=master@10.251.132.177:5050) is detected
      I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at master@10.251.132.177:5050
      I0928 13:55:48.006891 28417 slave.cpp:709] No credentials provided. Attempting to register without authentication
      I0928 13:55:48.006911 28417 slave.cpp:720] Detecting new master
      I0928 13:55:48.006940 28417 status_update_manager.cpp:176] Pausing sending status updates
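The slave log also shows how long each failover takes from the slave's point of view. As a rough sketch, the gap between the "exited" line and the next "Detected a new leader" line can be computed from the glog timestamps. The helper names are hypothetical, and because glog omits the year, the parsed dates fall back to Python's default year, which does not affect the difference.

```python
import re
from datetime import datetime

# Leading glog stamp: <level><MMDD> <HH:MM:SS.uuuuuu>
STAMP = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def glog_time(line):
    """Parse the timestamp at the start of a glog line, or return None."""
    match = STAMP.match(line.strip())
    if match is None:
        return None
    # No year in glog output; strptime fills in its default year.
    return datetime.strptime(f"{match.group(1)} {match.group(2)}",
                             "%m%d %H:%M:%S.%f")


def leader_gap_seconds(lines):
    """Seconds between the first master exit and the next leader detection."""
    exited = None
    for line in lines:
        if exited is None:
            if "master@" in line and line.rstrip().endswith("exited"):
                exited = glog_time(line)
        elif "Detected a new leader" in line:
            detected = glog_time(line)
            if detected is not None:
                return (detected - exited).total_seconds()
    return None  # no complete exit/detection pair found


# Two of the slave log lines quoted in the description.
SAMPLE = """\
I0928 13:55:40.041491 28418 slave.cpp:3087] master@10.251.132.179:5050 exited
I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
"""

gap = leader_gap_seconds(SAMPLE.splitlines())
print(f"failover gap: {gap:.6f} s")
```

Running this over a full slave log, one exit at a time, would show whether the gap is stable across restarts or growing.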

      Attachments

        1. master03.tar
          7.75 MB
          Edward Donahue III
        2. master02.tar
          6.17 MB
          Edward Donahue III
        3. master01.tar
          7.49 MB
          Edward Donahue III

            People

              Assignee: Unassigned
              Reporter: Edward Donahue III (edonahue3rd)
              Votes: 1
              Watchers: 7
