MESOS-3532: 3-master HA setup restarts every 3 minutes


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.23.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      CentOS 7.1, 3-node cluster; each host runs a Mesos master, a Mesos slave, and ZooKeeper.

      After I pushed out a bad zoo.cfg (it added 2 extra ZooKeeper hosts that didn't exist), the elected master restarts about every three minutes, and this keeps happening. Even when I have just one of the three masters running, it restarts every 3 minutes.

      I fixed the configs and deleted all the files under /var/log/zookeeper/version-2/ and /var/lib/zookeeper/version-2/. Is there another step I need to take? I feel like ZooKeeper is the issue (it is also where I lack knowledge). This cluster was stable for months until I pushed out the bad zoo.cfg.
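
To double-check that the corrected zoo.cfg no longer lists the nonexistent hosts, something like the following sketch could be run against each node's config. It assumes the usual `server.N=host:port:port` ensemble syntax; the hostnames (`zk01` through `zk05`) and the sample config are hypothetical and not taken from this cluster.

```python
import re

# Matches ensemble entries such as "server.1=zk01:2888:3888".
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+)")


def ensemble_hosts(zoo_cfg_text):
    """Return {server_id: hostname} for every server.N entry in zoo.cfg text."""
    hosts = {}
    for line in zoo_cfg_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out entries
        match = SERVER_LINE.match(line)
        if match:
            hosts[int(match.group(1))] = match.group(2)
    return hosts


def unexpected_hosts(zoo_cfg_text, expected):
    """Return ensemble hosts that are not in the expected set, sorted."""
    return sorted(set(ensemble_hosts(zoo_cfg_text).values()) - set(expected))


# Hypothetical config resembling the bad push: two extra, nonexistent hosts.
SAMPLE = """\
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk01:2888:3888
server.2=zk02:2888:3888
server.3=zk03:2888:3888
server.4=zk04:2888:3888
server.5=zk05:2888:3888
"""

if __name__ == "__main__":
    print(unexpected_hosts(SAMPLE, {"zk01", "zk02", "zk03"}))
```

An empty list on all three nodes would suggest the ensemble definition itself is back to the intended three hosts.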

      The master logs have this output every second:

      I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
      I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
      I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a replica in EMPTY status
      I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a replica in EMPTY status
      I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a replica in VOTING status
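
To summarize output like this across a larger log file, a small sketch along these lines could tally how many recover responses came from replicas in each status. It only counts lines matching the message text shown above and does not diagnose the cause; the function name is made up for illustration.

```python
import re
from collections import Counter

# Message text as it appears in the master log excerpt above.
RESPONSE = re.compile(
    r"Received a recover response from a replica in (\w+) status"
)


def tally_recover_responses(log_text):
    """Count recover responses per replica status (e.g. EMPTY, VOTING)."""
    return Counter(RESPONSE.findall(log_text))


# The five lines quoted in this report.
LOG = """\
I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from a replica in VOTING status
"""

if __name__ == "__main__":
    for status, count in sorted(tally_recover_responses(LOG).items()):
        print(status, count)
```

For the excerpt above this reports two EMPTY responses and one VOTING response.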

      The mesos-slaves don't even register in time:
      I0928 13:55:40.041491 28418 slave.cpp:3087] master@10.251.132.179:5050 exited
      W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
      E0928 13:55:40.250059 28420 socket.hpp:107] Shutdown failed on fd=9: Transport endpoint is not connected [107]
      I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
      I0928 13:55:48.005836 28417 group.cpp:656] Trying to get '/mesos/info_0000000014' in ZooKeeper
      W0928 13:55:48.006597 28417 detector.cpp:444] Leading master master@10.251.132.177:5050 is using a Protobuf binary f...ESOS-2340)
      I0928 13:55:48.006652 28417 detector.cpp:481] A new leading master (UPID=master@10.251.132.177:5050) is detected
      I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at master@10.251.132.177:5050
      I0928 13:55:48.006891 28417 slave.cpp:709] No credentials provided. Attempting to register without authentication
      I0928 13:55:48.006911 28417 slave.cpp:720] Detecting new master
      I0928 13:55:48.006940 28417 status_update_manager.cpp:176] Pausing sending status updates
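
To measure how long a slave goes without a master in logs like these, the timestamps can be compared directly. The sketch below assumes the glog line layout visible above (severity letter, MMDD, time with microseconds, thread id, `file:line]`); since the lines carry no year, a placeholder year is used purely so the date parses. Function names are illustrative.

```python
import re
from datetime import datetime

# Severity letter, MMDD, HH:MM:SS.micros, thread id, "file:line]", message.
GLOG = re.compile(
    r"^\s*[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6}) \d+ \S+\] (.*)$"
)

# The log has no year; 2000 is an arbitrary leap-year placeholder.
PLACEHOLDER_YEAR = "2000"


def parse_line(line):
    """Return (timestamp, message) for a glog line, or None if it doesn't match."""
    match = GLOG.match(line)
    if not match:
        return None
    stamp = datetime.strptime(
        PLACEHOLDER_YEAR + match.group(1), "%Y%m%d %H:%M:%S.%f"
    )
    return stamp, match.group(2)


def failover_gap_seconds(log_text):
    """Seconds between 'Master disconnected' and the next 'New master detected'."""
    lost = None
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if message.startswith("Master disconnected"):
            lost = stamp
        elif message.startswith("New master detected") and lost is not None:
            return (stamp - lost).total_seconds()
    return None


# Two lines copied from the slave log above.
LOG = """\
W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for a new master to be elected
I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at master@10.251.132.177:5050
"""

if __name__ == "__main__":
    print(failover_gap_seconds(LOG))
```

For the excerpt above the gap is roughly 8 seconds.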

        Attachments

        1. master01.tar
          7.49 MB
          Edward Donahue III
        2. master02.tar
          6.17 MB
          Edward Donahue III
        3. master03.tar
          7.75 MB
          Edward Donahue III

              People

              • Assignee: Unassigned
              • Reporter: Edward Donahue III (edonahue3rd)
              • Votes: 1
              • Watchers: 7
