Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Versions: 0.23.1, 0.24.1, 0.25.0, 0.26.0, 0.27.2, 0.28.0
Environment: CentOS 7.1
Sprint: Mesosphere Sprint 32
2
Description
A missing default for the quorum size produced the following master config, in which MESOS_QUORUM is empty:
MESOS_WORK_DIR="/var/lib/mesos/master"
MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"
MESOS_QUORUM=
MESOS_PORT=5050
MESOS_CLUSTER="mesos"
MESOS_LOG_DIR="/var/log/mesos"
MESOS_LOGBUFSECS=1
MESOS_LOGGING_LEVEL="INFO"
As a result, each newly elected leader attempted replica recovery, e.g.:
group.cpp:700] Trying to get '/mesos/log_replicas/0000000012' in ZooKeeper
The recovery eventually failed:
master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Full log from one of the masters: https://gist.github.com/clehene/09a9ddfe49b92a5deb4c1b421f63479e
All masters and zk nodes were reachable over the network.
Once the quorum was configured, the master recovery protocol finished gracefully.
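
A pre-flight check on the generated config could catch this before the masters start. The following is only a minimal sketch and not part of Mesos: parse_env and check_quorum are hypothetical helper names, the master count of 3 is an assumed value (the ticket does not state how many masters were running), and the check assumes the usual rule that the quorum should be a strict majority of the masters.

```python
import shlex


def parse_env(text):
    """Parse whitespace-separated KEY=VALUE pairs, honoring shell-style quotes."""
    env = {}
    for token in shlex.split(text):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not KEY=VALUE
            env[key] = value
    return env


def check_quorum(env, num_masters):
    """Return a list of problems with MESOS_QUORUM; an empty list means it looks sane."""
    raw = env.get("MESOS_QUORUM", "")
    if not raw.strip():
        return ["MESOS_QUORUM is empty or unset"]
    try:
        quorum = int(raw)
    except ValueError:
        return ["MESOS_QUORUM is not an integer: %r" % raw]
    # Assumption: the quorum should be a strict majority of the masters.
    majority = num_masters // 2 + 1
    if quorum < majority or quorum > num_masters:
        return ["MESOS_QUORUM=%d is outside %d..%d for %d masters"
                % (quorum, majority, num_masters, num_masters)]
    return []


# Config from this ticket; num_masters=3 is an assumed value for illustration.
CONFIG = (
    'MESOS_WORK_DIR="/var/lib/mesos/master" '
    'MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos" '
    'MESOS_QUORUM= MESOS_PORT=5050 MESOS_CLUSTER="mesos"'
)

if __name__ == "__main__":
    print(check_quorum(parse_env(CONFIG), num_masters=3))
    # prints ['MESOS_QUORUM is empty or unset']
```

Run against the config above, the check reports the empty MESOS_QUORUM instead of letting the master start and time out during registrar recovery.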