Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Fix Version/s: 1.1.3, 1.2.2, 1.3.1, 1.4.0
Sprint: Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 69
Story Points: 2
Description
Over the past year, the Marathon team has been plagued by an issue that periodically hits our CI builds, in which the scheduler driver enters a tight loop and sends tens of thousands of SUBSCRIBE calls to the master per second. I turned on debug logging for both the client and the server, and it pointed to an issue with the doReliableRegistration method in sched.cpp. Here are the logs:
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 127.0.1.1:60957 with 8 worker threads
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151785 13791 group.cpp:341] Group process (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152762 13791 group.cpp:700] Trying to get '/mesos/json.info_0000000000' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master (UPID=master@172.16.10.95:32856) is detected
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157347 13787 sched.cpp:336] New master detected at master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to register without authentication
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159785 13789 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159878 13792 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159916 13792 sched.cpp:869] Will retry registration in 0ns if necessary
In Marathon, when running our tests, we set the failoverTimeout to 0 so that the Mesos master immediately forgets about a framework when it disconnects.
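For illustration, here's a rough sketch of what that test configuration looks like when building a FrameworkInfo via the Mesos protobuf API from Scala (this is not Marathon's actual test code; the framework name is hypothetical):

import org.apache.mesos.Protos

// Hypothetical test setup: a failoverTimeout of 0 tells the master to
// tear the framework down as soon as the scheduler disconnects.
val testFrameworkInfo = Protos.FrameworkInfo.newBuilder
  .setName("integration-test")   // hypothetical name
  .setUser("user")
  .setFailoverTimeout(0.0)       // seconds; 0 means "forget me immediately"
  .build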
On line 860 of sched.cpp, the maximum backoff is bounded by 1/10th of the failoverTimeout, which explains why the retry delay is 0:
./mesos/src/sched/sched.cpp
818 |   void doReliableRegistration(Duration maxBackoff)
819 |   {
        ...
851 |     // Bound the maximum backoff by 'REGISTRATION_RETRY_INTERVAL_MAX'.
852 |     maxBackoff =
853 |       std::min(maxBackoff, scheduler::REGISTRATION_RETRY_INTERVAL_MAX);
854 |
855 |     // If failover timeout is present, bound the maximum backoff
856 |     // by 1/10th of the failover timeout.
857 |     if (framework.has_failover_timeout()) {
858 |       Try<Duration> duration = Duration::create(framework.failover_timeout());
859 |       if (duration.isSome()) {
860 |         maxBackoff = std::min(maxBackoff, duration.get() / 10);
861 |       }
862 |     }
863 |
864 |     // Determine the delay for next attempt by picking a random
865 |     // duration between 0 and 'maxBackoff'.
866 |     // TODO(vinod): Use random numbers from <random> header.
867 |     Duration delay = maxBackoff * ((double) os::random() / RAND_MAX);
868 |
869 |     VLOG(1) << "Will retry registration in " << delay << " if necessary";
870 |
871 |     // Backoff.
872 |     frameworkRegistrationTimer = process::delay(
873 |         delay, self(), &Self::doReliableRegistration, maxBackoff * 2);
874 |   }
875 |
Reading through the code, it seems that once this value is 0, it will always be zero, since backoff is multiplicative (0 * 2 == 0), and the failover_timeout / 10 limit is applied each time.
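To make the arithmetic concrete, here's a standalone Scala sketch of the bound-and-double logic above (the initial backoff and retry-interval constants are stand-ins, not the actual Mesos values):

import scala.util.Random

// Standalone sketch of the backoff logic in doReliableRegistration:
// bound maxBackoff, then bound it again by failoverTimeout / 10, pick a
// random delay in [0, maxBackoff], and double maxBackoff for next time.
def simulateBackoffs(failoverTimeoutSecs: Double, attempts: Int): Seq[Double] = {
  val retryIntervalMax = 60.0   // stand-in for REGISTRATION_RETRY_INTERVAL_MAX (seconds)
  var maxBackoff = 2.0          // stand-in for the initial backoff (seconds)
  (1 to attempts).map { _ =>
    maxBackoff = math.min(maxBackoff, retryIntervalMax)
    maxBackoff = math.min(maxBackoff, failoverTimeoutSecs / 10)
    val delay = maxBackoff * Random.nextDouble()
    maxBackoff *= 2             // 0 * 2 == 0, so the bound never recovers
    delay
  }
}

println(simulateBackoffs(0.0, 5))      // all delays are 0.0 -> tight SUBSCRIBE loop
println(simulateBackoffs(604800.0, 5)) // non-zero, growing delays, as intended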
To make matters worse, a failoverTimeout of 0 is the default:
./mesos/include/mesos/mesos.proto
  // The amount of time (in seconds) that the master will wait for the
  // scheduler to failover before it tears down the framework by
  // killing all its tasks/executors. This should be non-zero if a
  // framework expects to reconnect after a failure and not lose its
  // tasks/executors.
  //
  // NOTE: To avoid accidental destruction of tasks, production
  // frameworks typically set this to a large value (e.g., 1 week).
  optional double failover_timeout = 4 [default = 0.0];
I've confirmed that, when using version 1.4.0 of the Mesos Java client jar, this default is used if failoverTimeout is not set:
@ import $ivy.`org.apache.mesos:mesos:1.4.0`
import $ivy.$

@ import org.apache.mesos.Protos
import org.apache.mesos.Protos

@ Protos.FrameworkInfo.newBuilder.setName("test").setUser("user").build.getFailoverTimeout
res3: Double = 0.0
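Per the NOTE in the proto comment above, the obvious framework-side workaround is to set failoverTimeout explicitly to a large non-zero value so the registration backoff bound never collapses to 0. A rough sketch (hypothetical framework name), again using the Java protobuf API from Scala:

import org.apache.mesos.Protos

// Explicitly set failover_timeout to one week (in seconds), as the
// mesos.proto comment suggests for production frameworks.
val oneWeekSecs = 7 * 24 * 60 * 60   // 604800
val frameworkInfo = Protos.FrameworkInfo.newBuilder
  .setName("my-framework")           // hypothetical name
  .setUser("user")
  .setFailoverTimeout(oneWeekSecs.toDouble)
  .build

frameworkInfo.getFailoverTimeout     // 604800.0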