Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-8171

Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop

    XMLWordPrintableJSON

Details

    • Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 69
    • 2

    Description

      Over the past year, the Marathon team has been plagued with an issue that hits our CI builds periodically in which the scheduler driver enters a tight loop, sending 10,000s of SUBSCRIBE calls to the master per second. I turned on debug logging for the client and the server, and it pointed to an issue with the doReliableRegistration method in sched.cpp. Here's the logs:

      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 127.0.1.1:60957 with 8 worker threads
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151785 13791 group.cpp:341] Group process (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152762 13791 group.cpp:700] Trying to get '/mesos/json.info_0000000000' in ZooKeeper
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master (UPID=master@172.16.10.95:32856) is detected
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157347 13787 sched.cpp:336] New master detected at master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to register without authentication
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159785 13789 sched.cpp:869] Will retry registration in 0ns if necessary
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159878 13792 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
      WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159916 13792 sched.cpp:869] Will retry registration in 0ns if necessary
      

      In Marathon, when we are running our tests, we set the failoverTimeout to 0 in order to cause the Mesos master to immediately forget about a framework when it disconnects.

      On line 860 of sched.cpp, the retry-delay is set to 1/10th the failoverTimeout, which provides the best explanation for why the value is 0:

      ./mesos/src/sched/sched.cpp
      
       818 |   void doReliableRegistration(Duration maxBackoff)
       819 |   {
      ...
       851 |     // Bound the maximum backoff by 'REGISTRATION_RETRY_INTERVAL_MAX'.
       852 |     maxBackoff =
       853 |       std::min(maxBackoff, scheduler::REGISTRATION_RETRY_INTERVAL_MAX);
       854 | 
       855 |     // If failover timeout is present, bound the maximum backoff
       856 |     // by 1/10th of the failover timeout.
       857 |     if (framework.has_failover_timeout()) {
       858 |       Try<Duration> duration = Duration::create(framework.failover_timeout());
       859 |       if (duration.isSome()) {
       860 |         maxBackoff = std::min(maxBackoff, duration.get() / 10);
       861 |       }
       862 |     }
       863 | 
       864 |     // Determine the delay for next attempt by picking a random
       865 |     // duration between 0 and 'maxBackoff'.
       866 |     // TODO(vinod): Use random numbers from <random> header.
       867 |     Duration delay = maxBackoff * ((double) os::random() / RAND_MAX);
       868 | 
       869 |     VLOG(1) << "Will retry registration in " << delay << " if necessary";
       870 | 
       871 |     // Backoff.
       872 |     frameworkRegistrationTimer = process::delay(
       873 |         delay, self(), &Self::doReliableRegistration, maxBackoff * 2);
       874 |   }
       875 | 
      

      Reading through the code, it seems that once this value is 0, it will always be zero, since backoff is multiplicative (0 * 2 == 0), and the failover_timeout / 10 limit is applied each time.

      To make matters worse, failoverTimeout of 0 is the default:

      ./mesos/include/mesos/mesos.proto
      
       238 |   // The amount of time (in seconds) that the master will wait for the
       239 |   // scheduler to failover before it tears down the framework by
       240 |   // killing all its tasks/executors. This should be non-zero if a
       241 |   // framework expects to reconnect after a failure and not lose its
       242 |   // tasks/executors.
       243 |   //
       244 |   // NOTE: To avoid accidental destruction of tasks, production
       245 |   // frameworks typically set this to a large value (e.g., 1 week).
       246 |   optional double failover_timeout = 4 [default = 0.0];
       247 | 
      

      I've confirmed that when using 1.4.0 of the Mesos client java jar, this default is used if failoverTimeout is not set:

      @ import $ivy.`org.apache.mesos:mesos:1.4.0`
      import $ivy.$
      
      @ import org.apache.mesos.Protos
      import org.apache.mesos.Protos
      
      @ Protos.FrameworkInfo.newBuilder.setName("test").setUser("user").build.getFailoverTimeout
      res3: Double = 0.0
      

      Attachments

        Activity

          People

            mzhu Meng Zhu
            timcharper Tim Harper
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: