Mesos / MESOS-9054

Scheduler driver hangs on syncing its state with ZooKeeper during Master detection

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: scheduler driver

    Description

      A framework (namely, Marathon) uses the scheduler driver (V0 API) to connect to the Mesos master, but never receives the `registered()` callback. The hanging framework prints the following messages:

      2018-06-26 05:30:23: I0626 05:30:23.899340 14465 sched.cpp:232] Version: 1.4.0
      2018-06-26 05:30:24: I0626 05:30:24.022102 14523 group.cpp:341] Group process (zookeeper-group(1)@10.136.5.234:15101) connected to ZooKeeper
      2018-06-26 05:30:24: I0626 05:30:24.022148 14523 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      2018-06-26 05:30:24: I0626 05:30:24.022166 14523 group.cpp:419] Trying to create path '/mesos' in ZooKeeper

      When the framework calls `scheduler_driver->start()`, the driver creates and spawns a `ZooKeeperMasterDetectorProcess`, which creates a detector of type `zookeeper::Group`. After the detector connects to ZK, `zookeeper::Group::connected()` is called. `Group` then tries to `sync()`, which calls `create()`. At this point `zk->create()` is called, which is a synchronous call: it blocks on `dispatch(...).get()`.

      Since the ZK library or ZK itself might hang, the scheduler driver can get stuck in this state, in which case the framework will never receive any callbacks.
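
      The difference between an unbounded blocking wait and a bounded one can be sketched with the C++ standard library alone. This is only an illustration under stated assumptions, not Mesos code and not a proposed patch: `slowCreate` is a hypothetical stand-in for `zk->create()`, `createWithTimeout` is a hypothetical helper, and `std::future` stands in for the libprocess future returned by `dispatch(...)`.

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <optional>
#include <thread>

// Hypothetical stand-in for a ZooKeeper `create()` call that may be slow.
// Returns 0 on success, similar to a ZooKeeper return code.
int slowCreate(std::chrono::milliseconds delay)
{
  std::this_thread::sleep_for(delay);
  return 0;
}

// Hypothetical helper: runs `slowCreate` on a separate thread and waits
// at most `timeout` for its result. Returns std::nullopt if the call has
// not finished in time, so the caller is never blocked indefinitely.
std::optional<int> createWithTimeout(
    std::chrono::milliseconds delay,
    std::chrono::milliseconds timeout)
{
  auto promise = std::make_shared<std::promise<int>>();
  std::future<int> future = promise->get_future();

  // The worker is detached so that a stuck call cannot block the caller
  // when the local objects go out of scope.
  std::thread([promise, delay]() {
    promise->set_value(slowCreate(delay));
  }).detach();

  // Calling `future.get()` unconditionally here would mirror the pattern
  // described in this issue: the caller blocks for as long as the
  // underlying call takes. Waiting with a deadline bounds that time.
  if (future.wait_for(timeout) != std::future_status::ready) {
    return std::nullopt;
  }

  return future.get();
}
```

      With `using namespace std::chrono_literals;` in scope, `createWithTimeout(10ms, 1000ms)` yields `0`, while `createWithTimeout(500ms, 50ms)` yields `std::nullopt` instead of blocking until the slow call returns. In this sketch the timed-out call keeps running in the background; whether and how to cancel or retry it is a separate design question.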

            People

              Assignee: Unassigned
              Reporter: Andrei Budnik (abudnik)
