MESOS-9426

ZK master detection can become forever pending.

    Details

    • Type: Bug
    • Status: Accepted
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: agent
    • Story Points: 5

      Description

      The following agent logs are observed on an agent that cannot join the cluster after a network partition:

      $ grep ' \(detector\|group\)\.cpp:\| slave\.cpp:59' agent.log
      I1129 06:54:19.485293 11393 detector.cpp:152] Detected a new leader: (id='18')
      I1129 06:54:19.485395 11390 detector.cpp:152] Detected a new leader: (id='18')
      I1129 06:54:19.485473 11390 group.cpp:700] Trying to get '/mesos/json.info_0000000018' in ZooKeeper
      I1129 06:54:19.485400 11386 group.cpp:700] Trying to get '/mesos/json.info_0000000018' in ZooKeeper
      I1129 06:54:43.256572 11392 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:54:43.256640 11392 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:54:47.792897 11392 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:54:47.792951 11392 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:54:55.266717 11386 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:54:55.266741 11386 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:55:04.069279 11386 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:55:04.069341 11386 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:55:12.563385 11392 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:55:12.563474 11392 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:55:21.723659 11393 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:55:21.723685 11393 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:55:27.837906 11392 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:55:27.837945 11392 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:55:58.174341 11389 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:55:58.174399 11389 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:56:12.829675 11386 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:56:12.829730 11386 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:56:45.453513 11387 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
      I1129 06:56:45.455945 11393 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
      I1129 06:56:46.434237 11390 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:56:46.434264 11390 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:56:48.826885 11392 group.cpp:341] Group process (zookeeper-group(2)@XXX.XXX.38.217:5051) reconnected to ZooKeeper
      I1129 06:56:48.826938 11392 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      I1129 06:56:48.831858 11386 group.cpp:341] Group process (zookeeper-group(1)@XXX.XXX.38.217:5051) reconnected to ZooKeeper
      I1129 06:56:48.831902 11386 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
      I1129 06:57:09.853693 11391 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:57:09.853744 11391 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected
      I1129 06:57:39.439877 11393 slave.cpp:5922] No pings from master received within 5mins
      I1129 06:58:25.994642 11386 slave.cpp:5932] Got exited event for master@XXX.XXX.36.94:5050
      W1129 06:58:25.994715 11386 slave.cpp:5937] Master disconnected! Waiting for a new master to be elected

      The hypothesis is that when the leading master and an agent end up on the same side of a network partition while the other two masters form a new quorum, the future returned from the following line can remain pending forever on the partitioned agent:
      https://github.com/apache/mesos/blob/bf4e8b392b3fa58ffdbf5f14ce3f0ba7a1674a0c/src/zookeeper/group.cpp#L316

      One possible resolution is to trigger an update periodically. Another is to have the master check the quorum periodically and terminate itself if it finds itself partitioned, which would trigger a ZK disconnection on any connected agent.
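
To illustrate the first option, here is a minimal sketch of bounding the wait on a detection future and re-issuing the request when it times out. It is not Mesos code: it uses standard-library futures (C++17) rather than libprocess, and the names `awaitWithTimeout`, `detectWithRetry`, and the `detect` callback are hypothetical, introduced only for this example.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <optional>
#include <string>

// Hypothetical helper (not Mesos code): wait on `future` for at most
// `timeout`. Returns the value if the future became ready, or std::nullopt
// if it is still pending.
template <typename T>
std::optional<T> awaitWithTimeout(
    std::future<T>& future,
    std::chrono::milliseconds timeout)
{
  if (future.wait_for(timeout) == std::future_status::ready) {
    return future.get();
  }
  return std::nullopt;
}

// Hypothetical retry loop: `detect(attempt)` is assumed to start a new
// detection request and return a std::future<std::string> for the leader.
// If a request stays pending past `timeout`, it is abandoned and a new one
// is issued, up to `maxAttempts` times.
template <typename Detect>
std::optional<std::string> detectWithRetry(
    Detect detect,
    std::chrono::milliseconds timeout,
    int maxAttempts)
{
  assert(maxAttempts > 0);
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    std::future<std::string> pending = detect(attempt);
    std::optional<std::string> leader = awaitWithTimeout(pending, timeout);
    if (leader.has_value()) {
      return leader;
    }
    // Timed out: drop the stale future and ask again on the next iteration.
  }
  return std::nullopt;
}
```

The point of the design is that a request that never completes turns into a retry instead of a silent hang. Note that this assumes futures obtained from a `std::promise`; a future returned by `std::async` blocks in its destructor, so abandoning it would not work the same way.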

      NOTE: When this ticket was created, I did not yet have the log file from the partitioned master covering the time frame of the network partition. I'll update the ticket once I have more information.

            People

            • Assignee: Unassigned
            • Reporter: Chun-Hung Hsiao (chhsia0)
            • Votes: 0
            • Watchers: 2
