Mesos / MESOS-5576

Masters may drop the first message they send between masters after a network partition


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28.2
    • Fix Version/s: 0.28.3, 1.0.0
    • Labels:
    • Environment: Observed in an OpenStack environment where each master lives on a separate VM.
    • Sprint: Mesosphere Sprint 38
    • Story Points: 5

      Description

      We observed the following situation in a cluster of five masters:

      Time | Master 1                                 | Master 2             | Master 3          | Master 4          | Master 5
      -----+------------------------------------------+----------------------+-------------------+-------------------+-------------------------------------------------------
      0    | Follower                                 | Follower             | Follower          | Follower          | Leader
      1    | Follower                                 | Follower             | Follower          | Follower          | Partitioned from cluster by downing this VM's network
      2    | Elected Leader by ZK                     | Voting               | Voting            | Voting            | Suicides due to lost leadership
      3    | Performs consensus                       | Replies to leader    | Replies to leader | Replies to leader | Still down
      4    | Performs writing                         | Acks to leader       | Acks to leader    | Acks to leader    | Still down
      5    | Leader                                   | Follower             | Follower          | Follower          | Still down
      6    | Leader                                   | Follower             | Follower          | Follower          | Comes back up
      7    | Leader                                   | Follower             | Follower          | Follower          | Follower
      8    | Partitioned in the same way as Master 5  | Follower             | Follower          | Follower          | Follower
      9    | Suicides due to lost leadership          | Elected Leader by ZK | Follower          | Follower          | Follower
      10   | Still down                               | Performs consensus   | Replies to leader | Replies to leader | Doesn't get the message!
      11   | Still down                               | Performs writing     | Acks to leader    | Acks to leader    | Acks to leader
      12   | Still down                               | Leader               | Follower          | Follower          | Follower

      Master 2 sends a series of messages to the recently-restarted Master 5. The first message is dropped, but subsequent messages are not dropped.

      This appears to be due to a stale link between the masters. Before leader election, the replicated log actors create a network watcher, which adds links to masters that join the ZK group:
      https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
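
      For context, here is a minimal sketch, in libprocess terms, of what that watcher does: whenever the set of masters in the ZK group changes, it calls link() on each new member so that later replicated log messages travel over a persistent socket. This is an illustrative reconstruction, not the code in network.hpp; the class name and the shape of the update handler are assumptions.

      // Illustrative sketch only; not the Mesos source.
      #include <set>

      #include <process/pid.hpp>
      #include <process/process.hpp>

      class NetworkWatcherSketch : public process::ProcessBase
      {
      public:
        // Hypothetical handler invoked whenever the ZK group membership changes.
        void update(const std::set<process::UPID>& pids)
        {
          for (const process::UPID& pid : pids) {
            if (members.count(pid) == 0) {
              // link() opens (or reuses) a connection to the remote process and
              // asks libprocess to deliver an ExitedEvent if it later notices
              // that the connection has broken.
              link(pid);
            }
          }
          members = pids;
        }

      private:
        std::set<process::UPID> members;
      };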

      The link from Master 2 to Master 5 does not appear to break when Master 5 goes down, perhaps because of how the network partition was induced (at the hypervisor layer, rather than inside the VM itself).

      When Master 2 tries to send a PromiseRequest to Master 5, we do not observe the expected log message.

      Instead, we see the following log line on Master 2:

      process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is not connected
      

      The broken link is then removed by the libprocess socket_manager, and the subsequent WriteRequest from Master 2 to Master 5 succeeds over a new socket.
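
      One way to avoid losing that first message is to force a fresh connection when a previously seen master rejoins the group, instead of reusing the possibly stale socket. A minimal sketch of that idea follows; it assumes a relink-style overload of link() taking a RemoteConnection option, which should be treated as an assumption about the libprocess API rather than a description of the actual fix.

      // Illustrative sketch only; the RemoteConnection enum and the two-argument
      // link() overload are assumptions, not confirmed API.
      #include <process/pid.hpp>
      #include <process/process.hpp>

      class RelinkingWatcherSketch : public process::ProcessBase
      {
      public:
        // Hypothetical handler for "this master has rejoined the ZK group".
        void rejoined(const process::UPID& pid)
        {
          // Forcing a reconnect drops the stale cached connection, so the next
          // PromiseRequest is written to a freshly opened socket rather than
          // being lost on the dead one.
          link(pid, process::RemoteConnection::RECONNECT);
        }
      };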


              People

              • Assignee: Joseph Wu (kaysoky)
              • Reporter: Joseph Wu (kaysoky)
              • Shepherd: Benjamin Mahler
              • Votes: 0
              • Watchers: 5
