Mesos / MESOS-5396

After failover, master does not remove agents with same UPID.


Details

    • Type: Improvement
    • Status: Accepted
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: master

    Description

      Scenario:

      • The master fails over.
      • An agent host is restarted. The new agent instance attempts to register (not reregister) with the master using the same UPID as the previous instance, so it is assigned a new agent ID.
      • Frameworks are not notified about the status of the tasks on the old agent ID until agent_reregister_timeout expires (10 mins).

      This isn't necessarily wrong but it is suboptimal: when the agent attempts to register with the same UPID that was used by the previous agent instance, we know that a reregistration attempt for the old <UPID, agentID> pair will never be seen. Hence we can declare the old agentID to be gone-forever and notify frameworks appropriately, without waiting for the full agent_reregister_timeout to expire.

      Note that we already implement the proposed behavior for the case when the master does not fail over (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
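
      To make the proposal concrete, here is a minimal toy model of the bookkeeping it implies. It is a sketch, not Mesos code: ToyMaster, recover, registerAgent, onReregisterTimeout and goneForever are hypothetical names, and it assumes the failed-over master can associate each recovered agent ID with the UPID the agent previously used.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only: ToyMaster and its members are hypothetical names used for
// illustration and are not part of the Mesos code base.
class ToyMaster {
public:
  // Record an agent recovered after a master failover. The master expects
  // it to reregister as `agentId` from the address `upid` (assumption: the
  // previous UPID is known to the new master).
  void recover(const std::string& upid, const std::string& agentId) {
    recovered_[upid] = agentId;
  }

  // Handle a fresh registration (not a reregistration) from `upid`.
  // Returns the newly assigned agent ID.
  std::string registerAgent(const std::string& upid) {
    std::map<std::string, std::string>::iterator it = recovered_.find(upid);
    if (it != recovered_.end()) {
      // A new agent instance now owns this UPID, so the old
      // <UPID, agentID> pair can never reregister: report it immediately
      // instead of waiting for the timeout.
      goneForever_.push_back(it->second);
      recovered_.erase(it);
    }
    return "new-agent-" + std::to_string(++nextId_);
  }

  // Fallback path: the reregistration timeout fired, so every recovered
  // agent that is still outstanding is declared gone.
  void onReregisterTimeout() {
    for (const auto& entry : recovered_) {
      goneForever_.push_back(entry.second);
    }
    recovered_.clear();
  }

  // Agent IDs that frameworks would be told are gone, in reporting order.
  const std::vector<std::string>& goneForever() const { return goneForever_; }

private:
  std::map<std::string, std::string> recovered_;  // UPID -> old agent ID
  std::vector<std::string> goneForever_;
  int nextId_ = 0;
};
```

      In this model, registering from a UPID that matches a recovered agent reports the old agent ID right away, while recovered agents on other UPIDs are still only reported when the timeout path runs.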


            People

              Unassigned
              Neil Conway (neilc)
              Vinod Kone
              Votes: 0
              Watchers: 6
