Details
- Improvement
- Status: Accepted
- Minor
- Resolution: Unresolved
- None
- None
Description
Scenario:
- master fails over
- an agent host is restarted; the new agent instance attempts to register (not reregister) with Mesos using the same UPID as the previous agent instance, so it is assigned a new agent ID
- frameworks aren't notified about the status of the tasks on the old agent ID until the agent_reregister_timeout expires (10 mins)
This isn't necessarily wrong but it is suboptimal: when the agent attempts to register with the same UPID that was used by the previous agent instance, we know that a reregistration attempt for the old <UPID, agentID> pair will never be seen. Hence we can declare the old agentID to be gone-forever and notify frameworks appropriately, without waiting for the full agent_reregister_timeout to expire.
Note that we already implement the proposed behavior for the case when the master does not fail over (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
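The proposed behavior can be illustrated with a small toy model. This is only a sketch and not Mesos code: the `ToyMaster` class, its fields, and the ID format are hypothetical names made up for illustration, and "notifying frameworks" is reduced to appending the old agent ID to a list. It assumes that, after a failover, the master knows the UPID associated with each agent ID it recovered.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy model of a master that has just failed over (not Mesos code)."""

    # Agents recovered after failover that have not yet reregistered:
    # agent ID -> UPID of the agent process that owned that ID.
    recovered: dict = field(default_factory=dict)
    # Agent IDs declared gone, in the order frameworks were told.
    notified_lost: list = field(default_factory=list)
    next_id: int = 0

    def register(self, upid: str) -> str:
        """Handle a fresh registration (not a reregistration) from `upid`."""
        # A recovered agent ID bound to this UPID can never reregister,
        # so declare it gone now instead of waiting for the timeout.
        stale = [aid for aid, old in self.recovered.items() if old == upid]
        for aid in stale:
            del self.recovered[aid]
            self.notified_lost.append(aid)
        self.next_id += 1
        return f"agent-{self.next_id}"

    def reregister_timeout_expired(self) -> None:
        """Fallback: every agent that never reregistered is declared gone."""
        self.notified_lost.extend(self.recovered)
        self.recovered.clear()


master = ToyMaster(recovered={
    "agent-old": "slave(1)@10.0.0.5:5051",
    "agent-other": "slave(1)@10.0.0.6:5051",
})
new_id = master.register("slave(1)@10.0.0.5:5051")
```

In this sketch the old agent ID is reported as gone as soon as the restarted agent registers, while the unrelated agent still waits for the timeout path.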
Issue Links
- is related to: MESOS-4048 Consider unifying slave timeout behavior between steady state and master failover (Accepted)
- is superseded by: MESOS-6223 Allow agents to re-register post a host reboot (Resolved)