Affects Version/s: 1.2.0
Fix Version/s: None
Sprint: Mesosphere Sprint 68
Currently, when a master re-registers an agent that was marked unreachable, it shuts down all non-partition-aware frameworks on that agent. When a master re-registers an agent that is already registered, it doesn't check that all tasks from the agent's re-registration message are known to it.
It is possible that, due to a transient loss of connectivity, an agent misses the SlaveReregisteredMessage along with the ShutdownFrameworkMessage and thus does not kill the tasks of non-partition-aware frameworks. The master, however, marks the agent as registered and does not re-add the tasks that it expected to be killed. The agent may then re-register again, this time successfully, before being marked unreachable, without ever having terminated the tasks of non-partition-aware frameworks. The master will simply forget those tasks ever existed, because it "removed" them during the previous re-registration.
1. Connection from the master to the agent stops working.
2. Agent doesn't see pings from the master and attempts to re-register.
3. Master sends SlaveReregisteredMessage and ShutdownFrameworkMessage, which don't reach the agent because of the connection failure. Agent is marked registered.
4. Network issue resolves, connection breaks. Agent retries re-registration.
5. Master thinks that the agent has been registered since step (3) and just re-sends SlaveReregisteredMessage. Tasks remain running on the agent.
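The sequence above can be reproduced with a toy model. This is only an illustrative sketch, not Mesos code: `ToyMaster`, `ToyAgent`, `reregister`, and the `link_up` flag are hypothetical names, and an agent's state is reduced to a set of task IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent: just a set of running task IDs."""
    tasks: set = field(default_factory=set)

    def handle_shutdown_framework(self, framework_tasks):
        # A delivered ShutdownFrameworkMessage kills that framework's tasks.
        self.tasks -= framework_tasks


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's view of a single agent."""
    tasks: set = field(default_factory=set)
    agent_registered: bool = False  # False models "marked unreachable"

    def reregister(self, agent, non_partition_aware, link_up):
        if self.agent_registered:
            # Already registered: the acknowledgement is simply re-sent and
            # the tasks reported by the agent are not checked.
            return
        # Agent was unreachable: forget tasks of non-partition-aware
        # frameworks and ask the agent to shut those frameworks down.
        self.tasks -= non_partition_aware
        self.agent_registered = True
        if link_up:  # the message is lost when the link is down
            agent.handle_shutdown_framework(non_partition_aware)


def simulate():
    """Run the scenario and return tasks the agent runs but the master forgot."""
    master = ToyMaster(tasks={"t1", "t2"})
    agent = ToyAgent(tasks={"t1", "t2"})
    non_partition_aware = {"t1"}

    # Step 3: master's messages are lost, but the agent is marked registered.
    master.reregister(agent, non_partition_aware, link_up=False)
    # Steps 4-5: the retry succeeds, but no shutdown is sent this time.
    master.reregister(agent, non_partition_aware, link_up=True)

    return agent.tasks - master.tasks
```

In this model `simulate()` returns the task of the non-partition-aware framework: it is still running on the agent, yet absent from the master's state.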
One possible solution would be to compare the list of tasks that the already registered agent reports in ReregisterSlaveMessage with the list of tasks the master has. In that case, any task that the master doesn't know about should not exist on the agent.
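A minimal sketch of such a comparison follows. It assumes tasks can be identified by a (framework ID, task ID) pair; the function name and the data shapes are hypothetical and do not reflect the actual master code or message layout. What the master then does with the result (for example, asking the agent to terminate those tasks) is left to the caller.

```python
from collections import defaultdict


def unknown_tasks_by_framework(reported, known):
    """Group tasks reported by the agent that the master has no record of.

    Both arguments are iterables of (framework_id, task_id) pairs; this
    representation is an assumption made for the sketch.
    """
    known_set = set(known)
    unknown = defaultdict(list)
    for framework_id, task_id in reported:
        if (framework_id, task_id) not in known_set:
            unknown[framework_id].append(task_id)
    return dict(unknown)


# Example: the master has forgotten task "t1" of framework "f1".
reported_by_agent = [("f1", "t1"), ("f1", "t2"), ("f2", "t3")]
known_to_master = [("f1", "t2"), ("f2", "t3")]
stale = unknown_tasks_by_framework(reported_by_agent, known_to_master)
```

Here `stale` maps framework "f1" to the single task "t1", which is the task that should not exist on the agent.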