[MESOS-8185] Tasks can be known to the agent but unknown to the master. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.0
Fix Version/s: None
Component/s: None
Labels:
- reliability

Sprint:
Mesosphere Sprint 68

Description

Currently, when the master re-registers an agent that was marked unreachable, it shuts down all non-partition-aware frameworks on that agent. When the master re-registers an agent that is already registered, it does not check that all tasks in the agent's re-registration message are known to it.

Due to a transient loss of connectivity, an agent may miss the SlaveReregisteredMessage along with the ShutdownFrameworkMessage, and thus will not kill the tasks of non-partition-aware frameworks. The master, however, will mark the agent as registered and will not re-add the tasks that it expected to be killed. The agent may then re-register again, this time successfully, before it is marked unreachable, without ever having terminated the tasks of non-partition-aware frameworks. The master will simply forget those tasks ever existed, because it "removed" them during the previous re-registration.

Example scenario:

1. The connection from the master to the agent stops working.
2. The agent doesn't see pings from the master and attempts to re-register.
3. The master sends SlaveReregisteredMessage and ShutdownFrameworkMessage, which don't reach the agent because of the connection failure. The agent is marked registered.
4. The network issue resolves and the connection breaks. The agent retries re-registration.
5. The master thinks that the agent has been registered since step (3) and just re-sends SlaveReregisteredMessage. The tasks remain running on the agent.
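
The scenario above can be reproduced with a toy model. This is only an illustrative sketch, not Mesos code: `ToyMaster`, `ToyAgent`, the `link_up` flag, and the task IDs are all hypothetical names, and the model assumes the agent-to-master direction always works while the master-to-agent direction can silently drop messages.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent: just the set of task IDs it is running."""
    tasks: set = field(default_factory=set)

    def handle_shutdown(self, doomed):
        # Kill the tasks the master asked to shut down.
        self.tasks -= doomed


@dataclass
class ToyMaster:
    """Hypothetical master tracking one agent."""
    registered: bool = False
    known_tasks: set = field(default_factory=set)

    def reregister(self, agent, non_partition_aware, link_up):
        if not self.registered:
            # Agent was unreachable: forget the non-partition-aware tasks
            # and ask the agent to kill them.
            doomed = agent.tasks & non_partition_aware
            self.known_tasks = agent.tasks - doomed
            self.registered = True
            if link_up:
                # The shutdown only takes effect if the message arrives.
                agent.handle_shutdown(doomed)
        # Already registered: the master only re-acknowledges and does not
        # inspect the reported task list (the behaviour described above).


agent = ToyAgent(tasks={"t1", "t2"})
master = ToyMaster()
legacy_tasks = {"t1"}  # tasks of a non-partition-aware framework

master.reregister(agent, legacy_tasks, link_up=False)  # step 3: replies lost
master.reregister(agent, legacy_tasks, link_up=True)   # step 5: plain re-ack

# Tasks still running on the agent that the master no longer tracks.
orphans = agent.tasks - master.known_tasks
print(sorted(orphans))
```

In this model the task `t1` keeps running on the agent after the second re-registration, while the master's view contains only `t2`.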

One possible solution would be to compare the list of tasks that the already-registered agent reports in ReregisterSlaveMessage with the list of tasks the master has. Anything that the master doesn't know about should not exist on the agent.
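
The comparison proposed above amounts to a set difference. The following is a minimal sketch of that idea, not the fix that was actually merged: the function name `unknown_tasks_by_framework` and the representation of a task as a `(framework_id, task_id)` pair are assumptions made for illustration.

```python
from collections import defaultdict


def unknown_tasks_by_framework(reported, known):
    """Group tasks the agent reports but the master does not track.

    Both arguments are iterables of (framework_id, task_id) pairs; this
    shape is a simplification chosen for the example.
    """
    unknown = defaultdict(list)
    # Sorting makes the grouping deterministic.
    for framework_id, task_id in sorted(set(reported) - set(known)):
        unknown[framework_id].append(task_id)
    return dict(unknown)


reported = [("fw-legacy", "t1"), ("fw-aware", "t2")]  # from the agent
known = [("fw-aware", "t2")]                          # master's view

stale = unknown_tasks_by_framework(reported, known)
print(stale)
```

Grouping by framework is one possible design choice: it would let the master issue a single shutdown per affected framework instead of one per task.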

People

Assignee: Ilya

Reporter: Ilya

Shepherd: Benjamin Mahler

Votes: 0

Watchers: 3

Dates

Created: 09/Nov/17 01:09

Updated: 03/Jul/18 17:23

Resolved: 03/Jul/18 17:23