[MESOS-9556] Establish a well-defined agent state diagram - ASF JIRA

Details

Type: Improvement
Status: Accepted
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: master
Labels:

Epic Link: Mesos Agent Lifecycle
Story Points: 8

Description

The agent's lifecycle is currently not well-defined. Some agent states are not represented by distinct agent state values in the code, and no documentation clearly lays out the state diagram for an agent, including the events that transition an agent from one state to another.

We should design this state diagram to ensure that all agents are always in a well-defined state which is represented in the code and visible to users via our APIs.
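As a rough illustration of what "well-defined" could mean in code, the sketch below encodes a state diagram as an explicit transition table, so that any (state, event) pair not listed is rejected rather than silently handled. The state names, event names, and the `transition()` helper are all hypothetical, chosen only for this example; they are not taken from the Mesos code base, and the actual set of states is what the design doc would need to decide.

```cpp
#include <map>
#include <utility>

// Hypothetical agent states and events, for illustration only.
enum class AgentState { REGISTERED, UNREACHABLE, DRAINING, GONE };
enum class AgentEvent {
  HEALTH_CHECK_FAILED,
  REREGISTERED,
  DRAIN_STARTED,
  MARKED_GONE
};

// Looks up (from, event) in an explicit transition table. On success,
// writes the next state to *to and returns true. Returns false, leaving
// *to untouched, when the diagram does not define that transition.
inline bool transition(AgentState from, AgentEvent event, AgentState* to)
{
  typedef std::pair<AgentState, AgentEvent> Key;
  static const std::map<Key, AgentState> table = {
    {{AgentState::REGISTERED, AgentEvent::HEALTH_CHECK_FAILED},
     AgentState::UNREACHABLE},
    {{AgentState::REGISTERED, AgentEvent::DRAIN_STARTED},
     AgentState::DRAINING},
    {{AgentState::UNREACHABLE, AgentEvent::REREGISTERED},
     AgentState::REGISTERED},
    {{AgentState::REGISTERED, AgentEvent::MARKED_GONE}, AgentState::GONE},
    {{AgentState::UNREACHABLE, AgentEvent::MARKED_GONE}, AgentState::GONE},
    {{AgentState::DRAINING, AgentEvent::MARKED_GONE}, AgentState::GONE},
  };

  std::map<Key, AgentState>::const_iterator it =
    table.find(Key(from, event));
  if (it == table.end()) {
    return false; // Undefined transition: the caller must handle it.
  }

  *to = it->second;
  return true;
}
```

Keeping the table in one place means the documented diagram and the code can be compared entry by entry, and an event arriving in an unexpected state becomes an explicit error path instead of an implicit one.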

This work will include examining the Master::_removeSlave() function, which currently handles three cases of agent removal:

Maintenance is started on an agent via the 'startMaintenance()' handler
An agent submits a new registration from a previously known IP:port via the _registerSlave() method (aka the 'deleted latest symlink' case)
An agent shuts itself down via an UnregisterSlaveMessage (aka the SIGUSR1 case)

In these cases, the agent is not transitioned to a new state in the master; it is simply removed. We should define agent states for these cases and ensure that the master stores the corresponding agent IDs and/or agent infos.
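One possible shape for this, sketched under assumptions: each of the three removal cases maps to a named state, and the master keeps a small record per removed agent instead of dropping it. Everything below (`RemovalReason`, `RemovedState`, `RemovedAgentRecord`, `RemovedAgents`) is a hypothetical name invented for this sketch and does not exist in Mesos; a plain string stands in for the agent ID and a hostname stands in for the agent info.

```cpp
#include <map>
#include <string>

// The three removal cases currently handled by Master::_removeSlave().
enum class RemovalReason {
  MAINTENANCE_STARTED,
  REPLACED_BY_NEW_REGISTRATION,
  UNREGISTERED
};

// Hypothetical states an agent could be left in after removal.
enum class RemovedState { DOWN_FOR_MAINTENANCE, SUPERSEDED, SHUT_DOWN };

// Maps each removal case to the state the agent would be left in.
inline RemovedState stateFor(RemovalReason reason)
{
  switch (reason) {
    case RemovalReason::MAINTENANCE_STARTED:
      return RemovedState::DOWN_FOR_MAINTENANCE;
    case RemovalReason::REPLACED_BY_NEW_REGISTRATION:
      return RemovedState::SUPERSEDED;
    case RemovalReason::UNREGISTERED:
      return RemovedState::SHUT_DOWN;
  }
  return RemovedState::SHUT_DOWN; // Unreachable; silences compiler warnings.
}

// Minimal stand-in for the agent ID and agent info the master would keep.
struct RemovedAgentRecord
{
  RemovedState state;
  std::string hostname;
};

// Keeps a record of every removed agent, keyed by agent ID.
class RemovedAgents
{
public:
  void record(
      const std::string& agentId,
      const std::string& hostname,
      RemovalReason reason)
  {
    records_[agentId] = RemovedAgentRecord{stateFor(reason), hostname};
  }

  // Returns the stored record, or nullptr if the agent was never removed.
  const RemovedAgentRecord* find(const std::string& agentId) const
  {
    std::map<std::string, RemovedAgentRecord>::const_iterator it =
      records_.find(agentId);
    return it == records_.end() ? nullptr : &it->second;
  }

private:
  std::map<std::string, RemovedAgentRecord> records_;
};
```

With a record like this, an API query for a removed agent could report a distinct state rather than treating the agent as unknown. How long such records are retained, and whether they are persisted in the registry, are open questions this sketch does not address.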

The outcome of this ticket should be a design doc that describes the agent state diagram and gives a high-level view of how it could be implemented. New tickets for the implementation should also be created.

Issue Links

blocks: MESOS-9541 Transition agent operations to some "lost" state when the agent is removed. (Open)

People

Assignee: Unassigned

Reporter: Greg Mann

Votes: 0

Watchers: 3

Dates

Created: 06/Feb/19 16:43

Updated: 23/Jan/20 18:17