Details
- Type: Improvement
- Status: Accepted
- Priority: Major
- Resolution: Unresolved
Description
The agent's lifecycle is currently not well defined. Some agent states are not represented by distinct agent state values in the code, and no documentation clearly lays out the state diagram for an agent, including the events that transition an agent from one state to another.
We should design this state diagram to ensure that all agents are always in a well-defined state which is represented in the code and visible to users via our APIs.
This work will include examining the Master::_removeSlave() function, which currently handles three cases of agent removal:
- Starting maintenance on an agent via the 'startMaintenance()' handler
- When an agent submits a new registration from a previously-known IP:port, via the _registerSlave() method (aka the 'deleted latest symlink' case)
- When an agent shuts itself down via an UnregisterSlaveMessage (aka the SIGUSR1 case)
In these cases, the master does not transition the agent to a new state; it simply removes the agent. We should define agent states for these cases and ensure that the master stores these agent IDs and/or agent infos.
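
As a starting point for the design discussion, the three removal cases above could be modeled as explicit transitions rather than deletions. The following is only a minimal sketch, not Mesos code: the state names, event names, `transition()` function, and `AgentTable` class are all hypothetical and chosen for illustration, and the sketch assumes C++17. It shows a master-side table that keeps an agent's ID with a terminal state instead of erasing the entry.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical agent states. None of these names exist in the Mesos code
// base; they only illustrate giving each removal case a distinct state.
enum class AgentState {
  REGISTERED,            // Agent is known and active.
  DOWN_FOR_MAINTENANCE,  // Removed because maintenance was started.
  SUPERSEDED,            // Replaced by a new registration from the same IP:port.
  UNREGISTERED           // Agent shut itself down via an unregister message.
};

// Hypothetical events, one per removal case listed in this ticket.
enum class AgentEvent {
  MAINTENANCE_STARTED,
  NEW_REGISTRATION_SAME_ADDRESS,
  UNREGISTER_MESSAGE
};

// Returns the next state, or std::nullopt if the event is not valid in the
// current state. In this sketch only a REGISTERED agent can be removed.
std::optional<AgentState> transition(AgentState from, AgentEvent event) {
  if (from != AgentState::REGISTERED) {
    return std::nullopt;
  }
  switch (event) {
    case AgentEvent::MAINTENANCE_STARTED:
      return AgentState::DOWN_FOR_MAINTENANCE;
    case AgentEvent::NEW_REGISTRATION_SAME_ADDRESS:
      return AgentState::SUPERSEDED;
    case AgentEvent::UNREGISTER_MESSAGE:
      return AgentState::UNREGISTERED;
  }
  return std::nullopt;
}

// Hypothetical master-side table: removed agents stay in the map with a
// terminal state, so their IDs remain visible after removal.
class AgentTable {
 public:
  void add(const std::string& id) { states_[id] = AgentState::REGISTERED; }

  // Applies an event to a known agent; returns false if the agent is
  // unknown or the transition is not allowed.
  bool apply(const std::string& id, AgentEvent event) {
    auto it = states_.find(id);
    if (it == states_.end()) {
      return false;
    }
    std::optional<AgentState> next = transition(it->second, event);
    if (!next) {
      return false;
    }
    it->second = *next;
    return true;
  }

  std::optional<AgentState> state(const std::string& id) const {
    auto it = states_.find(id);
    if (it == states_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  std::map<std::string, AgentState> states_;
};
```

Keeping the transition logic in a single function makes the state diagram easy to review against the eventual design doc; whether removed agents may later leave these states (for example, after maintenance ends) is left open here.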
The outcome of this ticket should be a design doc describing the agent state diagram, and a high-level view of how this could be implemented. New tickets for the implementation should also be created.
Issue Links
- blocks: MESOS-9541 Transition agent operations to some "lost" state when the agent is removed. (Open)