Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
If an agent fails health checks, it is removed from the cluster. The next time the agent connects to the master, it is instructed to shut down, and all of its tasks and executors are killed. The next time the agent is started, it is assigned a new agent ID. Any persistent volumes from the previous agent instance are preserved, but they are now associated with the new agent ID.
This is problematic because volume IDs are not required to be globally unique. Hence, it is natural for frameworks to use the pair <agent-id, volume-id> to uniquely identify a volume. If volume k moves from agent foo to agent bar, it is hard for a framework to determine whether <bar, k> is the "same" volume that was previously called <foo, k> (it might be able to figure this out from `slaveLost` callbacks, but those aren't reliable). Similarly, the HTTP endpoints for volumes and dynamic reservations include a slave ID.
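To make the failure mode concrete, here is a minimal sketch of framework-side bookkeeping keyed by the <agent-id, volume-id> pair. The `VolumeRegistry` class, its methods, and the owner name `database-1` are hypothetical and exist only for illustration; none of them are part of the Mesos API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key type: (agent_id, volume_id). Volume IDs alone are not
# globally unique, so the agent ID is part of the key.
VolumeKey = Tuple[str, str]


class VolumeRegistry:
    """Illustrative framework-side bookkeeping; not a Mesos API."""

    def __init__(self) -> None:
        self._owners: Dict[VolumeKey, str] = {}

    def record(self, agent_id: str, volume_id: str, owner: str) -> None:
        # Remember which of the framework's services owns this volume.
        self._owners[(agent_id, volume_id)] = owner

    def lookup(self, agent_id: str, volume_id: str) -> Optional[str]:
        # Returns None when the framework has never seen this pair.
        return self._owners.get((agent_id, volume_id))


registry = VolumeRegistry()
registry.record("foo", "k", "database-1")

# The agent fails health checks, is shut down, and restarts with the new
# agent ID "bar". Volume "k" still exists on disk, but its key has changed.
before = registry.lookup("foo", "k")
after = registry.lookup("bar", "k")

print(before)
print(after)
```

In this sketch the second lookup finds nothing: the framework has no way to tell that <bar, k> holds the data it previously recorded under <foo, k> unless it learns about the ID change through some other channel.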
Issue Links
- is related to: MESOS-4049 Allow user to control behavior of partitioned agents/tasks (Resolved)
- relates to: MESOS-5368 Consider introducing persistent agent ID (Open)