Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- None
Description
At present, if an agent is partitioned away from the master, the master waits for a period of time (see MESOS-4048) before deciding that the agent is dead. It then marks the agent as lost, sends TASK_LOST messages for all tasks running on the agent, and instructs the agent to shut down.
Although this behavior is desirable for many users, it is not ideal for everyone. For example:
- Some users might want to aggressively start a new replacement task (e.g., after one or two ping timeouts are missed); then when the old copy of the task comes back, they might want to make an intelligent decision about how to reconcile this situation (e.g., kill old, kill new, allow both to continue running).
- Some frameworks might want different behavior from other frameworks, or to treat some tasks differently from other tasks. For example, if a task has a huge amount of state that would need to be regenerated to spin up another instance, the user might want to wait longer before starting a new task to increase the chance that the old task will reappear.
To support these cases, we'd need to change the task state machine so that a task can go from RUNNING to a new state (say UNKNOWN or WANDERING), and then from that state back to RUNNING. We could also keep the current "mark-lost-after-timeout" behavior as an option, in which case UNKNOWN could also transition to LOST. The agent would also keep its old slaveId when it reconnects.
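As a rough illustration of the proposed transitions, here is a minimal sketch in Python. It is not Mesos code: the `TaskState` enum, the `ALLOWED` table, and the helper names are hypothetical, and only the three states mentioned above are modeled (real tasks have other states as well). UNKNOWN stands in for whatever name the new state ends up with.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"  # placeholder name for the proposed new state
    LOST = "LOST"


# Transitions discussed in the description (illustrative subset only).
ALLOWED = {
    TaskState.RUNNING: {TaskState.UNKNOWN},
    TaskState.UNKNOWN: {TaskState.RUNNING, TaskState.LOST},
    TaskState.LOST: set(),
}


def transition(current, new):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def on_agent_partitioned(state):
    # The agent missed its pings: the task's fate is no longer known.
    return transition(state, TaskState.UNKNOWN)


def on_agent_reconnected(state):
    # The agent came back (keeping its old ID): the task is running again.
    return transition(state, TaskState.RUNNING)


def on_timeout(state, mark_lost_after_timeout):
    # Optional legacy behavior: give up on the task after the timeout.
    if mark_lost_after_timeout:
        return transition(state, TaskState.LOST)
    return state
```

With the legacy option disabled, a task that goes UNKNOWN simply stays there until the agent reconnects, leaving the replacement decision to the framework.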
Issue Links
- incorporates
  - MESOS-5659 Design doc for TASK_UNREACHABLE (Resolved)
- is duplicated by
  - MESOS-4645 Mesos agent shutdown on healtcheck timeout rather than lost and recovered (Resolved)
- is related to
  - MESOS-4048 Consider unifying slave timeout behavior between steady state and master failover (Accepted)
- relates to
  - MESOS-4894 Volumes, reservations can move to new agent IDs after partition (Open)
  - MESOS-3545 Investigate restoring tasks/executors after machine reboot. (Accepted)
  - MESOS-4050 Change task reconciliation not omit unknown tasks (Accepted)
  - MESOS-4544 Propose design doc for agent partitioning behavior (Resolved)