Description
When the Master is asked to kill a task for a framework it knows about but cannot locate the TaskID, it replies with TASK_LOST.
This is normally fine. However, consider a Master that has just failed over:
--> Master fails over.
--> Framework F re-registers.
--> Slave with Task T in TASK_RUNNING has not yet re-registered.
--> Master::killTask(F, T) cannot find T and replies with TASK_LOST.
--> Slave re-registers with Task T in TASK_RUNNING.
--> Now we've told the framework the task was LOST but it is left RUNNING.
The simple fix is to not reply in such cases and to rely on a later reconciliation request.
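A minimal sketch of that simple fix, using a toy model rather than the real Master code. ToyMaster, KillOutcome and reregisterSlave are hypothetical names made up for illustration, and the sketch assumes the framework has already re-registered:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model; these are not the real Mesos types.
enum class KillOutcome {
  FORWARDED_TO_SLAVE,  // Task is known: pass the kill on to its slave.
  NO_REPLY             // Task is unknown: stay silent, send no TASK_LOST.
};

struct ToyMaster {
  // Task ID -> last known state. Empty right after a failover.
  std::map<std::string, std::string> tasks;

  // Called when a slave re-registers and reports the tasks it is running.
  void reregisterSlave(const std::map<std::string, std::string>& reported) {
    for (const auto& entry : reported) {
      tasks[entry.first] = entry.second;
    }
  }

  // The simple fix: an unknown task no longer produces TASK_LOST, because
  // its slave may simply not have re-registered yet.
  KillOutcome killTask(const std::string& task) const {
    assert(!task.empty());  // Precondition of this sketch.
    if (tasks.count(task) == 0) {
      return KillOutcome::NO_REPLY;
    }
    return KillOutcome::FORWARDED_TO_SLAVE;
  }
};
```

With an empty task map, killTask("T") yields NO_REPLY, so the framework is never told that T is lost. Once reregisterSlave reports T as TASK_RUNNING, the same call yields FORWARDED_TO_SLAVE.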
With a stateful master (MESOS-764), we can reliably reply with TASK_LOST if the slave is not in the Registrar. Otherwise we must remain silent, because the slave may still re-register with the correct state of the task. Ideally we would hold the kill task message and send it to the slave once it re-registers, but this is somewhat complicated to implement, and reconciliation can cover this case.
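The Registrar-based rule could look roughly like the sketch below. It assumes the kill request carries the slave's ID (what the related MESOS-1200 asks for) and models the Registrar as a plain set of slave IDs. KillReply and replyForUnknownTask are hypothetical names, not Mesos APIs:

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical reply choices for a kill request whose task is unknown.
enum class KillReply {
  TASK_LOST,  // Slave is not in the Registrar: the task cannot come back.
  SILENT      // Slave is in the Registrar: it may still re-register.
};

// Decide how to answer a kill request for a task the master cannot find.
// `registrarSlaves` stands in for the set of slave IDs in the Registrar.
KillReply replyForUnknownTask(const std::set<std::string>& registrarSlaves,
                              const std::string& slaveId) {
  assert(!slaveId.empty());  // Precondition of this sketch.
  if (registrarSlaves.count(slaveId) == 0) {
    return KillReply::TASK_LOST;
  }
  // The slave may re-register with the task still running, so saying
  // nothing and leaving the outcome to reconciliation is the safe choice.
  return KillReply::SILENT;
}
```

Holding the kill until the slave re-registers is left out here, in line with the note above that it is more complicated to implement.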
Issue Links
- is related to: MESOS-1200 Add SlaveID to KillTaskMessage to provide feedback for unknown slaves. (Resolved)