Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- None
- None
- Mesos Q3 Sprint 5, Mesos Q3 Sprint 6, Twitter Q4 Sprint 1
- 3
Description
As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows:
Master  Slave
{}      {}
{Tn}    {}      // Master receives Task T, non-terminal. Forwards to slave.
{Tn}    {Tn}    // Slave receives Task T, non-terminal.
{Tn}    {Tt}    // Task becomes terminal on slave. Update forwarded.
{Tt}    {Tt}    // Master receives update, forwards to framework.
{}      {Tt}    // Master receives ack, forwards to slave.
{}      {}      // Slave receives ack.
In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master.
Note the following properties:
(1) The master may have a non-terminal task, not present in the slave's re-registration message.
(2) The master may have a non-terminal task, present in the slave's re-registration message but in a different state.
(3) The slave's re-registration message may contain a terminal unacknowledged task unknown to the master.
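To make the three properties concrete, here is a minimal Python sketch that walks the lifecycle above and reports which property would hold if the slave's re-registration message reached the master at each stage. It is only an illustration: `LIFECYCLE` and `classify` are hypothetical names, and the model assumes a single task T and treats each row as an instantaneous snapshot of both sides.

```python
# Each stage is (master's view, slave's view) of task T.
# None = task absent, "Tn" = non-terminal, "Tt" = terminal.
LIFECYCLE = [
    (None, None),
    ("Tn", None),   # Master receives Task T, forwards to slave.
    ("Tn", "Tn"),   # Slave receives Task T.
    ("Tn", "Tt"),   # Task becomes terminal on slave, update forwarded.
    ("Tt", "Tt"),   # Master receives update, forwards to framework.
    (None, "Tt"),   # Master receives ack, forwards to slave.
    (None, None),   # Slave receives ack.
]


def classify(master, slave):
    """Return the property number (1, 2 or 3) that holds at this stage, else None.

    The slave's re-registration message is assumed to contain exactly the
    task state the slave currently holds (anything not both terminal and
    acknowledged).
    """
    if master == "Tn" and slave is None:
        return 1  # Master has a non-terminal task missing from the message.
    if master == "Tn" and slave != master:
        return 2  # Task is present in the message but in a different state.
    if master is None and slave == "Tt":
        return 3  # Terminal unacknowledged task unknown to the master.
    return None


observed = [classify(master, slave) for master, slave in LIFECYCLE]
print(observed)  # [None, 1, None, 2, None, 3, None]
```

Every other stage of the lifecycle exposes one of the three properties, so the master cannot treat any of them as exceptional.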
In the current master / slave reconciliation code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency!
After chatting with vinodkone, we're considering updating the reconciliation to occur as follows:
→ Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before.
→ If the master holds tasks that are missing from the slave's re-registration message, the master sends those tasks to the slave for reconciliation. This can be piggy-backed on the master's reply to the re-registration message.
→ The slave will send TASK_LOST for any such task that is not known to it, preferably with retries, unless we update socket closure on the slave to force a re-registration.
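The proposed flow can be sketched roughly as follows. This is not the Mesos implementation: `tasks_to_reconcile` and `slave_reconcile` are hypothetical helpers, tasks are modelled as a plain dict from task ID to state name, and restricting the request to non-terminal tasks is an assumption based on property (1). Retries are left out.

```python
# Terminal task states; non-terminal examples are TASK_STAGING and TASK_RUNNING.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def tasks_to_reconcile(master_tasks, reregistered_tasks):
    """Master side: non-terminal tasks the master holds that the slave's
    re-registration message did not mention."""
    return sorted(
        task_id
        for task_id, state in master_tasks.items()
        if state not in TERMINAL and task_id not in reregistered_tasks
    )


def slave_reconcile(slave_tasks, requested):
    """Slave side: answer TASK_LOST only for requested tasks it does not know."""
    return {
        task_id: "TASK_LOST"
        for task_id in requested
        if task_id not in slave_tasks
    }


# The slave re-registered knowing only t1. The launch of t2 was still in
# flight and reached the slave afterwards; the launch of t3 was dropped.
master_tasks = {"t1": "TASK_RUNNING", "t2": "TASK_STAGING", "t3": "TASK_STAGING"}
reregistered = {"t1": "TASK_RUNNING"}
slave_now = {"t1": "TASK_RUNNING", "t2": "TASK_STAGING"}

requested = tasks_to_reconcile(master_tasks, reregistered)
updates = slave_reconcile(slave_now, requested)
print(requested)  # ['t2', 't3']
print(updates)    # {'t3': 'TASK_LOST'}
```

Because the slave decides, t2 is not declared lost even though it was absent from the re-registration message; only t3, which never arrived, produces TASK_LOST.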
Issue Links
- relates to: MESOS-1799 Reconciliation can send out-of-order updates. (Resolved)