Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
1.5.0
-
None
-
Mesosphere Sprint 76, Mesosphere Sprint 77
-
8
Description
Tasks may be explicitly dropped by the agent if all the following conditions are met:
(1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls use the same executor.
(2) The executor currently does not exist on the agent.
(3) Due to a race condition, these tasks reach the launch stage on the agent in a different order from their original launch order (see below for how this can happen).
In this case, tasks that try to launch on the agent before the first task in the original order will be explicitly dropped by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).
Currently, Mesos does not guarantee in-order task launch on the agent. Suppose the Mesos master sends two `launchTask` messages (launching task1 and task2) to an agent. In most cases (except MESOS-3870), these messages are delivered to the agent in order. However, there are currently two asynchronous steps (unschedule GC and task authorization) in the agent's task launch path. Depending on the CPU scheduling order, the task2 launch may finish these two steps earlier than task1 and reach the launch executor stage before task1.
In this case, prior to MESOS-1720, both tasks would still get launched. If task1 and task2 use the same executor, whichever reaches the launch executor stage first launches the executor.
However, after MESOS-1720 was resolved, agents started to enforce an order for tasks using the same executor. Specifically, when the master crafts the launch task message, it specifies the `launch_executor` flag. Thus task1 in the above case will have the `launch_executor` flag set to true, and task2 (and any subsequent tasks that use the same executor) will have the flag set to false.
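As a rough illustration of the flag assignment described above, the following sketch marks only the first task for each executor with `launch_executor = true`. The function name `assignLaunchExecutorFlags` and the use of plain strings for executor IDs are hypothetical. The real master decides based on its own view of which executors exist on the agent, which this sketch does not model.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: given the executor ID used by each task, in the
// master's launch order, return the `launch_executor` value for each task.
// Assumption: none of these executors exists on the agent beforehand.
std::vector<bool> assignLaunchExecutorFlags(
    const std::vector<std::string>& executorIds) {
  std::set<std::string> seen;
  std::vector<bool> flags;
  for (const std::string& id : executorIds) {
    // insert().second is true only the first time an ID is seen, so only
    // the first task per executor is asked to launch it.
    flags.push_back(seen.insert(id).second);
  }
  return flags;
}
```

With task1 and task2 sharing one executor, this yields true for task1 and false for task2, matching the case described above.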
If task2 reaches the launch executor stage before task1 (due to the race condition described above), the agent will see that its `launch_executor` flag is false but the executor specified in the `launchTask` message is not running. As a result, it will explicitly drop task2.
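The drop behavior can be modeled with a small, simplified sketch. This is not the agent's actual code: `LaunchMessage` and `droppedTasks` are hypothetical names, and only the case described in this ticket (executor missing and flag false) is treated as a drop.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified view of a launch message.
struct LaunchMessage {
  std::string taskId;
  std::string executorId;
  bool launchExecutor;  // mirrors the `launch_executor` flag
};

// Processes messages in the order they reach the launch executor stage
// and returns the IDs of the tasks that would be dropped.
std::vector<std::string> droppedTasks(
    const std::vector<LaunchMessage>& arrivals) {
  std::set<std::string> running;  // executors launched so far
  std::vector<std::string> dropped;
  for (const LaunchMessage& m : arrivals) {
    if (running.count(m.executorId) > 0) {
      continue;  // executor exists: the task runs on it
    }
    if (m.launchExecutor) {
      running.insert(m.executorId);  // this task launches the executor
    } else {
      dropped.push_back(m.taskId);  // flag is false and executor is missing
    }
  }
  return dropped;
}
```

If task1 (flag true) arrives before task2 (flag false), nothing is dropped. If the two swap places, task2 is dropped and task1 still launches the executor.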
Based on discussion with chhsia0 and bmahler, we should take the explicit approach of using `process::Sequence` to ensure ordered task delivery (on both the master and the agent).
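To illustrate the ordering guarantee being asked for, here is a minimal standard-library sketch. It does not use libprocess and is not how `process::Sequence` is implemented. It is a plain reorder buffer, with the hypothetical class name `InOrderRelease`, that lets tasks proceed to the launch executor stage only in arrival order, even when their asynchronous steps finish out of order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical reorder buffer: tickets are handed out in arrival order,
// and completed tasks are released strictly in ticket order.
class InOrderRelease {
public:
  // Called when a launch message arrives; returns its ticket.
  std::size_t enqueue() { return nextTicket_++; }

  // Called when the asynchronous steps for `ticket` have finished.
  // Returns the tasks that may now proceed, in arrival order.
  std::vector<std::string> complete(std::size_t ticket,
                                    const std::string& task) {
    ready_[ticket] = task;
    std::vector<std::string> released;
    // Release every consecutive ticket starting at the next expected one.
    std::map<std::size_t, std::string>::iterator it =
        ready_.find(nextToRelease_);
    while (it != ready_.end()) {
      released.push_back(it->second);
      ready_.erase(it);
      ++nextToRelease_;
      it = ready_.find(nextToRelease_);
    }
    return released;
  }

private:
  std::size_t nextTicket_ = 0;
  std::size_t nextToRelease_ = 0;
  std::map<std::size_t, std::string> ready_;
};
```

If task2 finishes its asynchronous steps first, it is held back until task1 completes, and both are then released with task1 first, so the task carrying `launch_executor = true` always reaches the launch executor stage before the others.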
Issue Links
- causes
-
MESOS-8617 Tests using default executor occasionally fail.
- Resolved
- is caused by
-
MESOS-1720 Slave should send exited executor message when the executor is never launched.
- Resolved