Mesos / MESOS-8624

Valid tasks may be explicitly dropped by agent due to race conditions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • None
    • Sprint: Mesosphere Sprint 76, Mesosphere Sprint 77
    • 8

    Description

      Tasks may be explicitly dropped by the agent if all of the following conditions are met:
      (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls use the same executor.
      (2) The executor does not currently exist on the agent.
      (3) Due to race conditions, these tasks reach the launch stage on the agent in a different order from their original launch order (see below for how this can happen).

      In this case, any task that tries to launch on the agent before the first task in the original order will be explicitly dropped by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).

      Mesos currently does not guarantee in-order task launch on the agent. Suppose the Mesos master sends two `launchTask` messages (launching task1 and task2) to an agent. In most cases (except MESOS-3870), these messages are delivered to the agent in order. However, there are currently two asynchronous steps (unschedule GC and task authorization) in the agent's task launch path. Depending on the CPU scheduling order, the task2 launch may finish these two steps earlier than task1 and reach the launch executor stage before task1.
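The reordering described above can be reproduced in a small model. The following is only a sketch in Python's `asyncio`, not Mesos code: the `launch` coroutine, the per-task gates, and the `arrivals` list are hypothetical stand-ins for the agent's asynchronous steps and its launch executor stage.

```python
import asyncio


async def launch(name, gate, arrivals):
    # Stand-in for the asynchronous steps (unschedule GC, task
    # authorization): the task cannot proceed until its gate opens.
    await gate.wait()
    # Stand-in for reaching the launch executor stage.
    arrivals.append(name)


async def simulate():
    arrivals = []
    gates = {"task1": asyncio.Event(), "task2": asyncio.Event()}

    # The messages are handled in their original order: task1, then task2.
    pending = [
        asyncio.create_task(launch(name, gate, arrivals))
        for name, gate in gates.items()
    ]
    await asyncio.sleep(0)  # Let both coroutines start and block.

    # task2's asynchronous steps happen to finish first.
    gates["task2"].set()
    await asyncio.sleep(0)
    gates["task1"].set()

    await asyncio.gather(*pending)
    return arrivals


def run_simulation():
    return asyncio.run(simulate())
```

Calling `run_simulation()` returns the order in which the two tasks reached the final stage. In this model task2 arrives ahead of task1 even though task1 was started first.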

      Prior to MESOS-1720, both tasks would still get launched in this case: if task1 and task2 use the same executor, whichever reaches the launch executor stage first launches the executor.

      However, since MESOS-1720 was resolved, agents enforce an order for tasks using the same executor. Specifically, when the master crafts the launch task message, it sets the `launch_executor` flag. Thus task1 in the above case will have the `launch_executor` flag set to true, and task2 (and any subsequent tasks that use the same executor) will have the flag set to false.
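As an illustration of the flag assignment just described, here is a minimal Python model. It is not the master's actual code: `assign_launch_executor` and its tuple layout are hypothetical, and the only behavior taken from the description is that the first task for an executor gets `launch_executor` set to true and later tasks get false.

```python
def assign_launch_executor(launches, running_executors=()):
    """Return (task, executor, launch_executor) triples in launch order.

    `launches` is a list of (task, executor) pairs. The flag is True only
    for the first task that refers to an executor not yet known.
    """
    known = set(running_executors)
    messages = []
    for task, executor in launches:
        launch_executor = executor not in known
        known.add(executor)
        messages.append((task, executor, launch_executor))
    return messages
```

For example, `assign_launch_executor([("task1", "e1"), ("task2", "e1")])` marks task1 with true and task2 with false.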

      If task2 reaches the launch executor stage before task1 (due to the race condition described above), the agent will see that its `launch_executor` flag is false but the executor specified in the `launchTask` message is not running. As a result, it will explicitly drop task2, as in:

      https://github.com/apache/mesos/blob/32f6d4eec2724414e217875f4f7d3b2538db5381/src/slave/slave.cpp#L2888
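The agent-side decision can be modeled roughly as follows. This is a simplified Python sketch, not the code at the link above: `agent_handle` is a hypothetical name, and it covers only the case described in this report (flag false, executor missing). Every other combination is treated as a successful launch, which is an assumption made to keep the model small.

```python
def agent_handle(messages):
    """Process (task, executor, launch_executor) triples in arrival order.

    Returns a dict mapping each task to "launched" or "dropped".
    """
    executors = set()  # Executors that exist on this model agent.
    outcomes = {}
    for task, executor, launch_executor in messages:
        if not launch_executor and executor not in executors:
            # The task expects an existing executor, but none is running.
            outcomes[task] = "dropped"
        else:
            # Simplification: any other case launches the task and, if
            # needed, the executor.
            executors.add(executor)
            outcomes[task] = "launched"
    return outcomes
```

With messages arriving in the original order, both tasks launch. If the message for task2 (flag false) is processed first, the model drops task2 and still launches task1.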

      Based on discussion with chhsia0 and bmahler, we should take the explicit approach of using `process::Sequence` to ensure ordered task delivery (on both the master and agent).
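The idea behind sequencing can be sketched with a small Python analog. This is not libprocess and not the actual fix: the `Sequence` class below is a hypothetical `asyncio` stand-in that only illustrates the general technique of starting each asynchronous callback after the previously added one has finished.

```python
import asyncio


class Sequence:
    """Runs added callbacks one at a time, in the order they were added."""

    def __init__(self):
        self._last = None  # Task for the most recently added callback.

    def add(self, callback):
        previous = self._last

        async def run():
            if previous is not None:
                # Wait for the previous callback to finish, whether or
                # not it succeeded, before starting this one.
                await asyncio.wait([previous])
            return await callback()

        self._last = asyncio.ensure_future(run())
        return self._last


async def demo():
    arrivals = []
    sequence = Sequence()

    async def step(name, delay):
        await asyncio.sleep(delay)  # Stand-in for an asynchronous step.
        arrivals.append(name)

    # task1's step is slower, yet it still completes before task2 starts.
    first = sequence.add(lambda: step("task1", 0.02))
    second = sequence.add(lambda: step("task2", 0.0))
    await asyncio.gather(first, second)
    return arrivals


def run_demo():
    return asyncio.run(demo())
```

In this model `run_demo()` records task1 before task2 even though task2's step is faster, because task2's callback is not started until task1's has completed.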

            People

              Meng Zhu (mzhu)
              Greg Mann