Mesos / MESOS-8624

Valid tasks may be explicitly dropped by agent due to race conditions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • None
    • Sprint: Mesosphere Sprint 76, Mesosphere Sprint 77
    • 8

    Description

      Tasks may be explicitly dropped by the agent if all the following conditions are met:
      (1) Several `LAUNCH_TASK` or `LAUNCH_GROUP` calls use the same executor.
      (2) The executor currently does not exist on the agent.
      (3) Due to race conditions, these tasks attempt to launch on the agent in a different order from their original launch order (see below for how this can happen).

      In this case, tasks that try to launch on the agent before the first task in the original order will be explicitly dropped by the agent (`TASK_DROPPED` or `TASK_LOST` will be sent).

      Currently, Mesos does not guarantee in-order task launch on the agent. Suppose the Mesos master sends two `launchTask` messages (launching task1 and task2) to an agent. In most cases (except MESOS-3870), these messages are delivered to the agent in order. However, there are currently two asynchronous steps (unschedule GC and task authorization) in the agent's task launch path. Depending on the CPU scheduling order, the task2 launch may finish these two steps earlier than task1 and reach the launch executor stage before task1.
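
The reordering can be illustrated with a small deterministic model. This is only a sketch, not agent code: `PendingLaunch`, `arrivalOrder`, and the tick counts are hypothetical, and the two delay fields merely stand in for the unschedule GC and task authorization steps.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model of one task moving through the agent launch path.
struct PendingLaunch {
  std::string task;
  int receivedAt;     // tick at which the launch message was received
  int unscheduleGc;   // ticks spent in the asynchronous unschedule GC step
  int authorization;  // ticks spent in the asynchronous authorization step
};

// Returns task names in the order they reach the launch executor stage,
// i.e. sorted by the tick at which both asynchronous steps have finished.
std::vector<std::string> arrivalOrder(std::vector<PendingLaunch> launches) {
  std::stable_sort(
      launches.begin(), launches.end(),
      [](const PendingLaunch& a, const PendingLaunch& b) {
        return a.receivedAt + a.unscheduleGc + a.authorization <
               b.receivedAt + b.unscheduleGc + b.authorization;
      });

  std::vector<std::string> order;
  for (const PendingLaunch& launch : launches) {
    order.push_back(launch.task);
  }
  return order;
}
```

With task1 received at tick 0 and needing 3 + 2 ticks, and task2 received at tick 1 and needing 1 + 1 ticks, task2 finishes at tick 3 and task1 at tick 5, so task2 reaches the launch executor stage first even though it was received second.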

      In this case, prior to MESOS-1720, both tasks would still get launched: if task1 and task2 use the same executor, whichever reaches the launch executor stage first launches the executor.

      However, since MESOS-1720 was resolved, agents enforce an order for tasks using the same executor. Specifically, when the master crafts the launch task message, it sets the `launch_executor` flag. Thus task1 in the above case will have `launch_executor` set to true, and task2 (and any subsequent task that uses the same executor) will have it set to false.
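
As a rough illustration of how the flag ends up distributed across tasks, the sketch below assigns it in launch order. `LaunchMessage` and `craftLaunches` are hypothetical names, and the struct is a simplified stand-in rather than the actual Mesos message; only the `launch_executor` idea comes from the description above.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a launch message.
struct LaunchMessage {
  std::string task;
  std::string executor;
  bool launchExecutor;
};

// Walks (task, executor) pairs in launch order and sets the flag to true
// only for the first task targeting an executor not already known.
std::vector<LaunchMessage> craftLaunches(
    const std::vector<std::pair<std::string, std::string>>& tasks,
    std::set<std::string> knownExecutors = {}) {
  std::vector<LaunchMessage> messages;
  for (const auto& entry : tasks) {
    // insert() reports whether the executor was newly added.
    const bool first = knownExecutors.insert(entry.second).second;
    messages.push_back({entry.first, entry.second, first});
  }
  return messages;
}
```

For task1 and task2 sharing one executor, this yields true for task1 and false for task2, which is the assignment the agent later relies on.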

      If task2 reaches the launch executor stage before task1 (due to the race condition described above), the agent will see that its `launch_executor` flag is false but the executor specified in the `launchTask` message is not running. As a result, it will explicitly drop task2, as in:

      https://github.com/apache/mesos/blob/32f6d4eec2724414e217875f4f7d3b2538db5381/src/slave/slave.cpp#L2888
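
The drop condition can be sketched in isolation as follows. This is not the code at the link above: `Arrival` and `processArrivals` are hypothetical, and the model covers only the single case described in this issue (flag false while the executor is not running).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record of a task reaching the launch executor stage.
struct Arrival {
  std::string task;
  std::string executor;
  bool launchExecutor;
};

// Processes arrivals in the order given and reports one outcome per task.
std::vector<std::string> processArrivals(const std::vector<Arrival>& arrivals) {
  std::set<std::string> running;  // executors this toy agent has launched
  std::vector<std::string> outcomes;

  for (const Arrival& arrival : arrivals) {
    const bool executorRunning = running.count(arrival.executor) > 0;

    // The case from this issue: the task expects an existing executor,
    // but the agent has none, so the task is dropped.
    if (!arrival.launchExecutor && !executorRunning) {
      outcomes.push_back(arrival.task + ": dropped");
      continue;
    }

    if (arrival.launchExecutor) {
      running.insert(arrival.executor);
    }
    outcomes.push_back(arrival.task + ": launched");
  }
  return outcomes;
}
```

Feeding task1 (flag true) before task2 (flag false) launches both; swapping the order drops task2 and launches only task1.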

      Based on discussion with Chun-Hung Hsiao and Benjamin Mahler, we should take the explicit approach of using `process::Sequence` to ensure ordered task delivery (on both the master and the agent).

          People

            mzhu Meng Zhu
            mzhu Meng Zhu
            Greg Mann Greg Mann
            Votes: 0
            Watchers: 5

              Agile

                Completed Sprints:
                Mesosphere Sprint 76 ended 30/Mar/18
                Mesosphere Sprint 77 ended 12/Apr/18