MESOS-9887

Race condition between two terminal task status updates for Docker/Command executor.


    Details

    • Sprint:
      Containerization: RI-17 52, Containerization: RI-17 53
    • Story Points:
      8

      Description

      Overview

      Expected behavior:
      The task successfully finishes and a TASK_FINISHED status update is sent.

      Observed behavior:
      The task successfully finishes, but the agent sends TASK_FAILED with the reason "REASON_EXECUTOR_TERMINATED".

      In normal circumstances, the Docker executor sends the final TASK_FINISHED status update to the agent, which processes it before the executor's process terminates.

      However, if processing of that TASK_FINISHED update is delayed, there is a chance that the Docker executor terminates first and the agent generates a TASK_FAILED update, which is then handled before the TASK_FINISHED status update.

      See the attached logs (race_example.txt), which contain an example of the race condition.

      Reproducing the bug

      1. Add the following code:

        #include <unistd.h>  // needed for ::sleep()

        static int c = 0;
        if (++c == 3) {  // Skip the TASK_STARTING and TASK_RUNNING status updates.
          ::sleep(2);
        }

      to both `ComposingContainerizerProcess::status` and `DockerContainerizerProcess::status`.
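
      For reference, here is a minimal self-contained check of the delay trick above (not Mesos code; the `status()` function below is only a stand-in). It shows that the sleep fires on the third call only, i.e. after the TASK_STARTING and TASK_RUNNING lookups have gone through:

        // Stand-alone sketch, not Mesos code: the static counter delays
        // only the third invocation of status().
        #include <unistd.h>

        #include <iostream>

        void status()  // stand-in for the containerizer's status() path
        {
          static int c = 0;
          if (++c == 3) {
            std::cout << "delaying call #" << c << std::endl;
            ::sleep(2);
          }
        }

        int main()
        {
          for (int i = 0; i < 4; i++) {
            status();
          }
          return 0;
        }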

      2. Recompile Mesos.

      3. Launch the Mesos master and agent locally.

      4. Launch a simple Docker task via `mesos-execute`:

      cd build
      ./src/mesos-execute --master="`hostname`:5050" --name="a" --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" --command="ls"

      Race condition - description

      1. The Mesos agent receives the TASK_FINISHED status update and then calls `containerizer->status()` for it.

      2. The `containerizer->status()` operation for the TASK_FINISHED update gets delayed in the composing containerizer (e.g. due to a switch of the worker thread that executes the `status` method).

      3. The Docker executor terminates and the agent triggers TASK_FAILED.

      4. The Docker containerizer destroys the container. A callback registered on the `containerizer->wait` call in the composing containerizer dispatches a lambda that will clean up the `containers_` map.

      5. The composing containerizer resumes and dispatches the `status()` call for TASK_FINISHED to the Docker containerizer, which in turn hangs for a few seconds.

      6. The corresponding `containerId` is removed from the `containers_` map of the composing containerizer.

      7. The Mesos agent calls `containerizer->status()` for the TASK_FAILED status update.

      8. The composing containerizer returns "Container not found" for TASK_FAILED.

      9. `Slave::_statusUpdate` stores the terminal TASK_FAILED status update in the executor's data structure.

      10. The Docker containerizer resumes and finishes processing the `status()` call for TASK_FINISHED, then returns control to the `Slave::_statusUpdate` continuation, which discovers that the executor has already been destroyed. A minimal model of this ordering is sketched below.
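
      The following is a minimal, self-contained model of this ordering hazard, not the actual Mesos code: one thread plays the delayed TASK_FINISHED status lookup, the other plays the executor-termination path that erases the container and then processes TASK_FAILED. The names (`containers_`, `containerKnown`, `terminalUpdates`) are illustrative stand-ins only.

        #include <unistd.h>

        #include <iostream>
        #include <map>
        #include <mutex>
        #include <string>
        #include <thread>
        #include <vector>

        std::map<std::string, std::string> containers_;   // containerId -> state
        std::vector<std::string> terminalUpdates;         // order in which updates land
        std::mutex mutex_;

        // Loosely models the composing containerizer's status() lookup (step 8).
        bool containerKnown(const std::string& containerId)
        {
          std::lock_guard<std::mutex> lock(mutex_);
          return containers_.count(containerId) > 0;      // false => "Container not found"
        }

        int main()
        {
          containers_["c1"] = "RUNNING";

          // Delayed handling of the TASK_FINISHED update (steps 2 and 5).
          std::thread finished([] {
            ::sleep(2);                                   // the injected delay
            containerKnown("c1");                         // resumes too late (step 10)
            std::lock_guard<std::mutex> lock(mutex_);
            terminalUpdates.push_back("TASK_FINISHED");
          });

          // Executor-termination path (steps 3-9): destroy the container,
          // erase it from the map, then process TASK_FAILED.
          std::thread failed([] {
            {
              std::lock_guard<std::mutex> lock(mutex_);
              containers_.erase("c1");                    // step 6
            }
            bool known = containerKnown("c1");            // step 8: not found
            std::lock_guard<std::mutex> lock(mutex_);
            terminalUpdates.push_back(
                known ? "TASK_FAILED" : "TASK_FAILED (Container not found)");
          });

          failed.join();
          finished.join();

          // The first terminal update recorded is the one the agent ends up
          // honoring: TASK_FAILED, even though the task actually finished.
          std::cout << "first terminal update: " << terminalUpdates.front() << std::endl;
          return 0;
        }

      Build with a C++11 compiler and -pthread; the delayed thread always loses the race in this model, mirroring how TASK_FAILED wins over TASK_FINISHED in the bug.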

        Attachments

        race_example.txt (4 kB, Andrei Budnik)


              People

              • Assignee:
                abudnik (Andrei Budnik)
              • Reporter:
                abudnik (Andrei Budnik)
              • Shepherd:
                Gilbert Song
