MESOS-9887

Race condition between two terminal task status updates for Docker/Command executor.


    Details

    • Sprint:
      Containerization: RI-17 52, Containerization: RI-17 53
    • Story Points:
      8

      Description

      Overview

      Expected behavior:
      The task successfully finishes and a TASK_FINISHED status update is sent.

      Observed behavior:
      The task successfully finishes, but the agent sends TASK_FAILED with the reason "REASON_EXECUTOR_TERMINATED".

      In normal circumstances, the Docker executor sends the final TASK_FINISHED status update to the agent, which processes it before the executor's process terminates.

      However, if processing of that TASK_FINISHED update is delayed, there is a chance that the Docker executor terminates first and the agent generates a TASK_FAILED update, which is then handled before the TASK_FINISHED status update.

      See the attached logs (race_example.txt), which contain an example of the race condition.

      Reproducing the bug

      1. Add the following code:

        #include <unistd.h>  // needed for ::sleep()

        static int c = 0;
        if (++c == 3) {  // Skip the TASK_STARTING and TASK_RUNNING status updates.
          ::sleep(2);
        }

      to both `ComposingContainerizerProcess::status` and `DockerContainerizerProcess::status`.
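
      For reference, here is a minimal self-contained check of the delay trick above (not Mesos code; the `status()` function below is only a stand-in). It shows that the sleep fires on the third call only, i.e. after the TASK_STARTING and TASK_RUNNING lookups have gone through:

        // Stand-alone sketch, not Mesos code: the static counter delays
        // only the third invocation of status().
        #include <unistd.h>

        #include <iostream>

        void status()  // stand-in for the containerizer's status() path
        {
          static int c = 0;
          if (++c == 3) {
            std::cout << "delaying call #" << c << std::endl;
            ::sleep(2);
          }
        }

        int main()
        {
          for (int i = 0; i < 4; i++) {
            status();
          }
          return 0;
        }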

      2. Recompile Mesos.

      3. Launch the Mesos master and agent locally.

      4. Launch a simple Docker task via `mesos-execute`:

      cd build
      ./src/mesos-execute --master="`hostname`:5050" --name="a" --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" --command="ls"

      Race condition - description

      1. The Mesos agent receives the TASK_FINISHED status update and then calls `containerizer->status()` for it.

      2. The `containerizer->status()` operation for the TASK_FINISHED update gets delayed in the composing containerizer (e.g. due to a switch of the worker thread that executes the `status` method).

      3. The Docker executor terminates and the agent triggers TASK_FAILED.

      4. The Docker containerizer destroys the container. A callback registered on the `containerizer->wait` call in the composing containerizer dispatches a lambda that will clean up the `containers_` map.

      5. The composing containerizer resumes and dispatches the `status()` call for TASK_FINISHED to the Docker containerizer, which in turn hangs for a few seconds.

      6. The corresponding `containerId` is removed from the `containers_` map of the composing containerizer.

      7. The Mesos agent calls `containerizer->status()` for the TASK_FAILED status update.

      8. The composing containerizer returns "Container not found" for TASK_FAILED.

      9. `Slave::_statusUpdate` stores the terminal TASK_FAILED status update in the executor's data structure.

      10. The Docker containerizer resumes and finishes processing the `status()` call for TASK_FINISHED, then returns control to the `Slave::_statusUpdate` continuation, which discovers that the executor has already been destroyed. A minimal model of this ordering is sketched below.
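
      The following is a minimal, self-contained model of this ordering hazard, not the actual Mesos code: one thread plays the delayed TASK_FINISHED status lookup, the other plays the executor-termination path that erases the container and then processes TASK_FAILED. The names (`containers_`, `containerKnown`, `terminalUpdates`) are illustrative stand-ins only.

        #include <unistd.h>

        #include <iostream>
        #include <map>
        #include <mutex>
        #include <string>
        #include <thread>
        #include <vector>

        std::map<std::string, std::string> containers_;   // containerId -> state
        std::vector<std::string> terminalUpdates;         // order in which updates land
        std::mutex mutex_;

        // Loosely models the composing containerizer's status() lookup (step 8).
        bool containerKnown(const std::string& containerId)
        {
          std::lock_guard<std::mutex> lock(mutex_);
          return containers_.count(containerId) > 0;      // false => "Container not found"
        }

        int main()
        {
          containers_["c1"] = "RUNNING";

          // Delayed handling of the TASK_FINISHED update (steps 2 and 5).
          std::thread finished([] {
            ::sleep(2);                                   // the injected delay
            containerKnown("c1");                         // resumes too late (step 10)
            std::lock_guard<std::mutex> lock(mutex_);
            terminalUpdates.push_back("TASK_FINISHED");
          });

          // Executor-termination path (steps 3-9): destroy the container,
          // erase it from the map, then process TASK_FAILED.
          std::thread failed([] {
            {
              std::lock_guard<std::mutex> lock(mutex_);
              containers_.erase("c1");                    // step 6
            }
            bool known = containerKnown("c1");            // step 8: not found
            std::lock_guard<std::mutex> lock(mutex_);
            terminalUpdates.push_back(
                known ? "TASK_FAILED" : "TASK_FAILED (Container not found)");
          });

          failed.join();
          finished.join();

          // The first terminal update recorded is the one the agent ends up
          // honoring: TASK_FAILED, even though the task actually finished.
          std::cout << "first terminal update: " << terminalUpdates.front() << std::endl;
          return 0;
        }

      Build with a C++11 compiler and -pthread; the delayed thread always loses the race in this model, mirroring how TASK_FAILED wins over TASK_FINISHED in the bug.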

        Attachments

        race_example.txt (4 kB, Andrei Budnik)


              People

              • Assignee:
                abudnik (Andrei Budnik)
              • Reporter:
                abudnik (Andrei Budnik)
              • Shepherd:
                Gilbert Song
