[MESOS-10007] Command executor can miss exit status for short-lived commands due to double-reaping. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.9.1, 1.10.0
Component/s: executor, libprocess
Labels:
- foundations

Target Version/s:

1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.9.1
Sprint:
Foundations: RI-19 57, Resource Mgmt: RI-20 58
Story Points:
1

Description

Hi,

While testing Mesos to see if we could use it at work, I encountered a random bug which I believe happens when a command exits really quickly, when run via the command executor.

See the attached test case, but basically all it does is constantly start "exit 0" tasks.

At some point, a task randomly fails with the error "Failed to get exit status for Command":

'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 'source': 'SOURCE_EXECUTOR',

I've had a look at the code, and I found something which could potentially explain it - it's the first time I look at the code so apologies if I'm missing something.

We can see the error originates from `reaped`:

https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017

    } else if (status_->isNone()) {
      taskState = TASK_FAILED;
      message = "Failed to get exit status for Command";
    } else {

Looking at the code, we can see that the `status_` future can be set to `None` in `ReaperProcess::reap`:

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69

Future<Option<int>> ReaperProcess::reap(pid_t pid)
{
  // Check to see if this pid exists.
  if (os::exists(pid)) {
    Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
    promises.put(pid, promise);
    return promise->future();
  } else {
    return None();
  }
}

So we could have this if the process has already been reaped (`kill -0` will fail).

Now, looking at the code path which spawns the process:

`launchTaskSubprocess`

https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724

calls `subprocess`:

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315

If we look at the bottom of the function we can see the following:

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462

  // We need to bind a copy of this Subprocess into the onAny callback
  // below to ensure that we don't close the file descriptors before
  // the subprocess has terminated (i.e., because the caller doesn't
  // keep a copy of this Subprocess around themselves).
  process::reap(process.data->pid)
    .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));  return process;

So at this point we've already called `process::reap`.

And after that, the executor also calls `process::reap`:

https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801

    // Monitor this process.
    process::reap(pid.get())
      .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));

But if we look at the implementation of `process::reap`:

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152

Future<Option<int>> reap(pid_t pid)
{
  // The reaper process is instantiated in `process::initialize`.
  process::initialize();  return dispatch(
      internal::reaper,
      &internal::ReaperProcess::reap,
      pid);
}

We can see that `ReaperProcess::reap` is going to get called asynchronously.

Doesn't this mean that it's possible that the first call to `reap` set up by `subprocess` (https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)

will get executed first, and if the task has already exited by that time, the child will get reaped before the call to `reap` set up by the executor (https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801) gets a chance to run?

In that case, when it runs

if (os::exists(pid)) {

would return false, `reap` would set the future to None which would result in this error.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

test_scheduler.py
04/Oct/19 16:04
2 kB
Charles N
executor_race_reprod.diff
19/Oct/19 11:24
0.4 kB
Charles N
0001-Avoid-double-reaping-race-in-the-command-executor.patch
21/Oct/19 23:33
2 kB
Benjamin Mahler

Activity

People

Assignee:: Benjamin Mahler

Reporter:: Charles N

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 04/Oct/19 16:34

Updated:: 23/Oct/19 19:14

Resolved:: 23/Oct/19 17:37