Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-10007

Command executor can miss exit status for short-lived commands due to double-reaping.

    XMLWordPrintableJSON

Details

    Description

      Hi,

      While testing Mesos to see if we could use it at work, I encountered a random bug which I believe happens when a command exits really quickly, when run via the command executor.

      See the attached test case, but basically all it does is constantly start "exit 0" tasks.

      At some point, a task randomly fails with the error "Failed to get exit status for Command":

       

      'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 'source': 'SOURCE_EXECUTOR',

        

      I've had a look at the code, and I found something which could potentially explain it - it's the first time I look at the code so apologies if I'm missing something.

       We can see the error originates from `reaped`:

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017

          } else if (status_->isNone()) {
            taskState = TASK_FAILED;
            message = "Failed to get exit status for Command";
          } else {

       

      Looking at the code, we can see that the `status_` future can be set to `None` in `ReaperProcess::reap`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69

       

       

      Future<Option<int>> ReaperProcess::reap(pid_t pid)
      {
        // Check to see if this pid exists.
        if (os::exists(pid)) {
          Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
          promises.put(pid, promise);
          return promise->future();
        } else {
          return None();
        }
      }

       

       

      So we could have this if the process has already been reaped (`kill -0` will fail).

       

      Now, looking at the code path which spawns the process:

      `launchTaskSubprocess`

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724

       

      calls `subprocess`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315

       

      If we look at the bottom of the function we can see the following:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462

       

       

        // We need to bind a copy of this Subprocess into the onAny callback
        // below to ensure that we don't close the file descriptors before
        // the subprocess has terminated (i.e., because the caller doesn't
        // keep a copy of this Subprocess around themselves).
        process::reap(process.data->pid)
          .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));  return process;

       

       

      So at this point we've already called `process::reap`.

       

      And after that, the executor also calls `process::reap`:

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801

       

       

          // Monitor this process.
          process::reap(pid.get())
            .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));

       

       

      But if we look at the implementation of `process::reap`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152

       

       

      Future<Option<int>> reap(pid_t pid)
      {
        // The reaper process is instantiated in `process::initialize`.
        process::initialize();  return dispatch(
            internal::reaper,
            &internal::ReaperProcess::reap,
            pid);
      }

      We can see that `ReaperProcess::reap` is going to get called asynchronously.

       

      Doesn't this mean that it's possible that the first call to `reap` set up by `subprocess` (https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)

      will get executed first, and if the task has already exited by that time, the child will get reaped before the call to `reap` set up by the executor (https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801) gets a chance to run?

       

      In that case, when it runs

       

      if (os::exists(pid)) {

      would return false, `reap` would set the future to None which would result in this error.

       

      Attachments

        1. test_scheduler.py
          2 kB
          Charles N
        2. executor_race_reprod.diff
          0.4 kB
          Charles N
        3. 0001-Avoid-double-reaping-race-in-the-command-executor.patch
          2 kB
          Benjamin Mahler

        Activity

          People

            bmahler Benjamin Mahler
            Charle Charles N
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: