Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-10007

Command executor can miss exit status for short-lived commands due to double-reaping.

Attach filesAttach ScreenshotVotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      Hi,

      While testing Mesos to see if we could use it at work, I encountered a random bug which I believe happens when a command exits really quickly, when run via the command executor.

      See the attached test case, but basically all it does is constantly start "exit 0" tasks.

      At some point, a task randomly fails with the error "Failed to get exit status for Command":

       

      'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', 'source': 'SOURCE_EXECUTOR',

        

      I've had a look at the code, and I found something which could potentially explain it - it's the first time I look at the code so apologies if I'm missing something.

       We can see the error originates from `reaped`:

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017

          } else if (status_->isNone()) {
            taskState = TASK_FAILED;
            message = "Failed to get exit status for Command";
          } else {

       

      Looking at the code, we can see that the `status_` future can be set to `None` in `ReaperProcess::reap`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69

       

       

      Future<Option<int>> ReaperProcess::reap(pid_t pid)
      {
        // Check to see if this pid exists.
        if (os::exists(pid)) {
          Owned<Promise<Option<int>>> promise(new Promise<Option<int>>());
          promises.put(pid, promise);
          return promise->future();
        } else {
          return None();
        }
      }

       

       

      So we could have this if the process has already been reaped (`kill -0` will fail).

       

      Now, looking at the code path which spawns the process:

      `launchTaskSubprocess`

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724

       

      calls `subprocess`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315

       

      If we look at the bottom of the function we can see the following:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462

       

       

        // We need to bind a copy of this Subprocess into the onAny callback
        // below to ensure that we don't close the file descriptors before
        // the subprocess has terminated (i.e., because the caller doesn't
        // keep a copy of this Subprocess around themselves).
        process::reap(process.data->pid)
          .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process));  return process;

       

       

      So at this point we've already called `process::reap`.

       

      And after that, the executor also calls `process::reap`:

      https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801

       

       

          // Monitor this process.
          process::reap(pid.get())
            .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));

       

       

      But if we look at the implementation of `process::reap`:

      https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152

       

       

      Future<Option<int>> reap(pid_t pid)
      {
        // The reaper process is instantiated in `process::initialize`.
        process::initialize();  return dispatch(
            internal::reaper,
            &internal::ReaperProcess::reap,
            pid);
      }

      We can see that `ReaperProcess::reap` is going to get called asynchronously.

       

      Doesn't this mean that it's possible that the first call to `reap` set up by `subprocess` (https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)

      will get executed first, and if the task has already exited by that time, the child will get reaped before the call to `reap` set up by the executor (https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801) gets a chance to run?

       

      In that case, when it runs

       

      if (os::exists(pid)) {

      would return false, `reap` would set the future to None which would result in this error.

       

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            bmahler Benjamin Mahler
            Charle Charles N
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Agile

                Completed Sprints:
                Foundations: RI-19 57 ended 23/Oct/19
                Resource Mgmt: RI-20 58 ended 06/Nov/19
                View on Board

                Slack

                  Issue deployment