Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-5195

Docker executor: task logs lost on shutdown

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 0.27.2
    • None
    • containerization, docker
    • None
    • Linux 4.4.2 "Ubuntu 14.04.2 LTS"

    Description

      When you try to kill a task running in the Docker executor (in our case via Singularity), the task shuts down cleanly but the last logs to standard out / standard error are lost in teardown.

      For example, we run dumb-init. With debugging on, you can see it should write:

      DEBUG("Forwarded signal %d to children.\n", signum);
      

      If you attach strace to the process, you can see it clearly writes the text to stderr. But that message is lost and never is written to the sandbox 'stderr' file.

      We believe the issue starts here, in Docker executor.cpp:

        void shutdown(ExecutorDriver* driver)
        {
          cout << "Shutting down" << endl;
      
          if (run.isSome() && !killed) {
            // The docker daemon might still be in progress starting the
            // container, therefore we kill both the docker run process
            // and also ask the daemon to stop the container.
      
            // Making a mutable copy of the future so we can call discard.
            Future<Nothing>(run.get()).discard();
            stop = docker->stop(containerName, stopTimeout);
            killed = true;
          }
        }
      

      Notice how the "run" future is discarded before the Docker daemon is told to stop – now what will discarding it do?

      void commandDiscarded(const Subprocess& s, const string& cmd)
      {
        VLOG(1) << "'" << cmd << "' is being discarded";
        os::killtree(s.pid(), SIGKILL);
      }
      

      Oops, just sent SIGKILL to the entire process tree...

      You can see another (harmless?) side effect in the Docker daemon logs, it never gets a chance to kill the task:

      ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
      

      I suspect that the fix is wait for 'docker->stop()' to complete before discarding the 'run' future.

      Happy to provide more information if necessary.

      Attachments

        Issue Links

          Activity

            People

              bmahler Benjamin Mahler
              stevenschlansker Steven Schlansker
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: