Hadoop YARN / YARN-8515

container-executor can crash with SIGPIPE after nodemanager restart

Details

    • Reviewed

    Description

      When running with docker on large clusters, we have noticed that docker containers are sometimes not removed: they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we found that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is:

      1. Nodemanager restarts
      2. Containers are recovered and then run for a while
      3. Containers are killed for some (legitimate) reason
      4. Container-executor exits without removing the docker container.

      After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE.

      The shell command executor that is used to start container-executor has threads reading from container-executor's stdout and stderr. When the NM is restarted, these threads are killed, so nothing is left reading from the other end of those pipes. When the container-executor then continues executing after the container exits with an error, it tries to write to stderr (ERRORFILE) and receives a SIGPIPE. Since SIGPIPE is not handled, its default action terminates the container-executor before it can remove the docker container.
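The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not code from container-executor: the function name `signal_from_write_without_reader` is hypothetical, and an anonymous pipe with its read end closed stands in for the stderr pipe whose reader threads went away with the old NM process.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: forks a child that writes to a pipe with no reader.
 * Returns the number of the signal that terminated the child, 0 if the
 * child exited normally, or -1 if the setup failed.
 */
int signal_from_write_without_reader(void) {
    int fds[2];
    int status = 0;
    pid_t pid;

    if (pipe(fds) != 0) {
        return -1;
    }
    /* Close the read end before forking so that no process holds it open. */
    close(fds[0]);

    pid = fork();
    if (pid < 0) {
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: make sure SIGPIPE has its default (terminating) action. */
        signal(SIGPIPE, SIG_DFL);
        /* Nobody can read from this pipe, so the write raises SIGPIPE. */
        (void) write(fds[1], "x", 1);
        /* Only reached if the write did not terminate the process. */
        _exit(0);
    }

    /* Parent: wait for the child and report how it ended. */
    close(fds[1]);
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On a POSIX system the returned value should equal `SIGPIPE`, which matches the observation that the container-executor dies on the write itself and never reaches its cleanup code.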

      We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think trunk will hit this exact failure. However, after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.

      Attachments

        1. YARN-8515.001.patch
          1 kB
          Jim Brennan

          People

            jbrennan Jim Brennan
            jbrennan Jim Brennan
            Votes: 0
            Watchers: 6