YARN-8751: Container-executor permission check errors cause the NM to be marked unhealthy


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: None

    Description

      ContainerLaunch (and ContainerRelaunch) contains logic to mark a NodeManager as UNHEALTHY if a ConfigurationException is thrown by ContainerLaunch#launchContainer (or relaunchContainer). The exception is thrown based on the exit code returned by container-executor; seven different exit codes cause the NM to be marked UNHEALTHY:

      if (exitCode ==
          ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
          exitCode ==
              ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
          exitCode ==
              ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
          exitCode ==
              ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
          exitCode ==
              ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
          exitCode ==
              ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
          exitCode ==
              ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
        throw new ConfigurationException(
            "Linux Container Executor reached unrecoverable exception", e);
      }
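
      For reference, the ConfigurationException thrown here is caught in ContainerLaunch#call, which reports it to the NodeStatusUpdater; that report is what marks the node UNHEALTHY. Roughly, paraphrased from the 3.1/3.2 sources (details may differ slightly):

      } catch (ConfigurationException e) {
        LOG.error("Failed to launch container due to configuration error.", e);
        // Fail the container that hit the error...
        dispatcher.getEventHandler().handle(new ContainerExitEvent(
            containerID, ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
            ret, e.getMessage()));
        // ...and report the exception to the NodeStatusUpdater, which marks
        // the entire NodeManager UNHEALTHY. That takes down every container
        // on the node, not just the one that failed.
        context.getNodeStatusUpdater().reportException(e);
        return ret;
      }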

      I can understand why these are treated as fatal with the existing process container model. However, with privileged Docker containers this may be too harsh: privileged Docker containers don't guarantee that the user's identity will be propagated into the container, so these permission mismatches can occur. Even outside of privileged containers, an application may inadvertently change the permissions on one of these directories, triggering this condition.

      In our case, a container changed the "appcache/<appid>/<containerid>" directory permissions to 774. Some time later, the process in the container died and the retry policy kicked in to RELAUNCH the container. During the RELAUNCH, container-executor checked the permissions of the existing "appcache/<appid>/<containerid>" directory (the workdir is retained for RELAUNCH) and returned exit code 35, COULD_NOT_CREATE_WORK_DIRECTORIES, which is treated as a fatal error. This killed every container running on that node, when only the offending container should have been impacted.

      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Container id: container_e15_1535130383425_0085_01_000005
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exit code: 35
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch container failed
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not create container dirsCould not create local files and directories 5 6
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) -
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell output: main : command provided 4
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : run as user is user
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating script paths...
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Path /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005 has permission 774 but needs permission 750.
      2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
      2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch (ContainerRelaunch.java:call(129)) - Failed to launch container due to configuration error.
      org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Relaunch container failed
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
              ... 8 more
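
      The check that produced the "has permission 774 but needs permission 750" line above lives in the native container-executor binary (C code), not in Java. A hypothetical Java rendering of that check, just to illustrate what trips the relaunch (class and method names here are illustrative, not from the Hadoop source):

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.nio.file.attribute.PosixFilePermission;
      import java.nio.file.attribute.PosixFilePermissions;
      import java.util.Set;

      public class WorkDirPermissionCheck {

        // The retained workdir must be exactly 750 (rwxr-x---). The container
        // in this report had changed it to 774, so the relaunch-time check
        // failed and container-executor exited with 35
        // (COULD_NOT_CREATE_WORK_DIRECTORIES).
        static boolean hasExpectedPermissions(Path workDir) throws IOException {
          Set<PosixFilePermission> actual = Files.getPosixFilePermissions(workDir);
          Set<PosixFilePermission> expected =
              PosixFilePermissions.fromString("rwxr-x---");
          return actual.equals(expected);
        }

        public static void main(String[] args) throws IOException {
          Path workDir = Paths.get(args[0]);
          if (!hasExpectedPermissions(workDir)) {
            System.err.println("Path " + workDir
                + " has wrong permissions, needs 750.");
            System.exit(35); // mirrors COULD_NOT_CREATE_WORK_DIRECTORIES
          }
        }
      }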
      

      The root of the issue could be considered to be that we can't guarantee which user is running in the container, and that we should eliminate writable mounts in this scenario. However, marking the NM unhealthy in all of these cases does seem like overkill.
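
      One possible direction, sketched purely for discussion (this is not a committed patch, and the surrounding fields are illustrative): have the relaunch path treat these exit codes as a failure of the single container rather than rethrowing the ConfigurationException that marks the whole node unhealthy. Something along these lines in ContainerRelaunch#call:

      // Hypothetical handling in ContainerRelaunch#call; names mirror the
      // excerpt above. On RELAUNCH, a bad retained workdir fails only this
      // container instead of poisoning the node.
      try {
        ret = relaunchContainer(ctx);
      } catch (ConfigurationException e) {
        LOG.error("Relaunch failed due to a bad retained workdir; "
            + "failing this container only.", e);
        dispatcher.getEventHandler().handle(new ContainerExitEvent(
            containerId, ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
            ret, e.getMessage()));
        return ret; // note: no NodeStatusUpdater#reportException here
      }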

      Opening this to discuss how we want to address this issue. jlowe, ebadger, Jim_Brennan, eyang, billie.rinaldi, ccondit-target: let me know your thoughts.

      Attachments

        1. YARN-8751.001.patch (2 kB, Craig Condit)


      People

        Assignee: Craig Condit (ccondit-target)
        Reporter: Shane Kumpf (shanekumpf@gmail.com)
        Votes: 0
        Watchers: 11
