  1. Hadoop YARN
  2. YARN-9518

Can't use cgroups with YARN on CentOS 7

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.7.7
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Flags:
      Patch, Important

      Description

      The OS is CentOS 7:

      cat /etc/redhat-release
      CentOS Linux release 7.3.1611 (Core)
      

      After I set the configuration variables for cgroups in YARN, the NodeManager started without any problem. But when I ran a job, it failed, and the NodeManager logged the exceptions shown at the end of this description.

      The key line in these logs is "Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory".

      After analysing the failure, I found the cause. In CentOS 6, the cgroup "cpu" and "cpuacct" subsystems are laid out as follows:

      /sys/fs/cgroup/cpu
      /sys/fs/cgroup/cpuacct
      

      But in CentOS 7, they are laid out as follows:

      /sys/fs/cgroup/cpu -> cpu,cpuacct
      /sys/fs/cgroup/cpuacct -> cpu,cpuacct
      /sys/fs/cgroup/cpu,cpuacct

      "cpu" and "cpuacct" have been merged into "cpu,cpuacct", and "cpu" and "cpuacct" are now symbolic links to it.

      Looking at the source code, the NodeManager gets the cgroup subsystem info by reading /proc/mounts. So the path it gets for both the cpu and cpuacct subsystems is "/sys/fs/cgroup/cpu,cpuacct".
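The lookup described above can be sketched in Python as follows. This is only an illustration, not the NodeManager's actual Java code: the function name, the controller list, and the sample mount lines are assumptions made for the example, written in the usual /proc/mounts field layout.

```python
# Hypothetical sample in /proc/mounts format (device, mount point, fs type, options, ...).
SAMPLE_MOUNTS = """\
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
"""

# Controllers this sketch cares about (an assumption, not an exhaustive list).
KNOWN_CONTROLLERS = {"cpu", "cpuacct", "net_cls", "net_prio"}


def controller_mounts(mounts_text):
    """Map each known cgroup controller to the mount point that lists it."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not a cgroup mount.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        mount_point = fields[1]
        for option in fields[3].split(","):
            if option in KNOWN_CONTROLLERS:
                result.setdefault(option, mount_point)
    return result


mounts = controller_mounts(SAMPLE_MOUNTS)
print(mounts["cpu"])      # /sys/fs/cgroup/cpu,cpuacct
print(mounts["cpuacct"])  # /sys/fs/cgroup/cpu,cpuacct
```

Both controllers resolve to the merged mount point, which is where the comma in the path comes from.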

      The resource description argument passed to container-executor then looks like this:

      cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
      

      There is a comma in the cgroup path, but the comma is also the separator between multiple resources. Therefore container-executor truncates the cgroup path to "/sys/fs/cgroup/cpu" instead of using the correct path "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks", and reports this error in the log: "Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory".
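The truncation can be reproduced with a simplified model of the splitting. The function below is hypothetical and is not the container-executor's C code; it only assumes, as described above, that the value after "cgroups=" is split on commas.

```python
def split_resources_option(option):
    """Split 'key=v1,v2,...' into (key, [v1, v2, ...]), treating ',' as the value separator."""
    key, _, values = option.partition("=")
    return key, values.split(",")


option = ("cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/"
          "container_1554210318404_0057_02_000001/tasks")
key, paths = split_resources_option(option)

print(key)       # cgroups
print(paths[0])  # /sys/fs/cgroup/cpu
print(paths[1])  # cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
```

The first piece is the directory "/sys/fs/cgroup/cpu" rather than a tasks file, which matches the "Is a directory" error.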

      Hence I modified the source code and submitted a patch. The idea of the patch is that the NodeManager uses "/sys/fs/cgroup/cpu" as the cgroup cpu path rather than "/sys/fs/cgroup/cpu,cpuacct". As a result, the resource description argument of container-executor looks like this:

      cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
      

      Note that there is no comma in the path, and it is still a valid path because "/sys/fs/cgroup/cpu" is a symbolic link to "/sys/fs/cgroup/cpu,cpuacct".
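One way to picture the idea is the sketch below. It is not the attached patch: the helper names are hypothetical, and the check that the symbolic link resolves to the merged directory is an assumption added for the example. The layout is rebuilt in a temporary directory so the sketch does not depend on a real /sys/fs/cgroup.

```python
import os
import tempfile


def comma_free_path(mount_point, controller):
    """Prefer the sibling path named after `controller` when it resolves to the merged mount point."""
    if "," not in os.path.basename(mount_point):
        return mount_point  # CentOS 6 style path, nothing to do
    candidate = os.path.join(os.path.dirname(mount_point), controller)
    # Only switch if the candidate really points at the same directory.
    if os.path.realpath(candidate) == os.path.realpath(mount_point):
        return candidate
    return mount_point


def build_fake_layout(root):
    """Create a CentOS 7 style layout under `root`: one merged directory and two symlinks."""
    os.mkdir(os.path.join(root, "cpu,cpuacct"))
    os.symlink("cpu,cpuacct", os.path.join(root, "cpu"))
    os.symlink("cpu,cpuacct", os.path.join(root, "cpuacct"))


with tempfile.TemporaryDirectory() as root:
    build_fake_layout(root)
    merged = os.path.join(root, "cpu,cpuacct")
    chosen = comma_free_path(merged, "cpu")
    print(os.path.basename(chosen))  # cpu
```

If no matching symbolic link exists, the function falls back to the original mount point, so a path without a comma is returned unchanged.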

      After applying the patch, the problem is resolved and the job runs successfully.

      The patch is compatible with the cgroup paths of both CentOS 6 and CentOS 7, and applies equally to other merged cgroup subsystem paths, such as the cgroup network subsystems:

      /sys/fs/cgroup/net_cls -> net_cls,net_prio
      /sys/fs/cgroup/net_prio -> net_cls,net_prio
      /sys/fs/cgroup/net_cls,net_prio

       

       

      ##################################################################################################################################

      NodeManager logs showing the exception:

      2019-04-19 20:17:20,095 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from LOCALIZED to RUNNING
      2019-04-19 20:17:20,101 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1554210318404_0042_01_000001 is : 27
      2019-04-19 20:17:20,103 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_1554210318404_0042_01_000001 and exit code: 27
      ExitCodeException exitCode=27:
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
              at org.apache.hadoop.util.Shell.run(Shell.java:482)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1554210318404_0042_01_000001
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 27
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=27:
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
      2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.run(Shell.java:482)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.lang.Thread.run(Thread.java:745)
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell output: main : command provided 1
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is test_hadoop
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested yarn user is datadev
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
      2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory
      2019-04-19 20:17:20,131 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 27
      2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
      2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1554210318404_0042_01_000001
       

        Attachments

        1. YARN-9518.patch
          1 kB
          Shurong Mai
        2. YARN-9518-branch-2.7.002.patch
          11 kB
          Jonathan Hung
        3. YARN-9518-branch-2.7.7.001.patch
          1 kB
          Shurong Mai
        4. YARN-9518-trunk.001.patch
          1 kB
          Shurong Mai

              People

              • Assignee: Unassigned
              • Reporter: Shurong Mai (shurong.mai)