Hadoop YARN / YARN-4731

container-executor should not follow symlinks in recursive_unlink_children

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.9.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha1, 2.8.2
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      1. Enable LCE (LinuxContainerExecutor) and CGroups.
      2. Submit a MapReduce job.

      2016-02-24 18:56:46,889 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_000001
      2016-02-24 18:56:46,894 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output:
      main : command provided 3
      main : run as user is dsperf
      main : requested yarn user is dsperf
      failed to rmdir job.jar: Not a directory
      Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_000001: 20 (Not a directory)
      Full command array for failed execution:
      [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, dsperf, dsperf, 3, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_000001]
      2016-02-24 18:56:46,894 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0003/container_e02_1456319010019_0003_01_000001 returned with exit code: 255
      org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255:
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:199)
              at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:569)
              at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:265)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: ExitCodeException exitCode=255:
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
              at org.apache.hadoop.util.Shell.run(Shell.java:838)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
              ... 10 more
      
      

      As a result, the NodeManager local directories are not deleted for each application:

      total 36
      drwxr-s--- 4 hdfs hadoop 4096 Feb 25 08:25 ./
      drwxr-s--- 7 hdfs hadoop 4096 Feb 25 08:25 ../
      -rw------- 1 hdfs hadoop  340 Feb 25 08:25 container_tokens
      lrwxrwxrwx 1 hdfs hadoop  111 Feb 25 08:25 job.jar -> /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/11/job.jar/
      lrwxrwxrwx 1 hdfs hadoop  111 Feb 25 08:25 job.xml -> /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/hdfs/appcache/application_1456364845478_0004/filecache/13/job.xml*
      drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 jobSubmitDir/
      -rwx------ 1 hdfs hadoop 5348 Feb 25 08:25 launch_container.sh*
      drwxr-s--- 2 hdfs hadoop 4096 Feb 25 08:25 tmp/
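
The "Not a directory" failure in the log is consistent with this listing: job.jar is a symlink that points at a directory, and rmdir() acts on the link itself rather than on its target. A minimal C sketch of that behavior, using a scratch directory under /tmp. The helper name and paths are illustrative assumptions, not code from container-executor:

```c
#define _XOPEN_SOURCE 700 /* for mkdtemp() and symlink() */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of container-executor).
 * Creates <tmp>/target (a real directory) and <tmp>/job.jar -> target,
 * then calls rmdir() on the symlink. Returns the errno that rmdir()
 * reported, 0 if rmdir() unexpectedly succeeded, or -1 if setup failed.
 */
int rmdir_errno_on_symlink(void) {
  char base[] = "/tmp/yarn4731-XXXXXX";
  char target[128], link_path[128];
  int err = 0;

  if (mkdtemp(base) == NULL) return -1;
  int n = snprintf(target, sizeof(target), "%s/target", base);
  assert(n > 0 && (size_t)n < sizeof(target));
  n = snprintf(link_path, sizeof(link_path), "%s/job.jar", base);
  assert(n > 0 && (size_t)n < sizeof(link_path));

  if (mkdir(target, 0700) != 0 || symlink(target, link_path) != 0) {
    err = -1;
  } else if (rmdir(link_path) != 0) {
    err = errno; /* rmdir() does not follow the link */
  }

  /* Best-effort cleanup of the scratch directory. */
  unlink(link_path);
  rmdir(target);
  rmdir(base);
  return err;
}
```

On Linux the helper is expected to return ENOTDIR, which is errno 20 and matches the "20 (Not a directory)" reported in the log.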
      
      1. YARN-4731.001.patch
        2 kB
        Varun Vasudev
      2. YARN-4731.002.patch
        7 kB
        Colin P. McCabe

          Activity

          jlowe Jason Lowe added a comment -

          I committed this to branch-2.8 and branch-2.8.2 as well.

          cmccabe Colin P. McCabe added a comment -

          Thanks for the reviews, guys.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #9392 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9392/)
          YARN-4731. container-executor should not follow symlinks in (jlowe: rev c58a6d53c58209a8f78ff64e04e9112933489fb5)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          jlowe Jason Lowe added a comment -

          Thanks to Colin for the contribution and to Varun and Bibin for additional review! I committed this to trunk and branch-2.

          jlowe Jason Lowe added a comment -

          Thanks for catching the vulnerability, Colin!

          +1 lgtm. Committing this.

          vvasudev Varun Vasudev added a comment -

          Thanks for pointing out the TOCTOU vulnerability in my patch, Colin P. McCabe. I tried out the patch you uploaded and it looks good to me.

          +1.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      11m 2s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  6m 45s   trunk passed
          +1    compile     0m 21s   trunk passed with JDK v1.8.0_72
          +1    compile     0m 25s   trunk passed with JDK v1.7.0_95
          +1    mvnsite     0m 29s   trunk passed
          +1    mvneclipse  0m 11s   trunk passed
          +1    mvninstall  0m 24s   the patch passed
          +1    compile     0m 25s   the patch passed with JDK v1.8.0_72
          +1    cc          0m 25s   the patch passed
          +1    javac       0m 25s   the patch passed
          +1    compile     0m 23s   the patch passed with JDK v1.7.0_95
          +1    cc          0m 23s   the patch passed
          +1    javac       0m 23s   the patch passed
          +1    mvnsite     0m 26s   the patch passed
          +1    mvneclipse  0m 10s   the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    unit        9m 12s   hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72.
          +1    unit        9m 26s   hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95.
          +1    asflicense  0m 17s   Patch does not generate ASF License warnings.
                            40m 12s



          Subsystem                   Report/Notes
          Docker                      Image:yetus/hadoop:0ca8df7
          JIRA Patch URL              https://issues.apache.org/jira/secure/attachment/12790248/YARN-4731.002.patch
          JIRA Issue                  YARN-4731
          Optional Tests              asflicense compile cc mvnsite javac unit
          uname                       Linux 5c9152fd5ef1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool                  maven
          Personality                 /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision                trunk / d1d4e16
          Default Java                1.7.0_95
          Multi-JDK versions          /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          JDK v1.7.0_95 Test Results  https://builds.apache.org/job/PreCommit-YARN-Build/10655/testReport/
          modules                     C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output              https://builds.apache.org/job/PreCommit-YARN-Build/10655/console
          Powered by                  Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          cmccabe Colin P. McCabe added a comment -

          Here is a patch which changes recursive_unlink_children to skip removing symlinks. It doesn't open up a TOCTOU security issue, since it opens files with O_NOFOLLOW after doing the symlink check. I added a unit test case for recursive_unlink_children.
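
A minimal sketch of the O_NOFOLLOW idea described in this comment, not the actual patch: the open itself refuses to traverse a symlink in the final path component, so there is no gap between a separate check and the open. The helper names and the scratch directory under /tmp are hypothetical.

```c
#define _XOPEN_SOURCE 700 /* for openat(), O_NOFOLLOW, mkdtemp() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Opens `name` relative to the directory `parent_fd` as a directory,
 * refusing to follow a symlink in the final path component.
 * Returns a file descriptor, or -1 with errno set.
 * Hypothetical helper; the real patch may be structured differently.
 */
int open_dir_nofollow(int parent_fd, const char *name) {
  assert(name != NULL);
  return openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
}

/* Self-check: a real directory opens, a symlink to a directory does not.
 * Returns 0 if both behave as expected, -1 otherwise. */
int demo_open_dir_nofollow(void) {
  char base[] = "/tmp/yarn4731-XXXXXX";
  int ok = -1;

  if (mkdtemp(base) == NULL) return -1;
  int base_fd = open(base, O_RDONLY | O_DIRECTORY);
  if (base_fd >= 0 &&
      mkdirat(base_fd, "real", 0700) == 0 &&
      symlinkat("real", base_fd, "link") == 0) {
    int real_fd = open_dir_nofollow(base_fd, "real");
    int link_fd = open_dir_nofollow(base_fd, "link");
    if (real_fd >= 0 && link_fd < 0) ok = 0;
    if (real_fd >= 0) close(real_fd);
    if (link_fd >= 0) close(link_fd);
  }

  /* Best-effort cleanup of the scratch directory. */
  if (base_fd >= 0) {
    unlinkat(base_fd, "link", 0);
    unlinkat(base_fd, "real", AT_REMOVEDIR);
    close(base_fd);
  }
  rmdir(base);
  return ok;
}
```

The errno reported for the refused symlink differs between platforms, so the self-check only looks at whether the open failed.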

          cmccabe Colin P. McCabe added a comment - - edited

          Thanks for finding this bug. Unfortunately, I think the patch has some issues... it introduces a race condition where the path could change during our traversal.

          The issue with patch v1 is a TOCTOU (time of check / time of use) race condition. Here is one example:
          1. container-executor checks /foo to make sure that it's not a symlink; it isn't
          2. An attacker moves /foo out of the way and re-creates /foo as a symlink to /etc
          3. container-executor deletes /foo (which is actually /etc at this point)

          The v2 version I posted avoids this race condition by using O_NOFOLLOW to open the files in step 3.

          Also, one note: we should also be using the dirfd and name, not fullpath. "fullpath" is purely provided for debugging and log messages. The directory could be renamed while we're traversing it; we don't want the removal to fail in this case.
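
The race above and the dirfd-plus-name advice can be sketched as follows. This is an illustrative assumption about how such a traversal might look, not the code of YARN-4731.002.patch: every operation uses a directory file descriptor and an entry name, subdirectories are opened with O_NOFOLLOW, and a symlink is removed as a link rather than being followed (whether the real patch removes or skips the link is not shown in this thread). All names are hypothetical.

```c
#define _XOPEN_SOURCE 700 /* for fdopendir(), openat(), unlinkat() */
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Removes every entry inside the directory referred to by `fd`, working
 * only with (directory fd, entry name) pairs. Takes ownership of `fd`.
 * Returns 0 on success, -1 if anything could not be removed.
 */
int remove_children(int fd) {
  assert(fd >= 0);
  DIR *dir = fdopendir(fd);
  if (dir == NULL) { close(fd); return -1; }

  int ret = 0;
  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    const char *name = ent->d_name;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0) continue;

    /* Succeeds only for a real directory; a symlink is never traversed. */
    int child = openat(fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (child >= 0) {
      if (remove_children(child) != 0) ret = -1;
      if (unlinkat(fd, name, AT_REMOVEDIR) != 0) ret = -1;
    } else if (unlinkat(fd, name, 0) != 0) {
      ret = -1; /* files and symlinks are unlinked by name */
    }
  }
  closedir(dir); /* also closes fd */
  return ret;
}

/* Self-check: work/job.jar -> ../victim must not cause victim/ to be
 * emptied. Returns 0 if the behavior is as expected, -1 otherwise. */
int demo_remove_children(void) {
  char base[] = "/tmp/yarn4731-XXXXXX";
  if (mkdtemp(base) == NULL) return -1;
  int base_fd = open(base, O_RDONLY | O_DIRECTORY);
  if (base_fd < 0) { rmdir(base); return -1; }

  int ok = -1;
  if (mkdirat(base_fd, "victim", 0700) == 0 &&
      mkdirat(base_fd, "work", 0700) == 0 &&
      mkdirat(base_fd, "work/sub", 0700) == 0 &&
      symlinkat("../victim", base_fd, "work/job.jar") == 0) {
    int f1 = openat(base_fd, "victim/keep", O_CREAT | O_WRONLY, 0600);
    int f2 = openat(base_fd, "work/sub/file", O_CREAT | O_WRONLY, 0600);
    if (f1 >= 0) close(f1);
    if (f2 >= 0) close(f2);

    int work = -1;
    if (f1 >= 0 && f2 >= 0 &&
        (work = openat(base_fd, "work",
                       O_RDONLY | O_DIRECTORY | O_NOFOLLOW)) >= 0 &&
        remove_children(work) == 0 &&                    /* work/ emptied */
        unlinkat(base_fd, "work", AT_REMOVEDIR) == 0 &&  /* so it can go */
        faccessat(base_fd, "victim/keep", F_OK, 0) == 0) /* target intact */
      ok = 0;
  }

  remove_children(base_fd); /* cleanup; takes ownership of base_fd */
  rmdir(base);
  return ok;
}
```

If an entry is swapped for a symlink after openat() has succeeded, the recursion still operates on the directory that was actually opened, and the later unlinkat() with AT_REMOVEDIR fails on the symlink instead of following it. Renaming the parent directory during the traversal does not matter either, since no full path is ever re-resolved.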

          bibinchundatt Bibin A Chundatt added a comment -

          Jason Lowe/Varun Vasudev

          I tried the same scenarios on branch 2.7.2; the signal error does not occur there.

          The signal error and the container initialization exception do not cause any task failures.
          If there is scope for improvement, we can raise a new JIRA; there is no need to handle it as part of this one.
          The localization issue is fixed. +1 (non-binding)

          vvasudev Varun Vasudev added a comment -

          The signal container exception and the container initialization error can be ignored. The signal container exception is due to the fact that we call signal container as part of the container cleanup, and the container initialization error is due to the MR AM killing the last reducer.
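
For reading the exit code 143 reported for the launch failure in this thread: by shell convention, a process killed by signal N is reported with status 128 + N, so 143 would correspond to signal 15 (SIGTERM), which fits a container being killed. A small sketch of that decoding. The helper name and the upper bound of 64 signals are assumptions, and container-executor's own return codes (values at or below 128, or the 255 seen in the description) are deliberately not interpreted:

```c
#include <assert.h>
#include <signal.h>

/*
 * Hypothetical helper for reading NodeManager logs.
 * Returns the signal number N when exit_code looks like 128 + N for a
 * plausible signal (1..64 assumed), or 0 when it does not.
 */
int signal_from_exit_code(int exit_code) {
  assert(exit_code >= 0 && exit_code <= 255);
  int n = exit_code - 128;
  return (n >= 1 && n <= 64) ? n : 0;
}
```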

          jlowe Jason Lowe added a comment -

          Thanks for the report, Bibin, and the patch, Varun!

          This was triggered by YARN-4594. My apologies for missing it during that review, as I had forgotten about the symlinks in the container directory.

          +1 patch looks good to me. Pinging Colin P. McCabe in case he has time to take a look as well.

          "Signal to container is throwing an exception in LCE"

          I don't believe that is related since YARN-4594 didn't modify the signal path IIRC. I think we should address that as a separate JIRA. Is the signal issue something happening in 2.8 or earlier?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 11s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          -1    test4tests  0m 0s    The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1    mvninstall  6m 36s   trunk passed
          +1    compile     0m 22s   trunk passed with JDK v1.8.0_72
          +1    compile     0m 25s   trunk passed with JDK v1.7.0_95
          +1    mvnsite     0m 27s   trunk passed
          +1    mvneclipse  0m 13s   trunk passed
          +1    mvninstall  0m 24s   the patch passed
          +1    compile     0m 19s   the patch passed with JDK v1.8.0_72
          -1    cc          8m 58s   hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_72 with JDK v1.8.0_72 generated 1 new + 2 unchanged - 1 fixed = 3 total (was 3)
          +1    cc          0m 19s   the patch passed
          +1    javac       0m 19s   the patch passed
          +1    compile     0m 23s   the patch passed with JDK v1.7.0_95
          +1    cc          0m 23s   the patch passed
          +1    javac       0m 23s   the patch passed
          +1    mvnsite     0m 25s   the patch passed
          +1    mvneclipse  0m 10s   the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    unit        8m 39s   hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_72.
          +1    unit        9m 13s   hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_95.
          +1    asflicense  0m 19s   Patch does not generate ASF License warnings.
                            28m 23s



          Subsystem                   Report/Notes
          Docker                      Image:yetus/hadoop:0ca8df7
          JIRA Patch URL              https://issues.apache.org/jira/secure/attachment/12789913/YARN-4731.001.patch
          JIRA Issue                  YARN-4731
          Optional Tests              asflicense compile cc mvnsite javac unit
          uname                       Linux a92ee9de4749 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool                  maven
          Personality                 /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision                trunk / 6979cbf
          Default Java                1.7.0_95
          Multi-JDK versions          /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          cc                          hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_72: https://builds.apache.org/job/PreCommit-YARN-Build/10635/artifact/patchprocess/diff-compile-cc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_72.txt
          JDK v1.7.0_95 Test Results  https://builds.apache.org/job/PreCommit-YARN-Build/10635/testReport/
          modules                     C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output              https://builds.apache.org/job/PreCommit-YARN-Build/10635/console
          Powered by                  Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          bibinchundatt Bibin A Chundatt added a comment -

          Varun Vasudev
          I have checked the attached patch.

          Issue fixed

          1. Container localization files are now deleted properly

          Issues that still exist

          1. Signal to container is throwing an exception in LCE
          2016-02-25 13:08:20,442 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: Using container runtime: DefaultLinuxContainerRuntime
          2016-02-25 13:08:20,447 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 9. Privileged Execution Operation Output:
          main : command provided 2
          main : run as user is yarn
          main : requested yarn user is yarn
          Full command array for failed execution:
          [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, yarn, yarn, 2, 23524, 9]
          2016-02-25 13:08:20,447 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Signal container failed. Exception:
          org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=9:
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
                  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
                  at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor$DelayedProcessKiller.run(ContainerExecutor.java:532)
          Caused by: ExitCodeException exitCode=9:
                  at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
                  at org.apache.hadoop.util.Shell.run(Shell.java:838)
                  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
                  ... 4 more
          
          
          2. A container initialization error was thrown
          2016-02-25 13:08:20,183 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: Using container runtime: DefaultLinuxContainerRuntime
          2016-02-25 13:08:20,191 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 143. Privileged Execution Operation Output:
          main : command provided 1
          main : run as user is yarn
          main : requested yarn user is yarn
          Getting exit code file...
          Creating script paths...
          Writing pid file...
          Writing to tmp file /opt/bibin/dsperf/HAINSTALL/nmlocal/nmPrivate/application_1456385661741_0001/container_1456385661741_0001_01_000003/container_1456385661741_0001_01_000003.pid.tmp
          Writing to cgroup task files...
          Creating local dirs...
          Launching container...
          Getting exit code file...
          Creating script paths...
          Full command array for failed execution:
          [nice, -n, 0, /opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, yarn, yarn, 1, application_1456385661741_0001, container_1456385661741_0001_01_000003, /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/yarn/appcache/application_1456385661741_0001/container_1456385661741_0001_01_000003, /opt/bibin/dsperf/HAINSTALL/nmlocal/nmPrivate/application_1456385661741_0001/container_1456385661741_0001_01_000003/launch_container.sh, /opt/bibin/dsperf/HAINSTALL/nmlocal/nmPrivate/application_1456385661741_0001/container_1456385661741_0001_01_000003/container_1456385661741_0001_01_000003.tokens, /opt/bibin/dsperf/HAINSTALL/nmlocal/nmPrivate/application_1456385661741_0001/container_1456385661741_0001_01_000003/container_1456385661741_0001_01_000003.pid, /opt/bibin/dsperf/HAINSTALL/nmlocal, /opt/bibin/dsperf/HAINSTALL/nmlog, cgroups=/cgroups/cpu/hadoop-yarn/container_1456385661741_0001_01_000003/tasks]
          2016-02-25 13:08:20,191 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Launch container failed. Exception:
          org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=143:
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:103)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:100)
                  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:408)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:319)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:88)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          Caused by: ExitCodeException exitCode=143:
                  at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
                  at org.apache.hadoop.util.Shell.run(Shell.java:838)
                  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
                  ... 9 more
          2016-02-25 13:08:20,192 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1456385661741_0001_01_000003 is : 143
          
          

          Varun Vasudev: Should I raise a separate JIRA for these, or will they be handled as part of this one?

          bibinchundatt Bibin A Chundatt added a comment -

          Varun Vasudev

          Thank you for attaching the patch. I will try it out.

          vvasudev Varun Vasudev added a comment -

          The root cause is that we call fstat on an already-open fd. The open call follows the symlink, so we stat the directory the symlink points to instead of the symlink itself. As a result we call rmdir on the entry, and rmdir fails with "Not a directory" because it cannot remove symlinks.

          The other problem with following symlinks in our case is that we end up deleting public resources because we use symlinks in the container work dir to point to the actual resources.

          I've attached a patch to not follow symlinks and just call unlink on the symlink itself.

          Jason Lowe - can you please take a look?
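
          The idea described above can be sketched roughly as follows. This is not the attached patch: the function names remove_children_nofollow and remove_children_of are hypothetical, and the sketch only assumes the standard POSIX *at() calls. Each entry is examined with fstatat and AT_SYMLINK_NOFOLLOW (the equivalent of lstat), so a symlink is reported as a symlink and is removed with unlink, while only real directories are descended into and removed with rmdir semantics.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not the container-executor code): removes every
 * entry below the directory open on dfd without following symlinks.
 * Takes ownership of dfd. Returns 0 on success or an errno value. */
static int remove_children_nofollow(int dfd) {
  DIR *dir = fdopendir(dfd);
  if (dir == NULL) {
    int err = errno;
    close(dfd);
    return err;
  }
  int ret = 0;
  struct dirent *de;
  while (ret == 0 && (de = readdir(dir)) != NULL) {
    if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) {
      continue;
    }
    struct stat st;
    /* Stat the entry itself, not whatever a symlink points to. */
    if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0) {
      ret = errno;
    } else if (S_ISDIR(st.st_mode)) {
      /* A real directory: O_NOFOLLOW makes the open fail if the entry
       * was swapped for a symlink after the stat. */
      int child = openat(dirfd(dir), de->d_name,
                         O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
      if (child < 0) {
        ret = errno;
      } else {
        ret = remove_children_nofollow(child);
        if (ret == 0 &&
            unlinkat(dirfd(dir), de->d_name, AT_REMOVEDIR) != 0) {
          ret = errno;
        }
      }
    } else {
      /* Regular files and symlinks: remove the directory entry only. */
      if (unlinkat(dirfd(dir), de->d_name, 0) != 0) {
        ret = errno;
      }
    }
  }
  closedir(dir); /* also closes dfd */
  return ret;
}

/* Hypothetical entry point: empties dir_path but leaves dir_path itself. */
int remove_children_of(const char *dir_path) {
  assert(dir_path != NULL);
  int dfd = open(dir_path, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
  if (dfd < 0) {
    return errno;
  }
  return remove_children_nofollow(dfd);
}
```

          With a work dir that contains a symlink such as job.jar pointing at a shared resource directory, remove_children_of on the work dir would be expected to remove only the link and leave the shared directory and its contents in place.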

          bibinchundatt Bibin A Chundatt added a comment -

          The exception below is also thrown on the signal operation for a container:

          2016-02-25 10:49:57,704 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime: Using container runtime: DefaultLinuxContainerRuntime
          2016-02-25 10:49:57,709 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 9. Privileged Execution Operation Output:
          main : command provided 2
          main : run as user is hdfs
          main : requested yarn user is hdfs
          Full command array for failed execution:
          [/opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor, hdfs, hdfs, 2, 4850, 15]
          2016-02-25 10:49:57,710 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime: Signal container failed. Exception:
          org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=9:
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:173)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:132)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:109)
                  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:513)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:520)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:139)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:55)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
                  at java.lang.Thread.run(Thread.java:745)
          Caused by: ExitCodeException exitCode=9:
                  at org.apache.hadoop.util.Shell.runCommand(Shell.java:927)
                  at org.apache.hadoop.util.Shell.run(Shell.java:838)
                  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1117)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:150)
                  ... 9 more
          
          bibinchundatt Bibin A Chundatt added a comment -

          Command array logs
          /opt/bibin/dsperf/HAINSTALL/install/hadoop/nodemanager/bin/container-executor dsperf dsperf 3 /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0002/container_e02_1456319010019_0002_01_000001
          main : command provided 3
          main : run as user is dsperf
          main : requested yarn user is dsperf
          failed to rmdir job.jar: Not a directory
          Error while deleting /opt/bibin/dsperf/HAINSTALL/nmlocal/usercache/dsperf/appcache/application_1456319010019_0002/container_e02_1456319010019_0002_01_000001: 20 (Not a directory)


            People

            • Assignee:
              cmccabe Colin P. McCabe
              Reporter:
              bibinchundatt Bibin A Chundatt
            • Votes:
              0
              Watchers:
              12


                Development