  Hadoop YARN / YARN-543 [Umbrella] NodeManager localization related issues / YARN-2902

Killing a container that is localizing can orphan resources in the DOWNLOADING state

    Details

    • Hadoop Flags: Reviewed

      Description

      If a container is in the process of localizing when it is stopped/killed, its resources are left in the DOWNLOADING state. If no other container comes along and requests these resources, they linger with no reference counts but are never cleaned up by the normal cache cleanup scans, since those scans never delete resources in the DOWNLOADING state even when their reference count is zero.
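
      A minimal sketch of the gap (illustrative only, not the actual NodeManager code): the cache cleaner only considers fully localized, unreferenced entries, so an entry orphaned in DOWNLOADING with a zero reference count never becomes eligible for deletion and its partial download stays on disk.

      // Illustrative model only -- not the real NodeManager classes.
      enum ResourceState { INIT, DOWNLOADING, LOCALIZED, FAILED }

      class CachedResource {
        ResourceState state = ResourceState.DOWNLOADING; // container was killed mid-download
        int refCount = 0;                                // no container references it any more
      }

      class CacheCleaner {
        // Only fully LOCALIZED, unreferenced entries are removable, so the
        // orphaned DOWNLOADING entry above fails this check forever.
        boolean isRemovable(CachedResource r) {
          return r.refCount == 0 && r.state == ResourceState.LOCALIZED;
        }
      }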

      Attachments

      1. YARN-2902.patch
        8 kB
        Varun Saxena
      2. YARN-2902.002.patch
        9 kB
        Varun Saxena
      3. YARN-2902.03.patch
        29 kB
        Varun Saxena
      4. YARN-2902.04.patch
        26 kB
        Varun Saxena
      5. YARN-2902.05.patch
        108 kB
        Varun Saxena
      6. YARN-2902.06.patch
        113 kB
        Varun Saxena
      7. YARN-2902.07.patch
        79 kB
        Varun Saxena
      8. YARN-2902.08.patch
        33 kB
        Varun Saxena
      9. YARN-2902.09.patch
        22 kB
        Varun Saxena
      10. YARN-2902.10.patch
        25 kB
        Varun Saxena
      11. YARN-2902.11.patch
        25 kB
        Varun Saxena
      12. YARN-2902-branch-2.6.01.patch
        23 kB
        Varun Saxena

        Issue Links

          Activity

          wilfreds Wilfred Spiegelenburg added a comment -

          Oops, yes, branch-2.6 was the broken one; it has a separate patch. I will port only the ENOENT change to branch-2.6.
          Thank you for confirming this.

          Opened YARN-6056 and will provide a patch there.

          jlowe Jason Lowe added a comment -

          In trunk we "ignore" a non-existent directory in delete_as_user() when the stat returns ENOENT; we do not do that in branch-2. I tracked it back to the backport of this jira into branch-2. The native code part of the change was not ported back into branch-2.

          I'm not seeing that. The "backport" for branch-2 was a cherry-pick:

          commit c75d8b164f05033d0c6ed4876cf0b8d57224aa8c
          Author: Jason Lowe <jlowe@apache.org>
          Date:   Thu Oct 29 16:33:29 2015 +0000
          
              YARN-2902. Killing a container that is localizing can orphan resources in the DOWNLOADING state. Contributed by Varun Saxena
              (cherry picked from commit e2267de2076245bd9857f6a30e3c731df017fef8)
          

          And the diff between trunk and branch-2 shows no difference at all in the delete_as_user function.

          Ah, I believe you mean branch-2.6 instead of branch-2. There is a huge difference between trunk and branch-2.6 for delete_as_user, and I agree it appears the ENOENT change was dropped in the 2.6 patch.

          If it was an oversight, does this require a new jira to backport the native code, or does it get handled as an addendum to this one?

          It would be a new JIRA against 2.6.4. This change has already shipped.
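
          For context, the behaviour being discussed is the native delete_as_user() in container-executor.c treating a stat() failure with ENOENT as "nothing left to delete" rather than as a fatal error. A minimal Java-flavoured sketch of the same pattern (illustrative only; the real change is in the native C code):

          // Illustrative only -- the actual change lives in container-executor.c.
          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.NoSuchFileException;
          import java.nio.file.Path;

          class DeleteHelper {
            static void deleteIfPresent(Path p) throws IOException {
              try {
                Files.delete(p);
              } catch (NoSuchFileException e) {
                // Java analogue of ENOENT: the path is already gone, so there is
                // nothing to clean up and no reason to fail the whole deletion.
              }
            }
          }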

          wilfreds Wilfred Spiegelenburg added a comment -

          I was looking at a problem around removing non-existent directories in the LCE (native code) and saw that there was a difference in behaviour between trunk and branch-2.
          In trunk we "ignore" a non-existent directory in delete_as_user() when the stat returns ENOENT; we do not do that in branch-2. I tracked it back to the backport of this jira into branch-2. The native code part of the change was not ported back into branch-2.

          The question is: was that on purpose or was it an oversight? If it was an oversight, does this require a new jira to backport the native code, or does it get handled as an addendum to this one? It has been a while since this jira was closed, and the fix has been in a number of releases.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9060 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9060/)
          Add YARN-2975, YARN-3893, YARN-2902 and YARN-4354 to Release 2.6.4 entry (junping_du: rev b6c9d3fab9c76b03abd664858f64a4ebf3c2bb20)

          • hadoop-yarn-project/CHANGES.txt
          djp Junping Du added a comment -

          Thanks, Varun Saxena, for the patch against the 2.6 branch. I agree that the whitespace issue is not related to this patch. I have merged it to branch-2.6.

          varun_saxena Varun Saxena added a comment -

          The whitespace result seems weird. Maybe it is because some build-related changes made in trunk are not in this branch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 23s branch-2.6 passed
          +1 compile 0m 24s branch-2.6 passed with JDK v1.8.0_66
          +1 compile 0m 25s branch-2.6 passed with JDK v1.7.0_91
          +1 checkstyle 0m 15s branch-2.6 passed
          +1 mvnsite 0m 29s branch-2.6 passed
          +1 mvneclipse 0m 16s branch-2.6 passed
          +1 findbugs 0m 56s branch-2.6 passed
          -1 javadoc 0m 19s hadoop-yarn-server-nodemanager in branch-2.6 failed with JDK v1.8.0_66.
          +1 javadoc 0m 20s branch-2.6 passed with JDK v1.7.0_91
          +1 mvninstall 0m 23s the patch passed
          +1 compile 0m 20s the patch passed with JDK v1.8.0_66
          +1 javac 0m 20s the patch passed
          +1 compile 0m 24s the patch passed with JDK v1.7.0_91
          +1 javac 0m 24s the patch passed
          +1 checkstyle 0m 10s the patch passed
          +1 mvnsite 0m 25s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          -1 whitespace 0m 2s The patch has 4758 line(s) that end in whitespace. Use git apply --whitespace=fix.
          -1 whitespace 1m 48s The patch has 272 line(s) with tabs.
          +1 findbugs 0m 56s the patch passed
          -1 javadoc 0m 15s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66.
          +1 javadoc 0m 17s the patch passed with JDK v1.7.0_91
          -1 unit 6m 0s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66.
          +1 unit 6m 0s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91.
          -1 asflicense 0m 32s Patch generated 75 ASF License warnings.
          31m 7s



          Reason Tests
          JDK v1.8.0_66 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:date2016-01-06
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12780377/YARN-2902-branch-2.6.01.patch
          JIRA Issue YARN-2902
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b384d6abd028 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.6 / 74027c2
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v1.3.9
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/whitespace-tabs.txt
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10173/testReport/
          asflicense https://builds.apache.org/job/PreCommit-YARN-Build/10173/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 70MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/10173/console

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 9m 0s branch-2.6 passed
          +1 compile 0m 27s branch-2.6 passed with JDK v1.8.0_66
          +1 compile 0m 30s branch-2.6 passed with JDK v1.7.0_91
          +1 checkstyle 0m 16s branch-2.6 passed
          +1 mvnsite 0m 33s branch-2.6 passed
          +1 mvneclipse 0m 17s branch-2.6 passed
          +1 findbugs 1m 1s branch-2.6 passed
          -1 javadoc 0m 21s hadoop-yarn-server-nodemanager in branch-2.6 failed with JDK v1.8.0_66.
          +1 javadoc 0m 21s branch-2.6 passed with JDK v1.7.0_91
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 28s the patch passed with JDK v1.8.0_66
          +1 javac 0m 28s the patch passed
          +1 compile 0m 27s the patch passed with JDK v1.7.0_91
          +1 javac 0m 27s the patch passed
          +1 checkstyle 0m 12s the patch passed
          +1 mvnsite 0m 28s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          -1 whitespace 0m 1s The patch has 4465 line(s) that end in whitespace. Use git apply --whitespace=fix.
          -1 whitespace 1m 59s The patch has 272 line(s) with tabs.
          +1 findbugs 1m 1s the patch passed
          -1 javadoc 0m 21s hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66.
          +1 javadoc 0m 22s the patch passed with JDK v1.7.0_91
          +1 unit 6m 4s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66.
          +1 unit 6m 8s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91.
          -1 asflicense 0m 32s Patch generated 75 ASF License warnings.
          46m 39s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:date2016-01-06
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12780377/YARN-2902-branch-2.6.01.patch
          JIRA Issue YARN-2902
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 9e273af6efbe 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.6 / 74027c2
          Default Java 1.7.0_91
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
          findbugs v1.3.9
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10169/artifact/patchprocess/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/10169/artifact/patchprocess/whitespace-eol.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/10169/artifact/patchprocess/whitespace-tabs.txt
          javadoc https://builds.apache.org/job/PreCommit-YARN-Build/10169/artifact/patchprocess/patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
          JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10169/testReport/
          asflicense https://builds.apache.org/job/PreCommit-YARN-Build/10169/artifact/patchprocess/patch-asflicense-problems.txt
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 71MB
          Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/10169/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

          Reopening issue to run QA for branch-2.6 patch

          djp Junping Du added a comment -

          Sounds like a good plan. Thanks!

          varun_saxena Varun Saxena added a comment -

          Does that mean that if we need to pull in that fix, the patch here needs to be updated?

          Yes, the patch would have to be updated with changes in container-executor.c.

          Yeah, there were a few conflicts. I will reopen and run QA after the decision on YARN-3089.

          djp Junping Du added a comment -

          However, YARN-3089 is unlikely to be backported to branch-2.6

          I just pinged the author/reviewer on that JIRA. Let's wait and see. Does that mean that if we need to pull in that fix, the patch here needs to be updated? If so, we should wait a while on that patch.

          djp Junping Du added a comment -

          Thanks Varun Saxena for replying and delivering the patch!

          Also, should I reopen this JIRA to run QA on this patch?

          It depends on whether we have many conflicts in merging with the original patch. If the answer is yes, then we'd better reopen the JIRA and run the Jenkins test.

          varun_saxena Varun Saxena added a comment -

          Junping Du, I have uploaded a patch backporting the changes to branch-2.6.
          Changes in container-executor.c are not required, as that change was necessitated by YARN-3089, which is not in branch-2.6.
          If YARN-3089 is ever brought into branch-2.6, it will break the fix made by this JIRA (while using LCE).
          Would a comment on that JIRA be enough to avoid this potential problem?
          However, YARN-3089 is unlikely to be backported to branch-2.6.

          We must backport YARN-4354 (as you mentioned) and YARN-4380 (the newly added test here fails intermittently without it) if we get this into branch-2.6.

          Also, should I reopen this JIRA to run QA on this patch?

          varun_saxena Varun Saxena added a comment -

          Yes this issue should exist in 2.6 too.
          I can help you with backporting it.

          djp Junping Du added a comment -

          Hi Jason Lowe and Varun Saxena, shall we get the fix (and YARN-4354) to branch-2.6 also?

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2489 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2489/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #551 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/551/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          varun_saxena Varun Saxena added a comment -

          Thanks Jason Lowe for the review and commit.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #613 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/613/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #1336 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1336/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2543 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2543/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #601 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/601/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          jlowe Jason Lowe added a comment -

          Thanks, Varun! I committed this to trunk, branch-2, and branch-2.7.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8724 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8724/)
          YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          jlowe Jason Lowe added a comment -

          +1 latest patch lgtm. Committing this.

          varun_saxena Varun Saxena added a comment -

          Fixed.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 3m 10s trunk passed
          +1 compile 0m 21s trunk passed with JDK v1.8.0_66
          +1 compile 0m 22s trunk passed with JDK v1.7.0_85
          +1 checkstyle 0m 11s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 0m 53s trunk passed
          +1 javadoc 0m 18s trunk passed with JDK v1.8.0_66
          +1 javadoc 0m 22s trunk passed with JDK v1.7.0_85
          +1 mvninstall 0m 21s the patch passed
          +1 compile 0m 21s the patch passed with JDK v1.8.0_66
          +1 cc 0m 21s the patch passed
          +1 javac 0m 21s the patch passed
          +1 compile 0m 23s the patch passed with JDK v1.7.0_85
          +1 cc 0m 23s the patch passed
          +1 javac 0m 23s the patch passed
          +1 checkstyle 0m 10s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed with JDK v1.8.0_66
          +1 javadoc 0m 21s the patch passed with JDK v1.7.0_85
          +1 unit 8m 33s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66.
          +1 unit 8m 58s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85.
          +1 asflicense 0m 22s Patch does not generate ASF License warnings.
          28m 1s



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-29
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12769486/YARN-2902.11.patch
          JIRA Issue YARN-2902
          Optional Tests asflicense javac javadoc mvninstall unit findbugs checkstyle compile cc
          uname Linux bf5ccb831f38 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-67f42f1/precommit/personality/hadoop.sh
          git revision trunk / c416999
          Default Java 1.7.0_85
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_85
          findbugs v3.0.0
          JDK v1.7.0_85 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9601/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 224MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9601/console

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch, Varun! Patch looks great except for one thing that I missed earlier. In container-executor.c the code is assuming any error returned by stat must mean the file is missing, and that's not correct. It should be checking errno for ENOENT, as anything else is an error not related to the file being already deleted.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe,
          For some reason, the QA report was not published to JIRA, seemingly after the integration with Yetus.
          It's successful though. Kindly find the report here: https://builds.apache.org/job/PreCommit-YARN-Build/9593/console

          +1 overall

          Vote Subsystem Runtime Comment
          0 reexec 0m 8s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
          +1 mvninstall 3m 22s trunk passed
          +1 compile 0m 24s trunk passed with JDK v1.8.0_60
          +1 compile 0m 25s trunk passed with JDK v1.7.0_79
          +1 checkstyle 0m 12s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 1m 3s trunk passed
          +1 javadoc 0m 24s trunk passed with JDK v1.8.0_60
          +1 javadoc 0m 27s trunk passed with JDK v1.7.0_79
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 26s the patch passed with JDK v1.8.0_60
          +1 cc 0m 26s the patch passed
          +1 javac 0m 26s the patch passed
          +1 compile 0m 25s the patch passed with JDK v1.7.0_79
          +1 cc 0m 25s the patch passed
          +1 javac 0m 25s the patch passed
          +1 checkstyle 0m 13s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 12s the patch passed
          +1 javadoc 0m 22s the patch passed with JDK v1.8.0_60
          +1 javadoc 0m 26s the patch passed with JDK v1.7.0_79
          +1 unit 9m 18s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60.
          +1 unit 9m 29s hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_79.
          +1 asflicense 0m 25s Patch does not generate ASF License warnings.
              30m 42s
          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-28
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12769215/YARN-2902.10.patch
          JIRA Issue YARN-2902
          Optional Tests asflicense javac javadoc mvninstall unit findbugs checkstyle compile cc
          uname Linux 0bcb1b5c92b2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-06655ab/dev-support/personality/hadoop.sh
          git revision trunk / a04b169
          Default Java 1.7.0_79
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79
          findbugs v3.0.0
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9593/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 227MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9593/console


          varun_saxena Varun Saxena added a comment -

          Findbugs and test failures are related. Attaching another patch

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 18m 57s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 9m 1s There were no new javac warning messages.
          +1 javadoc 11m 28s There were no new javadoc warning messages.
          +1 release audit 0m 28s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 42s There were no new checkstyle issues.
          +1 whitespace 0m 6s The patch has no lines that end in whitespace.
          +1 install 1m 43s mvn install still works.
          +1 eclipse:eclipse 0m 37s The patch built with eclipse:eclipse.
          -1 findbugs 1m 25s The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 9m 21s Tests failed in hadoop-yarn-server-nodemanager.
              53m 52s  



          Reason Tests
          FindBugs module:hadoop-yarn-server-nodemanager
          Failed unit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
            hadoop.yarn.server.nodemanager.containermanager.localizer.TestLocalResourcesTrackerImpl



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12769131/YARN-2902.09.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 68ce93c
          Findbugs warnings https://builds.apache.org/job/PreCommit-YARN-Build/9592/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9592/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9592/testReport/
          Java 1.7.0_55
          uname Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9592/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review. The patch applies on branch-2.7 too.

          jlowe Jason Lowe added a comment -

          I think in LocalResourcesTrackerImpl#handle, after handling the RELEASE event, we should check whether the reference count is 0 and the state of the resource is DOWNLOADING. If so, call LocalResourcesTrackerImpl#removeResource.

          Agreed. We can automatically remove the resource if the refcount of a downloading resource ever goes to zero. And if there's a race where another container tries to reference that resource just as we're releasing (and removing) it from a killed container, then either we'll keep it because the refcount is nonzero (request comes before release) or we'll create a new resource to track the subsequent request (release comes before request).
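
          To make the release-time check above concrete, here is a small, self-contained Java sketch of the idea (not the NodeManager code and not the committed patch): the class name, fields, and the simple refcount bookkeeping are illustrative assumptions; the real LocalResourcesTrackerImpl does its bookkeeping roughly inside LocalizedResource, with proper event handling and synchronization.

          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          /**
           * Illustrative model of the proposal discussed above: when the last
           * reference to a resource that is still downloading is released, the
           * tracker entry is removed instead of lingering with a zero refcount.
           */
          class DownloadingResourceTrackerSketch {
            enum State { DOWNLOADING, LOCALIZED }

            static final class Tracked {
              State state = State.DOWNLOADING;
              int refCount;
            }

            private final Map<String, Tracked> resources =
                new ConcurrentHashMap<String, Tracked>();

            void request(String key) {
              Tracked t = resources.get(key);
              if (t == null) {
                t = new Tracked();
                resources.put(key, t);
              }
              t.refCount++;
            }

            void release(String key) {
              Tracked t = resources.get(key);
              if (t == null) {
                return;
              }
              t.refCount--;
              // The check proposed above: zero references while still DOWNLOADING
              // means nobody will finish or reuse this entry, so drop it now rather
              // than waiting for a cache cleanup scan that will never touch it.
              if (t.refCount <= 0 && t.state == State.DOWNLOADING) {
                resources.remove(key);
              }
            }
          }

          If a request from another container races with the release, either the entry is still present (request processed first) or a fresh entry is created for it (release processed first), which mirrors the race described in the comment above.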

          varun_saxena Varun Saxena added a comment -

          Thanks a lot Jason Lowe for the review.
          I was under the incorrect impression that the downloading resource would not be taken up by other containers again. You are correct that we should not FAIL the resource here. It will be taken up by an outstanding container when the next HB comes. If we do not call handleDownloadingRsrcsOnCleanup, we won't need to synchronize the scheduled map either.

          Also, event.getResource().getLocalPath() can be used here too. This would preclude the need for the ScheduledResource class and hence the refactoring associated with it.

          However, as the resource would not be explicitly FAILED in this case, we should probably do some cleanup when the reference count of a downloading resource becomes 0. Otherwise the entry associated with the downloading resource will remain in LocalResourcesTrackerImpl#localResourceMap and may show up when cache cleanup is done.
          And we may end up hitting the same LOG.error("Attempt to remove resource: " + rsrc + " with non-zero refcount") message even though the resource has been deleted on disk.
          I think in LocalResourcesTrackerImpl#handle, after handling the RELEASE event, we should check whether the reference count is 0 and the state of the resource is DOWNLOADING. If so, call LocalResourcesTrackerImpl#removeResource.
          Thoughts?

          jlowe Jason Lowe added a comment -

          Forgot to respond to this comment:

          That is, if NM recovery is not enabled and the deletion task is scheduled, but the deletion task is sitting in the deletion service's executor queue because all 4 threads in the deletion service's executor (NM delete threads) are occupied. If the NM goes down before this task is taken up, the downloading resources won't be deleted.

          If NM recovery is not enabled then failing to delete when the NM crashes is already a known issue. As for the normal termination scenario we should be stopping the ResourceLocalizationService (via the ContainerManager shutdown) before trying to stop the DeletionService, so I would expect deletions to be queued up before we stop that service.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch, Varun!

          I'm not sure the logic in handleDownloadingRsrcsOnCleanup is desired. IIUC, when a container is killed, all resources that were still pending will receive a resource-failed event. Those resources will in turn send every container that references them a failed-resource message, and that will effectively kill those containers. This could be bad if the killed container was localizing a resource for more than one container. Today, when the container is killed, I believe another outstanding container will simply acquire the lock on the scheduled resource and proceed to re-attempt localization of that resource. If we apply this logic, then we'll kill all those other referencing containers, which isn't correct. We do need to clean up downloading resource objects that have a zero refcount, but not ones that are still referenced by other containers. Those other containers should be able to continue localizing the resource if necessary.

          With the following code, if something is thrown, don't we risk failing to put the resource on the scheduled event queue while still holding the resource lock? In the old code we would always put the resource on the scheduled queue before fielding any errors, and the scheduled queue processing would release the resource lock.

                    if (nRsrc.tryAcquire()) {
                      if (nRsrc.getState() == ResourceState.DOWNLOADING) {
                        [...some code removed for brevity...]
                        try {
                          Path localizationPath = getPathForLocalization(next);
                          synchronized (scheduled) {
                            scheduled.put(nextRsrc,
                                new ScheduledResource(evt, localizationPath));
                          }
                          return NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
                              localizationPath);
                        } catch (IOException e) {
                          LOG.error("local path for PRIVATE localization could not be " +
                              "found. Disks might have failed.", e);
                        } catch (IllegalArgumentException e) {
                          LOG.error("Inorrect path for PRIVATE localization."
                              + next.getResource().getFile(), e);
                        } catch (URISyntaxException e) {
                          LOG.error(
                              "Got exception in parsing URL of LocalResource:"
                                  + next.getResource(), e);
                        }
                      } else {
                        // Need to release acquired lock
                        nRsrc.unlock();
                      }
                    }
          

          Actually do we even need all the code refactoring associated with the new ScheduledResource class? I think we can simply use the local path already accessible via event.getResource().getLocalPath() in the cleanup code. If that's not null then that means we sent the localizer that resource in the heartbeat response, and those would be the ones to target for cleanup.
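
          As an aside on the lock-release concern above, here is a minimal, self-contained Java sketch of the acquire/try/finally pattern that avoids holding the resource lock after a failed hand-off. The names (AcquireReleaseSketch, ReentrantLock, choosePath) are generic stand-ins rather than the NM's LocalizedResource semaphore or getPathForLocalization; it illustrates the pattern, not the code that went into the patch.

          import java.util.concurrent.locks.ReentrantLock;

          /**
           * Self-contained illustration of the locking concern raised above: once the
           * per-resource lock has been acquired, every early exit (including an
           * exception thrown while choosing the localization path) must release it,
           * otherwise the resource stays locked and no other localizer can pick it up.
           */
          class AcquireReleaseSketch {
            private final ReentrantLock resourceLock = new ReentrantLock();

            String scheduleForDownload() {
              if (!resourceLock.tryLock()) {
                return null;                  // another localizer already owns this resource
              }
              boolean handedOff = false;
              try {
                String path = choosePath();   // may throw, like getPathForLocalization()
                handedOff = true;             // from here on the downloader owns the lock
                return path;
              } catch (RuntimeException e) {
                return null;                  // log and skip this resource
              } finally {
                if (!handedOff) {
                  resourceLock.unlock();      // never keep the lock after a failed hand-off
                }
              }
            }

            private String choosePath() {
              return "/local/dir/usercache/app/filecache/42";
            }
          }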

          varun_saxena Varun Saxena added a comment -

          The TestNodeStatusUpdaterForLabels failure will be tracked by YARN-4169.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 pre-patch 19m 23s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 8m 51s There were no new javac warning messages.
          +1 javadoc 11m 16s There were no new javadoc warning messages.
          +1 release audit 0m 27s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 46s The applied patch generated 6 new checkstyle issues (total was 235, now 198).
          +1 whitespace 0m 12s The patch has no lines that end in whitespace.
          +1 install 1m 42s mvn install still works.
          +1 eclipse:eclipse 0m 38s The patch built with eclipse:eclipse.
          +1 findbugs 1m 22s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 9m 5s Tests failed in hadoop-yarn-server-nodemanager.
              53m 48s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12768305/YARN-2902.08.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 124a412
          Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9547/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9547/console

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 pre-patch 18m 3s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 9m 4s There were no new javac warning messages.
          +1 javadoc 11m 49s There were no new javadoc warning messages.
          +1 release audit 0m 26s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 44s The applied patch generated 6 new checkstyle issues (total was 235, now 198).
          +1 whitespace 0m 12s The patch has no lines that end in whitespace.
          +1 install 1m 41s mvn install still works.
          +1 eclipse:eclipse 0m 38s The patch built with eclipse:eclipse.
          +1 findbugs 1m 26s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 9m 24s Tests passed in hadoop-yarn-server-nodemanager.
              53m 30s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12768301/YARN-2902.08.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 124a412
          Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9546/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9546/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review.
          Sorry, I could not upload the patch earlier due to bandwidth issues. But I think it's still on track for 2.7.2.

          Coming to the patch, it handles the deletion in the NM itself. At the time of processing the container cleanup event (after the container is killed), we will transition the downloading resource to FAILED.
          And after the localizer exits, deletion will be done in the finally block of LocalizerRunner, as per the suggestion given above.

          There is one presumably rare scenario where this deletion won't work. That is, if NM recovery is not enabled and the deletion task is scheduled, but the deletion task is sitting in the deletion service's executor queue because all 4 threads in the deletion service's executor (NM delete threads) are occupied. If the NM goes down before this task is taken up, the downloading resources won't be deleted.

          If you want this handled, we can attempt deletion in the container localizer too. I already have code for it (in earlier patches). But do we need to handle this rare case? Let me know.
          BTW, the patch does not apply cleanly on branch-2.7, so I will update that patch once the trunk patch is OK to go in.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe,

          The key with the new proposal is that the LocalizerRunner thread is the one issuing the deletes and only after the localizer process exits.

          +1. This makes sense. I can't think of any case where this won't work.

          jlowe Jason Lowe added a comment -

          The key with the new proposal is that the LocalizerRunner thread is the one issuing the deletes and only after the localizer process exits.

          2. Container is killed. Associated resources are stuck in downloading state and a deletion task is launched for them.

          That deletion task would not be launched because the localizer has not exited. The LocalizerRunner will still be waiting on the localizer process to exit.

          The localizer doesn't currently exit immediately when the container is killed, even though we interrupt the thread.

          Yes, and that's fine. We won't issue the deletion requests until the localizer process eventually exits.

          The key is this code in LocalizerRunner:

          LocalizerRunner
              public void run() {
                Path nmPrivateCTokensPath = null;
                Throwable exception = null;
                try {
          [...localizer pre-startup code removed for brevity...]
                  if (dirsHandler.areDisksHealthy()) {
                    exec.startLocalizer(new LocalizerStartContext.Builder()
                        .setNmPrivateContainerTokens(nmPrivateCTokensPath)
                        .setNmAddr(localizationServerAddress)
                        .setUser(context.getUser())
                        .setAppId(ConverterUtils.toString(context.getContainerId()
                            .getApplicationAttemptId().getApplicationId()))
                        .setLocId(localizerId)
                        .setDirsHandler(dirsHandler)
                        .build());
                  } else {
                    throw new IOException("All disks failed. "
                        + dirsHandler.getDisksHealthReport(false));
                  }
                // TODO handle ExitCodeException separately?
                } catch (FSError fe) {
                  exception = fe;
                } catch (Exception e) {
                  exception = e;
                } finally {
                  if (exception != null) {
                    LOG.info("Localizer failed", exception);
                    // On error, report failure to Container and signal ABORT
                    // Notify resource of failed localization
                    ContainerId cId = context.getContainerId();
                    dispatcher.getEventHandler().handle(new ContainerResourceFailedEvent(
                        cId, null, exception.getMessage()));
                  }
                  for (LocalizerResourceRequestEvent event : scheduled.values()) {
                    event.getResource().unlock();
                  }
                  delService.delete(null, nmPrivateCTokensPath, new Path[] {});
                }
          

          startLocalizer won't return until the localizer process exits, so when the runner iterates the scheduled map in the finally block to unlock the resources, we can issue deletions for the local resource paths at the same time.
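
          To illustrate, here is a hedged sketch of how that finally block could issue the deletions while unlocking. It is a fragment of the run() method quoted above, not a standalone program and not the committed patch; the only assumption beyond the quoted code is the "_tmp" suffix, which mirrors the temporary download directory convention assumed for FSDownload and may differ from what the final patch actually deletes.

              } finally {
                if (exception != null) {
                  LOG.info("Localizer failed", exception);
                  // On error, report failure to Container and signal ABORT
                  dispatcher.getEventHandler().handle(new ContainerResourceFailedEvent(
                      context.getContainerId(), null, exception.getMessage()));
                }
                for (LocalizerResourceRequestEvent event : scheduled.values()) {
                  // A non-null local path means this resource was handed to the (now
                  // dead) localizer in a heartbeat response but never reported back,
                  // so whatever it left on disk is orphaned and can be deleted.
                  Path localPath = event.getResource().getLocalPath();
                  if (localPath != null) {
                    delService.delete(context.getUser(), localPath);
                    // assumed temporary download dir convention (see note above)
                    delService.delete(context.getUser(),
                        new Path(localPath.toString() + "_tmp"));
                  }
                  event.getResource().unlock();
                }
                delService.delete(null, nmPrivateCTokensPath, new Path[] {});
              }

          How far FSDownload got before the process died determines which of the two directories actually exists, so deleting both paths is the conservative choice.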

          varun_saxena Varun Saxena added a comment -

          Jason Lowe,

          We also don't need delayed deletion support in DeletionService, since we know the container localizer process is dead.

          The localizer doesn't currently exit immediately when the container is killed, even though we interrupt the thread. We first issue a DIE on the next HB and only then does the localizer exit.
          Between the time the container is killed and the localizer exits, a resource may be downloaded or may start downloading. We will delete the tmp directory and the main directory.
          But if a download is started by the localizer in the meantime, it will recreate the directories (in FSDownload#call).
          That is why the delay has been introduced. Thoughts?

          varun_saxena Varun Saxena added a comment -

          Just to let you know, there is one case where this won't work, I mean after the removal of the flag from the protocol.

          1. NM recovery is disabled.
          2. Container is killed. Associated resources are stuck in downloading state and a deletion task is launched for them.
          3. In the meantime the localizer downloads a resource and, on the next HB, reports the downloaded resource to the NM. In the NM this will still be in the downloading state.
          4. NM indicates to the localizer to DIE. The localizer won't delete the resource it just downloaded.
          5. NM crashes.
          6. The NM would also miss deleting the downloading resource, as recovery is disabled.

          I agree, though, that this should be a very rare scenario and we can skip it.

          varun_saxena Varun Saxena added a comment -

          Sorry, I mean the NM crashes and recovery is not enabled.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, thanks for looking at the patch.

          The reason I inserted the delete-downloading flag in the protocol was to indicate to the localizer that the resources it reported to the NM in the last HB were not processed by the NM, so the localizer needs to delete them. That is why an extra list of paths was maintained in the localizer (paths which have been reported to the NM for download).
          I was primarily working on the principle that we should delete as much as we can in the localizer, so that if the NM crashes and it's not work-preserving, the paths can still be deleted, and vice versa. Two points of deletion make it almost certain that the downloading resources are deleted.

          But yeah, this does make it complex.

          You are correct that the NM will know about these paths as well and can delete them. The extra flag in the localizer protocol can thus be removed.

          I will update the patch.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch, Varun!

          Sorry, I'm still a little confused on why we need to complicate the localizer protocol to fix this issue. Seems like this is a hack to help the NM figure out what's going on, but it should already know this stuff. That prompted me to dig around for an alternative solution, and I think I found one.

          The NM knows the local path where a resource is localized, since it tells the localizer where to put it in the download request. Also each localizer has a LocalizerRunner thread that is tracking it, and it knows which resources were pending when the localizer process exits. That's tracked in the scheduled map so the runner thread can unlock every pending resource to allow a subsequent localizer to try downloading it again. Seems to me all we need to do is have the LocalizerRunner issue a delete of the local path and temporary download path for each resource that was pending at the time the localizer process died, since we know any pending resources when a localizer exits must have been orphaned. Resources that were successfully localized are pulled out of the scheduled map, so the only things left should be the ones we need to process for cleanup.

          That seems like a much simpler implementation, as it doesn't change any protocols and doesn't rely on the container localizer doing any cleanup. The NM will automatically do the cleanup when the localizer exits. We also don't need delayed deletion support in DeletionService, since we know the container localizer process is dead.

          Maybe I'm missing something and that approach can't work. If it can then that seems like a preferable solution for 2.7 as it will be a smaller, simpler patch.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review.

          varun_saxena Varun Saxena added a comment -

          The new patch does the following over the last patch:

          1. Removed the additional param in the container executor to ignore a missing directory. Now this will be the default behaviour.
          2. The patch no longer cancels the deletion task in the NM.
          3. The localizer won't send the extra heartbeat which it was sending to the NM if the NM had told it to DIE.
          4. Do not wait in the localizer for a cancelled task to complete on DIE.
          varun_saxena Varun Saxena added a comment -

          The release audit warning is unrelated. There is already HDFS-9182 for it.
          The checkstyle warning is for file length > 2000.
          Raised YARN-4223 for the pre-patch findbugs warning.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 pre-patch 19m 52s Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 7 new or modified test files.
          +1 javac 7m 58s There were no new javac warning messages.
          +1 javadoc 10m 20s There were no new javadoc warning messages.
          -1 release audit 0m 17s The applied patch generated 1 release audit warnings.
          -1 checkstyle 2m 4s The applied patch generated 1 new checkstyle issues (total was 211, now 211).
          +1 whitespace 2m 33s The patch has no lines that end in whitespace.
          +1 install 1m 40s mvn install still works.
          +1 eclipse:eclipse 0m 36s The patch built with eclipse:eclipse.
          +1 findbugs 4m 39s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 0m 23s Tests passed in hadoop-yarn-api.
          +1 yarn tests 2m 2s Tests passed in hadoop-yarn-common.
          +1 yarn tests 8m 47s Tests passed in hadoop-yarn-server-nodemanager.
              62m 6s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12764917/YARN-2902.07.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 3b85bd7
          Pre-patch Findbugs warnings https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html
          Release Audit https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/patchReleaseAuditProblems.txt
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9342/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9342/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

Ok. Then what I will do is NOT wait for completion of running tasks that have been cancelled.
The localizer will try to delete only the directories for which the download completed. For tasks that failed, the directories will be deleted by FSDownload anyway.

However, we may need a config in the NM for the deletion task delay (the one I have added in the current patch), or we can simply use a hardcoded value of 2 minutes.
Regarding System.exit, it will be called after ExecutorService#shutdownNow (which only interrupts running tasks and does not wait for them) anyway.

          jlowe Jason Lowe added a comment -

          In container localizer, when processing HB DIE response, we send another localizer status to NM. Is it really required ? What do you think ?

          I don't think this is required. If the NM is telling the localizer to DIE then I don't think the NM cares after that point what the localizer is doing. The NM is totally done with it at that point due to failure or lack of knowledge of that localizer.

          Or are you suggesting System exit ?

I was basically suggesting System.exit if we aren't convinced that the localizer can actually tear down in a timely manner. For example, if the graceful shutdown could involve waiting for active transfers to complete because we can't reliably interrupt them, then yes I think System.exit is appropriate. A good compromise would be to put a timeout on shutdown – if we can't get down within so many seconds then have something (e.g.: a watchdog thread if necessary) call System.exit to get out. Otherwise the localizer could still be running and messing with the filesystem after the NM tries to clean up after it.

          Worst-case scenario is this could still happen even with these fixes, but it should resolve the leaking issue for the vast majority of cases. We can make it more bulletproof in a followup JIRA for 2.8 or later that actually has the NM tracking localizer pids and proactively killing them if they don't respond in a timely manner to commands.

          However we can also let localizer not do any cleanup at all and let NM delete paths.

          I would still like the localizer to try to perform some cleanup if possible, as the NM doesn't track localizers in the state store. Therefore if the NM restarts we may not cleanup everything properly if the localizer doesn't do it on its own.
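A minimal sketch of the watchdog compromise mentioned above; the class name, timeout handling and exit code are assumptions, not the patch's actual code:

    // Hedged sketch of a shutdown watchdog for the container localizer.
    class LocalizerExitWatchdog {
      static void scheduleForcedExit(final long timeoutMillis, final int exitCode) {
        Thread watchdog = new Thread(new Runnable() {
          @Override
          public void run() {
            try {
              Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
              return; // graceful shutdown cancelled the watchdog
            }
            // Teardown took too long (e.g. uninterruptible transfers); force the
            // JVM down so the NM's fallback cleanup is not racing a live process.
            System.exit(exitCode);
          }
        }, "localizer-exit-watchdog");
        // Daemon thread: a normal JVM exit is never held up by the watchdog.
        watchdog.setDaemon(true);
        watchdog.start();
      }
    }

The localizer would arm this as soon as it sees DIE, before attempting any best-effort cleanup.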

          varun_saxena Varun Saxena added a comment -

I should say "This way localizer should quit quickly after attempting some cleanup and NM can also attempt cleanup."

However, we could have the localizer do no cleanup at all and let the NM delete the paths; that is also an option.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe,

          As far as properly handling DIE so we actually stop downloading and problems canceling active transfers, can't we just have the localizer forcibly tear down the JVM? If we're being told to DIE then I assume we really don't care about pending transfers completing and just want to get out. If the NM is going to clean up after the localizer anyway, seems like we can drastically simplify DIE handling and just exit the JVM. That seems like a change that's targeted enough to be appropriate for 2.7 instead of adding localizer kill support, etc.

In the container localizer, when processing the heartbeat DIE response, we send another localizer status to the NM. Is it really required? What do you think?
I think as soon as we get DIE, we can follow the current code and cancel pending tasks, but not wait for them to complete (as the newly added code in the patch does), and delete the paths reported in the last status. Then just return from the loop for a graceful shutdown (after stopping the executors).
Or are you suggesting System.exit?

From the NM side, we can have a deletion task after some configured delay (same as right now). Unlike the code in the current patch, though, we will never cancel this deletion task.

This way the localizer should quit quickly and the NM can clean up.
I will change the behavior of the executor on deletion as well, i.e. I will ignore missing paths by default and won't add a flag.
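A rough sketch of the DIE handling described above, with assumed field and method names (pendingResources, downloadedPaths); it cancels without joining, attempts a best-effort delete, and then falls out of the heartbeat loop:

    // Hedged sketch, not the real ContainerLocalizer; names are assumptions.
    import java.io.IOException;
    import java.util.Collection;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Future;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Path;

    class DieHandlingSketch {
      static void handleDie(Map<Path, Future<Path>> pendingResources,
                            Collection<Path> downloadedPaths,
                            ExecutorService downloadPool,
                            FileContext lfs) {
        // Cancel whatever is still downloading, but do not wait for completion:
        // parts of FSDownload#call are uninterruptible, so a join could block.
        for (Future<Path> f : pendingResources.values()) {
          f.cancel(true);
        }
        downloadPool.shutdownNow();

        // Best-effort cleanup of paths already reported to the NM; the NM will
        // delete anything we miss.
        for (Path p : downloadedPaths) {
          try {
            lfs.delete(p, true);
          } catch (IOException ignored) {
            // leave it for the NM's fallback deletion
          }
        }
        // Returning lets the heartbeat loop end; a watchdog/System.exit backs this up.
      }
    }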

          jlowe Jason Lowe added a comment -

          For NM side, currently it does not kill localizers. We can track PID and kill it as discussed earlier if HB doesnt come for a configured period.

          Yeah, I think long-term to make this a lot more stable and reliable we're going to need the ability for the NM to kill localizers explicitly rather than via request. As you mentioned, the concern is that the localizer will not actually interrupt and stop the localization. Having the NM forcibly kill the localizer means we can put less trust in the localizer to always get that right. However that's probably a lot of work and churn to the code which makes it less palatable for a 2.7 inclusion. Ideally we should target a minimal change for 2.7 that gets us past the main problems we're having today, and we can add more bulletproofing in followup JIRAs for subsequent releases.

          As far as properly handling DIE so we actually stop downloading and problems canceling active transfers, can't we just have the localizer forcibly tear down the JVM? If we're being told to DIE then I assume we really don't care about pending transfers completing and just want to get out. If the NM is going to clean up after the localizer anyway, seems like we can drastically simplify DIE handling and just exit the JVM. That seems like a change that's targeted enough to be appropriate for 2.7 instead of adding localizer kill support, etc.

          varun_saxena Varun Saxena added a comment -

          Container localizers should already have the concept of heartbeating and killing themselves if they don't hear from the NM within X seconds, and likewise the NM should kill localizers that don't heartbeat in a timely fashion.

For the container localizer, I think the IPC timeout should work, if configured. Will check it.
For the NM side, currently it does not kill localizers. We can track the PID and kill it, as discussed earlier, if a heartbeat doesn't arrive for a configured period. We can do something similar to what we do for containers now: SIGTERM followed by SIGKILL, if required.
Should we add this then? This would mean we have to relaunch a new localizer if the container is still running. Or fail the container?

          I'm also not sure we need deletion task cancellation. As you point out it's not really necessary.

          Ok. Will remove it.

          Also do we really need a flag to say whether we want it to ignore missing paths? Wondering if we should just ignore cases where the path doesn't exist.

Yes, we can simply ignore missing paths. I did not change it so as not to break previous behavior, but it doesn't seem like anyone depends on that behavior.

          What if we have the localizer register the temporary working directory (i.e.: the _tmp paths) as deleteOnExit paths?

Currently the _tmp paths are deleted in the finally block of FSDownload#call. Wouldn't that be enough to handle the case of a normal JVM exit?

          With this I don't think we need to change the localizer protocol – DIE means try to cleanup, but NM will always cleanup anyway so no need to wait around and try too hard. Its actually more important that the localizer gets out of the way in a timely manner than it is for it to cleanup since the NM will be the backup in case the localizer fails.

For this, the only concern I see is the issue I mentioned in FSDownload: the download task can keep running even after cancel because the code in FSDownload#call is uninterruptible in places.
In that case, we can never know when the cancelled task will complete and create files in the directory. There can be a race which can, for instance, lead to the tmp directory being renamed.
To handle this, the deletion task could list the _tmp dir first and then the real one, so that the tmp directory is deleted first.
Also, we can add some extra seconds on top of the localizer heartbeat timeout before scheduling the file deletion, so that the localizer is killed (assuming we adopt the approach mentioned in the first point above) before we attempt deletion.

However, we would need to change the localizer protocol as well.
Currently the localizer deletes the entry from its pending resources map as soon as it sends a status, and the NM will send a DIE in two cases: 1) the container has been killed, or 2) the NM processes a heartbeat and finds one of the statuses to be FETCH_FAILED. In that case we cannot know whether the resources whose fetch succeeded were actually processed by the NM or not, so I maintain a separate list of the resources reported to the NM. Hence a flag, so that even those resources can be deleted.

          jlowe Jason Lowe added a comment -

          Wow, the patch has gotten a lot larger. Will take me a while to fully digest all of the changes.

          Would we want this to go into 2.7.2 ?

          Ideally yes. We're seeing significant leakage of local resources on many of our nodes due to this problem.

          Back to the patch itself, at a quick glance I'm lukewarm on a couple of points. I'm not thrilled with adding yet another config that users need to tune properly or it doesn't work correctly. Container localizers should already have the concept of heartbeating and killing themselves if they don't hear from the NM within X seconds, and likewise the NM should kill localizers that don't heartbeat in a timely fashion. It seems to me that's the time interval we should be using to determine how long to wait before giving up and having the NM do the cleanup.

          I'm also not sure we need deletion task cancellation. As you point out it's not really necessary. The files being deleted should not be reused later, so there should be no harm in attempting to redundantly delete them. If we decide that we really should have cancellation of deletion tasks then that should be implemented as a separate JIRA to help keep this patch more manageable since that's a readily separable feature that can stand on its own.

          Also do we really need a flag to say whether we want it to ignore missing paths? Wondering if we should just ignore cases where the path doesn't exist. I can see it either way I guess.

I wonder if we can solve this with a simpler approach that should work well in most cases. What if we have the localizer register the temporary working directory (i.e.: the _tmp paths) as deleteOnExit paths? Then the localizer should try to clean these up if there's anything at all resembling a "normal" JVM exit, and I believe the JVM ignores cases when the file is already missing. Then we only have to worry about the case where the localizer dies a horrible death and the JVM doesn't get a chance to clean up. To cover that case, when the NM kills the localizer we also schedule a deletion of the dest and dest_tmp paths. With this I don't think we need to change the localizer protocol – DIE means try to clean up, but the NM will always clean up anyway so no need to wait around and try too hard. It's actually more important that the localizer gets out of the way in a timely manner than it is for it to clean up since the NM will be the backup in case the localizer fails.
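One way the deleteOnExit idea could look, sketched against the local FileSystem; the real localizer goes through FileContext, so the exact API and the "_tmp" naming here follow FSDownload's convention but should be treated as assumptions:

    // Hedged sketch: register the _tmp working directory so it is removed
    // (recursively) when the local filesystem is closed on a normal JVM exit.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class TmpDirCleanupSketch {
      static void registerTmpDir(Path destDirPath) throws IOException {
        FileSystem localFs = FileSystem.getLocal(new Configuration());
        Path tmpDir = new Path(destDirPath + "_tmp");
        // FileSystem#deleteOnExit is a no-op at exit time if the path no longer
        // exists, so a successful rename by FSDownload is not a problem.
        localFs.deleteOnExit(tmpDir);
      }
    }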

          varun_saxena Varun Saxena added a comment -

Would we want this to go into 2.7.2?

          varun_saxena Varun Saxena added a comment -

Implementation details are as follows:

          (FSDownload.java)

• While going through the code, I found one issue in the current code. FSDownload#call is a mix of interruptible and uninterruptible code. This means that when Future#cancel(true) is called, the code might not be interrupted and we may continue downloading, even though CancellationException is thrown on Future#get. But in ContainerLocalizer, on CancellationException we send FETCH_FAILURE back to the NM and do not do any cleanup (deletion of downloaded resources), expecting FSDownload#call to do the cleanup as it indeed does on other exceptions. FSDownload does not provide any method to do the cleanup either. So to resolve this issue, I have added a wait/notify construct to let the download task complete (either through an exception or normally) and then do the cleanup. I thought of checking whether the thread has been interrupted at the end of FSDownload#call, but FileContext#rename in that code resets the interrupted flag, so I am going with this solution.
  This will, however, increase the time taken for the localizer to send its last heartbeat (HB) to the NM.
  Should I raise another JIRA for this?
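A tiny standalone illustration of that pitfall, independent of the Hadoop code: cancel(true) makes the caller see the task as cancelled (and get() would throw CancellationException), yet a task that never checks its interrupt status runs to completion:

    import java.util.concurrent.*;

    public class CancelDoesNotStopUninterruptibleWork {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(new Runnable() {
          @Override
          public void run() {
            // Simulates uninterruptible code (like parts of FSDownload#call):
            // it never checks Thread.interrupted(), so cancel(true) cannot stop it.
            long end = System.currentTimeMillis() + 3000;
            while (System.currentTimeMillis() < end) {
              // keep "downloading"
            }
            System.out.println("task finished even though it was cancelled");
          }
        });
        Thread.sleep(100);
        f.cancel(true);                               // caller sees it as cancelled...
        System.out.println("cancelled=" + f.isCancelled());
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);  // ...but the work still completes
      }
    }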

          (ContainerLocalizer.java)

• A set of paths is maintained for the resources which were successfully downloaded and reported to the NM in a heartbeat. This is done because if the container has been killed, the NM pays no heed to the FETCH_SUCCESS status. So if, in the heartbeat response, the NM indicates that resources have to be deleted, we also need a record of what we sent in that heartbeat. This extra set is required because we currently do not wait for the response from the NM before deleting the entry from the pending resources map (we delete it at the time of sending the heartbeat). Should we change that?
• We use this set of paths to delete the resources reported to the NM at the time of localizer DIE. The pendingResources map is used for cleanup of resources which have not been reported to the NM.

          (ResourceLocalizationService.java)

• On container kill, iterate over all the scheduled resources, and for the resources which are in the DOWNLOADING state, schedule a deletion task which runs after a delay specified by the config yarn.nodemanager.localizer.downloading-rsrcs.deletion.wait-sec.
• An additional map of localizer id (key) to deletion task future plus deletion task id for the state store (value) is maintained for cancellation of the deletion task later (upon a heartbeat from the localizer).
• On a localizer heartbeat, we try to cancel the deletion task and remove the task from the state store. Moreover, a custom deletion task is created which calls the FileDeletionTask and also removes the entry from the map above after the deletion task completes. This can lead to a minor race for removal of the task from the state store (i.e. a race between task completion and cancellation) if the code flow is not interruptible. But I guess that won't be a problem, as removing a task by ID from leveldb simply does nothing if the key does not exist.
• In case of NM restart (which should be quite rare), the map above would be lost. But that should be fine. The only side effect is that running deletion tasks may not be cancelled, and even if we attempt to delete a non-existent directory, it should not be a problem.

          (DeletionService.java and Container Executor(s))

• Made the relevant changes in DeletionService to ensure deletion tasks can be scheduled with a delay.
• Made changes so that the deletion task's future and the task id used for the state store are returned to the caller for cancellation later. Also added code for cancellation of a running task.
• Added a new flag "ignoreMissingDir" in the deletion task. Currently the executors will throw an error if any directory among the list of baseDirs passed for deletion does not exist. For this use case we instead ignore such missing directories and continue with deletion of the other directories, because the NM may have an inconsistent view of what needs to be deleted. The flag has been added to support this new behavior while keeping the old behavior when the flag is false.
• Changes have been made in container-executor as well to support this flag. An additional command line parameter has been added (for the delete-as-user command). A value of 0 is analogous to the flag being false, and 1 means true.

A few additional points to note:

1. xxxx_tmp directories should be deleted by the localizer itself (in FSDownload). I have added it because, as you said, localizers can sometimes turn rogue; in that case the NM will delete these dirs. Should I handle the tmp dirs then?
2. I am assuming it would be fine to change the command line parameters for a specific container-executor command in 2.8. If we do not want to change the command line parameters for container-executor, there are workarounds.
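A rough sketch of the delayed NM-side deletion described above, using a plain ScheduledExecutorService and a hypothetical FileDeleter stand-in; the real DeletionService adds state-store bookkeeping and container-executor invocation on top of this:

    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.fs.Path;

    class DelayedDeletionSketch {
      private final ScheduledExecutorService sched =
          Executors.newScheduledThreadPool(1);

      // Hedged sketch: schedule deletion of a downloading resource's dest and
      // dest_tmp paths after the configured delay, returning the future so a
      // later heartbeat could cancel it if desired.
      ScheduledFuture<?> scheduleDeletion(final String user,
                                          final List<Path> pathsToDelete,
                                          long delaySeconds,
                                          final FileDeleter deleter) {
        return sched.schedule(new Runnable() {
          @Override
          public void run() {
            for (Path p : pathsToDelete) {
              deleter.delete(user, p); // should ignore already-missing paths
            }
          }
        }, delaySeconds, TimeUnit.SECONDS);
      }

      // Stand-in for DeletionService/container-executor deletion; hypothetical.
      interface FileDeleter {
        void delete(String user, Path path);
      }
    }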
          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review.

The patch, at a very high level, does the following:

1. On container kill, the NM (localization service) will create a deletion task for all the downloading resources and schedule it to run after a configured delay (a new config was added for it). I made the decision not to wait for a heartbeat from the localizer first because we would not want to depend on the localizer if there is some problem there and it does not send a heartbeat.
2. On a subsequent heartbeat from the localizer, the NM will indicate to the localizer that it can delete the downloading resources by itself after cancelling the download tasks. Added a boolean flag in the proto for this.
3. After the localizer deletes the resources, it will send a last heartbeat to the NM. A boolean flag has been added in the proto to indicate this to the NM. On receiving this heartbeat, the NM will cancel the deletion task so that deletion is not also attempted by the NM. It is not a problem even if we attempt the deletion, because if nothing can be deleted the deletion task won't do anything; but if the deletion task can be cancelled, then why not.
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 18m 51s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 10 new or modified test files.
          +1 javac 7m 44s There were no new javac warning messages.
          +1 javadoc 10m 33s There were no new javadoc warning messages.
          +1 release audit 0m 25s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 1m 50s The applied patch generated 1 new checkstyle issues (total was 211, now 211).
          -1 checkstyle 2m 38s The applied patch generated 6 new checkstyle issues (total was 344, now 298).
          +1 whitespace 7m 19s The patch has no lines that end in whitespace.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 31s The patch built with eclipse:eclipse.
          +1 findbugs 4m 24s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 0m 23s Tests passed in hadoop-yarn-api.
          +1 yarn tests 2m 3s Tests passed in hadoop-yarn-common.
          +1 yarn tests 7m 55s Tests passed in hadoop-yarn-server-nodemanager.
              64m 23s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12755626/YARN-2902.06.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 7269906
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9107/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9107/console

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 19m 32s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 10 new or modified test files.
          +1 javac 7m 56s There were no new javac warning messages.
          +1 javadoc 10m 9s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 1m 48s The applied patch generated 1 new checkstyle issues (total was 211, now 211).
          -1 checkstyle 2m 36s The applied patch generated 10 new checkstyle issues (total was 344, now 306).
          +1 whitespace 6m 49s The patch has no lines that end in whitespace.
          +1 install 1m 25s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 4m 25s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 0m 23s Tests passed in hadoop-yarn-api.
          +1 yarn tests 2m 0s Tests passed in hadoop-yarn-common.
          +1 yarn tests 7m 47s Tests passed in hadoop-yarn-server-nodemanager.
              64m 3s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12755616/YARN-2902.06.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 7269906
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9105/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9105/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

Sorry, I haven't fixed checkstyle; that's related to the file being more than 2000 lines.

          varun_saxena Varun Saxena added a comment -

Fixed one related test case, and the checkstyle and whitespace issues.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 19m 47s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 9 new or modified test files.
          +1 javac 10m 10s There were no new javac warning messages.
          +1 javadoc 12m 3s There were no new javadoc warning messages.
          +1 release audit 0m 30s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 2m 25s The applied patch generated 1 new checkstyle issues (total was 211, now 211).
          -1 whitespace 7m 9s The patch has 10 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 47s mvn install still works.
          +1 eclipse:eclipse 0m 37s The patch built with eclipse:eclipse.
          +1 findbugs 4m 57s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 0m 27s Tests passed in hadoop-yarn-api.
          +1 yarn tests 2m 13s Tests passed in hadoop-yarn-common.
          -1 yarn tests 7m 56s Tests failed in hadoop-yarn-server-nodemanager.
              70m 56s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.nodemanager.TestLinuxContainerExecutorWithMocks
            hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12755609/YARN-2902.05.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 7269906
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/whitespace.txt
          hadoop-yarn-api test log https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-api.txt
          hadoop-yarn-common test log https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-common.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9103/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9103/console

          This message was automatically generated.

          r7raul JiJun Tang added a comment -

I encountered this issue.

Environment: hadoop 2.3.0-cdh5.0.2

          2015-08-05 12:11:18,275 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://xxxx/user/xxx/.staging/job_1433219182593_445774/libjars/zookeeper-3.4.5-cdh5.0.2.jar, 1436271746726, FILE, null },pending,[],1624036313203834,DOWNLOADING} with non-zero refcount /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAG
          
          varun_saxena Varun Saxena added a comment -

Yeah, let's go with this approach.

          jlowe Jason Lowe added a comment -

          But here the issue can be what if in between checks, localizer dies and PID is taken by some other process.

          The NM is tracking containers. If it can track containers then it can track localizers. There may be issues with PID recycling, but they're not specific to tracking localizers.

          One option would be to add a status in heartbeat asking localizer to cleanup(stop its downloading threads) and once that is done, indicate NM to do the deletion in another heartbeat.

That would be a bit more consistent, since the process creating the files is the process that cleans them up when it aborts. It also helps keep the disk clean if there ever were a rogue localizer that the NM doesn't know about; expecting the NM to clean up after a localizer when it has no idea what that localizer is doing is going to be difficult. However, we still have to handle the fallback case where the localizer simply crashes and something needs to clean up after it. Speaking of the localizer crashing, today the localizer can "go rogue" and stop heartbeating, and I don't think anything will detect this. In that situation we also fail to clean up, since we'll wait for a heartbeat that will never arrive.

          So how about the following approach:

          • ContainerLocalizer cleans up temporary files when processing the DIE command
          • We give the localizer a significant amount of time (e.g.: a minute or two?) to clean up on its own after receiving a DIE command
          • After the localizer exits (or if it does not exit after the grace period) then the NM cleans up the resources itself. We aren't tracking the PID explicitly, but LocalizerRunner can take actions when the localizer exits (e.g.: either process the requested deletions directly or send an event to a subsystem that will).

          With this approach rogue localizers will make some attempt to cleanup after themselves (which they don't do today), and we won't wait any longer than necessary before the NM tries to cleanup after well-behaved localizers or localizers that crash.
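For illustration only, here is a rough, self-contained sketch of that fallback flow. The interfaces and names below are hypothetical stand-ins, not the actual NM classes:

import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical abstractions standing in for the localizer process and the
// NM-side deletion machinery; these are illustrative, not the real YARN APIs.
interface LocalizerHandle {
  void sendDie();                                    // ask the localizer to clean up and exit
  boolean waitForExit(long timeout, TimeUnit unit)   // true if it exited within the timeout
      throws InterruptedException;
  List<Path> pathsInProgress();                      // destinations handed out so far
}

interface Deleter {
  void scheduleDelete(Path p);                       // e.g. a deletion-service stand-in
}

class GracefulLocalizerCleanup {
  private static final long GRACE_PERIOD_SECONDS = 120;  // "a minute or two"

  static void cleanup(LocalizerHandle localizer, Deleter deleter)
      throws InterruptedException {
    localizer.sendDie();
    // Give the localizer a grace period to remove its own partial downloads first.
    localizer.waitForExit(GRACE_PERIOD_SECONDS, TimeUnit.SECONDS);
    // Whether it exited cleanly, crashed, or is still hanging around after the
    // grace period, the NM now makes sure nothing is left behind on disk.
    for (Path dest : localizer.pathsInProgress()) {
      deleter.scheduleDelete(dest);
      deleter.scheduleDelete(dest.resolveSibling(dest.getFileName() + "_tmp"));
    }
  }
}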

          varun_saxena Varun Saxena added a comment -

          Thanks for the review Jason Lowe.

          I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today.

We do not track localizers from the executors. But the issue is: how do we track them? Get the PID of the localizer process and check whether the localizer has died? But here the issue can be: what if, in between checks, the localizer dies and its PID is taken by some other process?
We primarily want the localizer to die so that it doesn't download anything after we do the deletion.
One option would be to add a status in the heartbeat asking the localizer to clean up (stop its downloading threads) and, once that is done, indicate to the NM to do the deletion in another heartbeat. On this HB, the NM can do the deletion and the localizer can DIE on the HB response. Thoughts?
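(To make the heartbeat idea concrete, a minimal sketch of such an exchange. The CLEANUP action and the interfaces below are hypothetical; only DIE exists in this form today.)

// Hypothetical heartbeat actions for the proposed protocol.
enum HeartbeatAction { LIVE, CLEANUP, DIE }

class LocalizerHeartbeatLoop {
  interface Nm {                        // stand-in for the NM side of the protocol
    HeartbeatAction heartbeat(String localizerId, boolean cleanupDone);
  }
  interface Downloader {                // stand-in for the localizer's download threads
    void stopAndRemovePartialDownloads();
  }

  static void run(String localizerId, Nm nm, Downloader downloader)
      throws InterruptedException {
    boolean cleanupDone = false;
    while (true) {
      HeartbeatAction action = nm.heartbeat(localizerId, cleanupDone);
      switch (action) {
        case CLEANUP:                   // NM: "stop downloading, remove your temp files"
          downloader.stopAndRemovePartialDownloads();
          cleanupDone = true;           // reported back on the next heartbeat
          break;
        case DIE:                       // NM has done (or scheduled) its own deletions
          return;
        case LIVE:
        default:
          break;
      }
      Thread.sleep(1000);               // heartbeat interval (illustrative)
    }
  }
}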

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch, Varun!

          Is one second enough time for the localizer to tear down if the system is heavily loaded, disks are slow, etc.? I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today.

          There are a number of sleeps in the unit test which we should try to avoid if possible. Is there a reason dispatcher.await() isn't sufficient to avoid the races? At a minimum there should be a comment for each one explaining what we're trying to avoid by sleeping.

          Nit: I've always interpreted the debug delay to be a delay to execute in debugging just before the NM deletes a file. To be consistent it seems that we should be adding the debug delay to any requested delay. That way the NM will always preserve a file for debugDelay seconds beyond what an NM with debugDelay=0 seconds would do.

          Nit: The TODO in DeletionService about parent being owned by NM, etc. probably only needs to be in the delete method that actually does the work rather than duplicated in veneer methods.

          Nit: Should "Container killed while downloading" be "Container killed while localizing"? We use localizing elsewhere (e.g.: NM log UI when trying to get logs of a container that is still localizing).

          Nit: "Inorrect path for PRIVATE localization." should be "Incorrect path for PRIVATE localization: " to fix typo and add trailing space for subsequent filename. Missing a trailing space on the next log message as well. Realize this was just a pre-existing bug, but it would be nice to fix as part of moving the code.

          sjlee0 Sangjin Lee added a comment -

          I'm OK with this JIRA proceeding as is. We'll need to isolate the public resource case more, and it won't be too late to file a separate issue if we do that later.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 15m 47s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 37s There were no new javac warning messages.
          +1 javadoc 9m 36s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 37s The applied patch generated 9 new checkstyle issues (total was 168, now 138).
          +1 whitespace 0m 3s The patch has no lines that end in whitespace.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 13s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 6m 6s Tests passed in hadoop-yarn-server-nodemanager.
              43m 32s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12741309/YARN-2902.04.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 41ae776
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8326/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8326/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

Fixed checkstyle issues. A lot of the changes in ResourceLocalizationService#findNextResource are due to indentation issues reported by checkstyle. Hence, I had to change (re-indent) code which I had not written.

          varun_saxena Varun Saxena added a comment -

Sorry, I meant the below:
2. On heartbeat from the container localizer, if the localizer runner is already stopped, we can indicate to the container localizer to do the cleanup for the downloading resources.

          varun_saxena Varun Saxena added a comment -

Looking at the public localization code, I do not think public resources can be orphaned, because we do not stop their localization midway on container cleanup.
It's difficult to ascertain from the logs, though, why localization was failing for public resources in the scenario mentioned above. From whatever little I could look into the code, I could not find anything concrete that explains the failures.

Anyway, the scope of this JIRA, i.e. the orphaning of resources, would not apply to PUBLIC resources IMHO. And I guess there is no point in further delaying this JIRA hoping to find out what went wrong with public resources in the scenario above.

What's not clear to me is whether the trigger was the public localization timing out or the stopContainer request

The reference count can become 0 if the container is killed while downloading.

Coming to the patch, there are two approaches to handle this.

1. Cleanup for downloading resources can be done by the Localization Service while doing container cleanup.
2. On heartbeat from the container localizer, if the localizer runner is already stopped, we can indicate to the localizer runner to do the cleanup for the downloading resources.

The attached patch adopts approach 1.
Here, we wait for the container localizer to die before running the deletion tasks. Also, a downloading resource can be either in the local directory or in the local directory suffixed by _tmp, so we try both.
Moreover, a localization-failed event is sent to all the containers referring to a resource that is in the DOWNLOADING state.
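For illustration, a small sketch of the "notify every referencing container" part of approach 1. The class and method names below are made up for the example and are not the actual LocalResourcesTracker code:

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative bookkeeping only; not the real LocalResourcesTrackerImpl.
class AbandonedDownload {
  final Path localDir;                                  // destination given to the localizer
  final List<String> referencingContainers = new ArrayList<>();

  AbandonedDownload(Path localDir) { this.localDir = localDir; }

  interface Notifier { void localizationFailed(String containerId, Path rsrc); }

  // Called once deletion of <localDir> and <localDir>_tmp has been scheduled.
  void notifyFailure(Notifier notifier) {
    // Every container still waiting on this DOWNLOADING resource gets a
    // localization-failed event so it does not wait forever for LOCALIZED.
    for (String containerId : referencingContainers) {
      notifier.localizationFailed(containerId, localDir);
    }
    referencingContainers.clear();
  }
}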

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 15m 40s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 1 new or modified test files.
          +1 javac 7m 34s There were no new javac warning messages.
          +1 javadoc 9m 33s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 36s The applied patch generated 25 new checkstyle issues (total was 168, now 187).
          +1 whitespace 0m 4s The patch has no lines that end in whitespace.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 14s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 6m 24s Tests passed in hadoop-yarn-server-nodemanager.
              43m 37s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12741076/YARN-2902.03.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 445b132
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8309/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt
          hadoop-yarn-server-nodemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8309/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8309/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8309/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

Yeah, this has been pending for a long time. I primarily have to update the tests, but got taken over by other tasks. I don't have any 2.7.1-related JIRAs with me right now, so I will update the patch by this weekend.

          eepayne Eric Payne added a comment -

          Hi Varun Saxena. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be?

          varun_saxena Varun Saxena added a comment -

Wangda Tan, sorry for the delay. I was on a long leave and came back today. We are pretty clear on how to handle it for private resources (as per the comment you highlighted), but I hadn't updated the patch as I need to simulate and investigate further for public resources. I will check it and update ASAP.

leftnoteasy Wangda Tan added a comment -

Varun Saxena, could you update the patch as you mentioned in your earlier comment (https://issues.apache.org/jira/browse/YARN-2902?focusedCommentId=14336875&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336875)?
          sjlee0 Sangjin Lee added a comment -

          Sorry it took me a while to get to this. Here is an excerpt of the log when this happened:

          2015-03-05 00:20:17,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1418357586203_2035414 transitioned from INITING to RUNNING
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1418357586203_2035414_01_000486 transitioned from NEW to LOCALIZING
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1418357586203_2035414
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1418357586203_2035414
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1418357586203_2035414
          
          2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar transitioned from INIT to DOWNLOADING
          2015-03-05 00:20:17,537 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null }
          
          2015-03-05 00:28:44,388 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=alice  IP=10.53.186.122        OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1418357586203_2035414 CONTAINERID=container_1418357586203_2035414_01_000486
          
          2015-03-05 00:30:51,731 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} with non-zero refcount
          

          Around this time there are many public resources in the downloading state that generated this error. I do think these public resources were truly in the middle of being downloaded (yes this can happen). What's not clear to me is whether the trigger was the public localization timing out or the stopContainer request (see in the log). Sorry there is not much more information I can glean from the log. Let me know if you have more questions.

          FYI, we're running basically 2.4.0+.

          sjlee0 Sangjin Lee added a comment -

          We definitely see this (usually coupled with localization failures) with PUBLIC resources. I don't have easy access to the scenarios at the moment, but will be able to provide it next week.

          varun_saxena Varun Saxena added a comment -

          We still need to do this for APPLICATION resources. It is true that those resources will be cleaned up when the application finishes, but that could be hours or days later.

          Take your point. Will handle it.

          I think the cleanest approach is to make sure we send an event to the LocalizedResource when a container localizer

          Agree

          Will investigate further on PUBLIC resources by tomorrow and update if I find any flow where public resources can be orphaned.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch
          against trunk revision 1a68fc4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6736//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6736//console

          This message is automatically generated.

          jlowe Jason Lowe added a comment -

          We still need to do this for APPLICATION resources. It is true that those resources will be cleaned up when the application finishes, but that could be hours or days later. And as for PUBLIC resources, Sangjin confirmed earlier he's seen the orphaning occur with those resources as well, so it must be occurring somehow even for those. Sangjin Lee do you have any ideas on how PUBLIC resources ended up hung in a DOWNLOADING state? I'm wondering if this is specific to the shared cache setup or if there's a code path we're missing.

          I don't think we should special case the resource types to fix this. Again I think the cleanest approach is to make sure we send an event to the LocalizedResource when a container localizer (or maybe just the container itself) is killed, and let that state machine handle it appropriately (e.g.: try to remove the _tmp file if the resource was in the downloading state, ignore it if it's already localized, etc.).

          varun_saxena Varun Saxena added a comment -

Jason Lowe, I looked into it. I was able to simulate the issue as well for PRIVATE resources.
I think we need to handle this only for PRIVATE resources. APPLICATION resources will be cleaned up when the application finishes. And PUBLIC resources should not remain orphaned, as we do not kill or stop the PublicLocalizer in between.

To download a resource, FSDownload appends _tmp to the end of the directory the resource will be downloaded to.
And while processing a HB from the ContainerLocalizer, the NM sends a destination path for the resource to be downloaded in the response.
We also download one resource at a time.

So, we can store this destination path in a queue in LocalizerRunner whenever we send a new path for download, and remove it when the fetch is successful. When the container is killed (which causes the LocalizerRunner to be cleaned up), we can fetch the path from the front of the queue and submit the associated temp path for deletion to the DeletionService, if the ref count for the resource is 0.

We cannot do this cleanup in the ContainerLocalizer, as the LCE launches it as a new process and kills it when the LocalizerRunner is interrupted.
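A rough, self-contained sketch of that bookkeeping (the names are hypothetical, not the real LocalizerRunner):

import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the "queue of destination paths" idea.
class PendingDownloadTracker {
  private final Deque<Path> scheduledPaths = new ArrayDeque<>();

  /** Remember the destination handed out in the heartbeat response. */
  synchronized void pathScheduled(Path dest) {
    scheduledPaths.addLast(dest);
  }

  /** The localizer reported a successful fetch for the oldest outstanding path. */
  synchronized void fetchSucceeded() {
    scheduledPaths.pollFirst();
  }

  interface Deleter { void scheduleDelete(String user, Path p); }

  /** Container was killed: clean up whatever may have been left half-downloaded. */
  synchronized void onContainerKilled(String user, Deleter deleter) {
    // Only the head can still be in flight (one resource is downloaded at a time),
    // but draining everything is harmless since completed entries were removed.
    for (Path dest : scheduledPaths) {
      deleter.scheduleDelete(user, dest);
      deleter.scheduleDelete(user, dest.resolveSibling(dest.getFileName() + "_tmp"));
    }
    scheduledPaths.clear();
  }
}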

          varun_saxena Varun Saxena added a comment -

Yes... Sorry, but I have been busy for the last couple of weeks. I will update by this weekend.

          shahrs87 Rushabh S Shah added a comment -

Varun Saxena: are you still working on this JIRA?

          sjlee0 Sangjin Lee added a comment -

          I've seen this with PUBLIC resources too, the use case provided by the shared cache (YARN-1492).

          jlowe Jason Lowe added a comment -

          Container cleanup is called which deletes the container related directory which is used for localization.

          Cleaning up a container-related directory has no effect on cleaning up a distributed cache resource. Resources are not localized to a container-specific directory. They are either localized to a public directory (for PUBLIC resources), a user-specific directory (for PRIVATE resources), or an application-specific directory (for APPLICATION resources). Deleting the container directory will only delete symlinks to the resources – it will never delete the actual resources on disk. Also note that stopping the LocalizerRunner thread doesn't cleanup the disk state of resources in progress, so there's some extra work needed there as well.

          Was it a PUBLIC or PRIVATE resource which was left in the DOWNLOADING state ?

          I've definitely seen PRIVATE resources in this state, although it's possible there could have been some PUBLIC ones as well.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, I looked into the code.

When a container is killed in the LOCALIZING state, container cleanup is called, which deletes the container-related directory used for localization.
The associated LocalizerRunner thread is also stopped. Probably at this stage the resource should move out of the DOWNLOADING state as well.
          Refer to ResourceLocalizationService#handleCleanupContainerResources

          Was it a PUBLIC or PRIVATE resource which was left in the DOWNLOADING state ?

          varun_saxena Varun Saxena added a comment -

Thanks for the review, Jason Lowe. I will look into how to delete the file from the file system.

          jlowe Jason Lowe added a comment -

          Thanks for the patch, Varun!

          I think the patch will prevent us from leaking the bookkeeping in resource trackers for resources in the downloading state, but it relies on the periodic retention checking and doesn't address the leaking of data on the disk. The localizer has probably created a partial *_tmp file/dir for the download that didn't complete, and we should be cleaning that up as well. As is we won't try to clean up any leaked DOWNLOADING resource until the retention process runs (on the order of tens of minutes), but we shouldn't need to wait around to reap resources that aren't really downloading.

I haven't had time to work this all the way through, but I'm wondering if we're patching the symptoms rather than the root cause. The resource is lingering around in the DOWNLOADING state because a container was killed and we then "forgot" the corresponding localizer that was associated with the container. When the localizer later heartbeats in, the NM tells the unknown localizer to DIE, and that ultimately is what leads to a resource lingering around in the DOWNLOADING state. I think we should be properly cleaning up localizers corresponding to killed containers and sending appropriate events to the LocalizedResources. This will then cause the resources to transition out of the DOWNLOADING state to something appropriate, sending the proper events to any other containers that are pending on that resource. At that point we can also clean up any leaked _tmp files/dirs from the failed/killed localizer.
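Purely as an illustration of that idea, a toy version of the state handling; the real LocalizedResource is a YARN state machine and the names below are invented for the sketch:

import java.nio.file.Path;

// Toy model of "react to the localizer for this resource being killed".
class ResourceCleanupOnKill {
  enum State { INIT, DOWNLOADING, LOCALIZED, FAILED }

  interface Deleter { void scheduleDelete(Path p); }

  /** Returns the state the resource should end up in after the localizer is killed. */
  static State onLocalizerKilled(State current, Path downloadDir, Deleter deleter) {
    switch (current) {
      case DOWNLOADING:
        // Remove the partial *_tmp download left behind by the killed localizer.
        deleter.scheduleDelete(
            downloadDir.resolveSibling(downloadDir.getFileName() + "_tmp"));
        return State.FAILED;   // pending containers then get a localization-failed event
      case LOCALIZED:
        return current;        // already usable, nothing to clean up
      default:
        return current;
    }
  }
}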

          varun_saxena Varun Saxena added a comment -

          Thanks Jason Lowe

          jlowe Jason Lowe added a comment -

          Sorry for the delay, Varun, as I was busy with end-of-year items and vacation. I'll try to get to this by the end of the week.

          varun_saxena Varun Saxena added a comment -

Kindly review this. It has been pending for a long time.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review the patch.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch
          against trunk revision 8963515.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6031//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6031//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch
          against trunk revision 1b3bb9e.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6026//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6026//console

          This message is automatically generated.

          varun_saxena Varun Saxena added a comment -

Looking further at the code, I think we can call LocalResourcesTrackerImpl#remove irrespective of whether the cache target size decided by the yarn.nodemanager.localizer.cache.target-size-mb config has been reached or not.
This is because, in the piece of code below (which is called from ResourceLocalizationService#handleCacheCleanup), we check the cumulative size of all the resources (which have a reference count of 0) against the cache target size. For a resource whose state is DOWNLOADING, the call to LocalizedResource#getSize will always return -1, because the size is only updated once the state changes to LOCALIZED. In other words, a DOWNLOADING resource never contributes to the cumulative size used for the target-size check.

          ResourceRetentionSet.java
public void addResources(LocalResourcesTracker newTracker) {
  for (LocalizedResource resource : newTracker) {
    currentSize += resource.getSize();
    if (resource.getRefCount() > 0) {
      // always retain resources in use
      continue;
    }
    retain.put(resource, newTracker);
  }
  for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
         retain.entrySet().iterator();
       currentSize - delSize > targetSize && i.hasNext();) {
    Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
    LocalizedResource resource = rsrc.getKey();
    LocalResourcesTracker tracker = rsrc.getValue();
    if (tracker.remove(resource, delService)) {
      delSize += resource.getSize();
      i.remove();
    }
  }
}
          
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12685562/YARN-2902.patch
          against trunk revision e227fb8.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6024//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6024//console

          This message is automatically generated.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, kindly review.

          jlowe Jason Lowe added a comment -

          This resource leak can be seen in the NM log when the cache cleaner runs its cycle and tries to delete it. The NM log message will look something like this:

          2014-11-20 23:34:49,331 [AsyncDispatcher event handler] ERROR localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://x:x/x/x/x, 1416461289970, FILE, null },pending,[],11086940691251746,DOWNLOADING} with non-zero refcount
          

And that log message will continue to appear during subsequent cache cleanup cycles.


            People

• Assignee: varun_saxena Varun Saxena
• Reporter: jlowe Jason Lowe
• Votes: 0
• Watchers: 24
