Hadoop YARN / YARN-4354

Public resource localization fails with NPE

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      I saw public localization on nodemanagers get stuck because the thread pool executor was constantly rejecting requests.

      1. YARN-4354.001.patch
        6 kB
        Jason Lowe
      2. YARN-4354.002.patch
        7 kB
        Jason Lowe
      3. YARN-4354-branch-2.7.002.patch
        7 kB
        Jason Lowe
      4. YARN-4354-unittest.patch
        3 kB
        Jason Lowe

        Issue Links

          Activity

          jlowe Jason Lowe added a comment -

          Sample stacktrace:

          java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ExecutorCompletionService$QueueingFuture@4e2b9db3 rejected from java.util.concurrent.ThreadPoolExecutor@467b3667[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 102]
                  at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
                  at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
                  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
                  at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:816)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:704)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:646)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
                  at java.lang.Thread.run(Thread.java:745)
          

          The thread pool was shut down due to an earlier NPE:

          2015-11-13 02:09:56,944 [Public Localizer] FATAL localizer.ResourceLocalizationService: Error: Shutting down
          java.lang.NullPointerException
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:174)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:56)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:862)
          2015-11-13 02:09:56,944 [Public Localizer] INFO localizer.ResourceLocalizationService: Public cache exiting
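The link between the two traces above can be reproduced with plain JDK classes. The following is a minimal sketch, not NodeManager code: the class and method names are hypothetical, and `shutdownNow()` stands in for whatever the Public Localizer thread does when it hits the fatal error. It only shows that an `ExecutorCompletionService` backed by a shut-down pool with the default rejection policy rejects every later submit.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; not part of Hadoop.
class RejectedAfterShutdownDemo {

  /** Returns true if a submit issued after the pool was shut down is rejected. */
  static boolean submitAfterShutdownIsRejected() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ExecutorCompletionService<String> queue = new ExecutorCompletionService<>(pool);

    // Assumption: stands in for the localizer thread shutting the pool down
    // after an unexpected exception.
    pool.shutdownNow();

    try {
      // Every request that arrives afterwards goes through the dead pool.
      queue.submit(() -> "downloaded");
      return false;
    } catch (RejectedExecutionException e) {
      // The default AbortPolicy throws, matching the first stack trace.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected = " + submitAfterShutdownIsRejected());
  }
}
```

Once the pool is terminated, nothing recreates it in this sketch, which is consistent with the localization requests being rejected constantly rather than once.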
          
          jlowe Jason Lowe added a comment -

          I believe this was caused by YARN-2902. A resource was just localized, but the resource is missing. That normally doesn't occur. However, after YARN-2902 a resource can be yanked out while it is still downloading if a container releases it and the refcount drops to zero. So if a public resource is requested by a container and that container is killed before the localization completes, we can get a localized event for a missing resource and hit the NPE.

          We should not be removing a resource if the localization will still complete; otherwise we risk not only the NPE but also leaking the local files.
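The sequence described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the real `LocalResourcesTrackerImpl`: the `Tracker`, `Resource`, and method names are hypothetical, and the map stands in for the tracker's resource table. It shows a release removing a still-downloading entry, followed by a late "localized" notification dereferencing the missing entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the race; not Hadoop code.
class ReleaseWhileDownloadingDemo {
  enum State { DOWNLOADING, LOCALIZED }

  static final class Resource {
    State state = State.DOWNLOADING;
    int refCount = 0;
  }

  static final class Tracker {
    final Map<String, Resource> resources = new HashMap<>();

    void request(String key) {
      resources.computeIfAbsent(key, k -> new Resource()).refCount++;
    }

    // Models the behavior attributed to YARN-2902: drop the entry when it is
    // still downloading and nothing references it any more.
    void release(String key) {
      Resource r = resources.get(key);
      if (r == null) {
        return;
      }
      r.refCount--;
      if (r.state == State.DOWNLOADING && r.refCount <= 0) {
        resources.remove(key);
      }
    }

    // No null check: assumes the entry is always present.
    void localized(String key) {
      Resource r = resources.get(key);
      r.state = State.LOCALIZED; // throws NullPointerException if removed
    }
  }

  /** Returns true if the late localized event hits a NullPointerException. */
  static boolean lateLocalizedEventThrowsNpe() {
    Tracker tracker = new Tracker();
    String key = "hdfs:///public/lib.jar"; // illustrative path only
    tracker.request(key);   // container asks for the public resource
    tracker.release(key);   // container is killed while the download runs
    try {
      tracker.localized(key); // the public download finishes anyway
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("NPE on late event = " + lateLocalizedEventThrowsNpe());
  }
}
```

In this model the downloaded files would also be orphaned, since the entry that tracked them is gone by the time the download completes.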

          jlowe Jason Lowe added a comment -

          Attaching a unit test I hacked together that demonstrates the problem as I understand it.

          varun_saxena Varun Saxena added a comment -

          Jason Lowe, I think you are correct. The code below, added in YARN-2902, causes the problem.
          The Public Localizer will continue downloading the resource, unlike the localizer for private resources, which exits because a DIE is issued.
          As you said, because of the addition below, the resource is removed if its reference count is 0, but for a PUBLIC resource a LOCALIZED event may arrive even after the container has been killed. This won't happen for private resources, though.

              // Remove the resource if its downloading and its reference count has
              // become 0 after RELEASE. This maybe because a container was killed while
              // localizing and no other container is referring to the resource.
              if (event.getType() == ResourceEventType.RELEASE) {
                if (rsrc.getState() == ResourceState.DOWNLOADING &&
                    rsrc.getRefCount() <= 0) {
                  removeResource(req);
                }
              }
          

          I think a check for resource visibility should suffice. What do you think?

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Marking this as a blocker for 2.7.2.

          jlowe Jason Lowe added a comment -

          Looks like this can cause nodemanagers to crash as well:

          2015-11-13 17:22:51,063 [AsyncDispatcher event handler] FATAL event.AsyncDispatcher: Error in dispatcher thread
          java.lang.NullPointerException
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.getPathForLocalization(LocalResourcesTrackerImpl.java:448)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:802)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:704)
                  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:646)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
                  at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
                  at java.lang.Thread.run(Thread.java:745)
          

          I think it was trying to look up a resource that it assumed was still there but had been removed.

          "I think a check for resource visibility should suffice. What do you think?"

          What worries me about that approach is that if we somehow allowed a heartbeat from a localizer to come in just after we cleaned up a resource because a container happened to be released, we would get the same kind of badness if the localization completed just after we removed it. We may still want a null check just in case we get a late event for a removed resource.

          varun_saxena Varun Saxena added a comment -

          Makes sense. Better to be safe and have an explicit null check in addition to the check for public resources.
          We can print a log message if rsrc is not found.

          jlowe Jason Lowe added a comment -

          Patch that implements the visibility check to avoid executing the logic for public resources, along with the null resource check in case we somehow get a late event for a resource that was already removed.
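The two guards discussed in this thread can be sketched as follows. This is a simplified model and not the attached patch: the `Tracker`, `Resource`, and `Visibility` types are hypothetical stand-ins, and the log line is only illustrative. It shows a release that leaves public resources in place while they download, plus a null check that turns a late event for a removed resource into a logged no-op.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the guarded tracker; not Hadoop code.
class GuardedTrackerDemo {
  enum State { DOWNLOADING, LOCALIZED }
  enum Visibility { PUBLIC, PRIVATE }

  static final class Resource {
    final Visibility visibility;
    State state = State.DOWNLOADING;
    int refCount = 0;

    Resource(Visibility visibility) {
      this.visibility = visibility;
    }
  }

  static final class Tracker {
    final Map<String, Resource> resources = new HashMap<>();

    void request(String key, Visibility visibility) {
      resources.computeIfAbsent(key, k -> new Resource(visibility)).refCount++;
    }

    void release(String key) {
      Resource r = resources.get(key);
      if (r == null) {
        return;
      }
      r.refCount--;
      // Visibility check: a public download keeps running after the
      // requesting container is gone, so its entry must stay.
      if (r.visibility != Visibility.PUBLIC
          && r.state == State.DOWNLOADING
          && r.refCount <= 0) {
        resources.remove(key);
      }
    }

    /** Returns false (after logging) when the resource is no longer tracked. */
    boolean localized(String key) {
      Resource r = resources.get(key);
      if (r == null) {
        // Null check: tolerate a late event for a removed resource.
        System.err.println("Ignoring localized event for unknown resource " + key);
        return false;
      }
      r.state = State.LOCALIZED;
      return true;
    }
  }

  public static void main(String[] args) {
    Tracker tracker = new Tracker();
    tracker.request("pub", Visibility.PUBLIC);
    tracker.release("pub");
    System.out.println("public survives = " + tracker.localized("pub"));
  }
}
```

Either guard alone avoids the NPE in this model; keeping both covers the case raised above where a late event arrives for a resource that was removed for some other reason.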

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 5s    docker + precommit patch detected.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  3m 0s    trunk passed
          +1    compile     0m 22s   trunk passed with JDK v1.8.0_60
          +1    compile     0m 23s   trunk passed with JDK v1.7.0_79
          +1    checkstyle  0m 10s   trunk passed
          +1    mvnsite     0m 25s   trunk passed
          +1    mvneclipse  0m 12s   trunk passed
          +1    findbugs    0m 56s   trunk passed
          +1    javadoc     0m 18s   trunk passed with JDK v1.8.0_60
          +1    javadoc     0m 22s   trunk passed with JDK v1.7.0_79
          +1    mvninstall  0m 23s   the patch passed
          +1    compile     0m 21s   the patch passed with JDK v1.8.0_60
          +1    javac       0m 21s   the patch passed
          +1    compile     0m 23s   the patch passed with JDK v1.7.0_79
          +1    javac       0m 23s   the patch passed
          +1    checkstyle  0m 10s   the patch passed
          +1    mvnsite     0m 24s   the patch passed
          +1    mvneclipse  0m 13s   the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    findbugs    1m 3s    the patch passed
          +1    javadoc     0m 17s   the patch passed with JDK v1.8.0_60
          +1    javadoc     0m 24s   the patch passed with JDK v1.7.0_79
          -1    unit        8m 35s   hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_60.
          -1    unit        9m 0s    hadoop-yarn-server-nodemanager in the patch failed with JDK v1.7.0_79.
          +1    asflicense  0m 21s   Patch does not generate ASF License warnings.
                            28m 48s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
          JDK v1.7.0_79 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-13
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12772275/YARN-4354.001.patch
          JIRA Issue YARN-4354
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5d9136129aa2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-fa12328/precommit/personality/hadoop.sh
          git revision trunk / f94d892
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9684/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9684/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.7.0_79.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9684/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-YARN-Build/9684/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.7.0_79.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9684/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 227MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9684/console

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Fix for unit test failure.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 13s   docker + precommit patch detected.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
          +1    mvninstall  3m 56s   trunk passed
          +1    compile     0m 33s   trunk passed with JDK v1.8.0_60
          +1    compile     0m 28s   trunk passed with JDK v1.7.0_79
          +1    checkstyle  0m 13s   trunk passed
          +1    mvnsite     0m 30s   trunk passed
          +1    mvneclipse  0m 16s   trunk passed
          +1    findbugs    1m 9s    trunk passed
          +1    javadoc     0m 26s   trunk passed with JDK v1.8.0_60
          +1    javadoc     0m 28s   trunk passed with JDK v1.7.0_79
          +1    mvninstall  0m 28s   the patch passed
          +1    compile     0m 28s   the patch passed with JDK v1.8.0_60
          +1    javac       0m 28s   the patch passed
          +1    compile     0m 28s   the patch passed with JDK v1.7.0_79
          +1    javac       0m 28s   the patch passed
          +1    checkstyle  0m 12s   the patch passed
          +1    mvnsite     0m 29s   the patch passed
          +1    mvneclipse  0m 15s   the patch passed
          +1    whitespace  0m 0s    Patch has no whitespace issues.
          +1    findbugs    1m 18s   the patch passed
          +1    javadoc     0m 26s   the patch passed with JDK v1.8.0_60
          +1    javadoc     0m 27s   the patch passed with JDK v1.7.0_79
          -1    unit        9m 48s   hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_60.
          +1    unit        9m 45s   hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_79.
          +1    asflicense  0m 28s   Patch does not generate ASF License warnings.
                            33m 59s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-13
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12772286/YARN-4354.002.patch
          JIRA Issue YARN-4354
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 225cbeefebd4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-fa12328/precommit/personality/hadoop.sh
          git revision trunk / f94d892
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9685/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_60.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9685/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_60.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9685/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Max memory used 229MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9685/console

          This message was automatically generated.

          varun_saxena Varun Saxena added a comment -

          +1, patch LGTM

          eepayne Eric Payne added a comment -

          +1

          Thanks Jason for catching and fixing this.

          I also verified that the new test (TestLocalResourcesTrackerImpl#testReleaseWhileDownloading) passes with the fix and NPEs without it.

          And, I ran TestResourceLocalizationService (the above test that is failing) in my local build environment and it passes for me.

          brahmareddy Brahma Reddy Battula added a comment -

          Jason Lowe, nice catch. +1 (non-binding).

          djp Junping Du added a comment -

          +1. Patch LGTM. Will commit it shortly.

          "Looks like this can cause nodemanagers to crash as well."

          To make the NM more robust, I think we should tolerate this kind of failure/exception in LocalResourcesTracker rather than letting it crash the NM's dispatcher and exit. Maybe we can give LocalResourcesTracker a separate AsyncDispatcher with "DISPATCHER_EXIT_ON_ERROR_KEY" set to false, like what we do in the RM for SchedulerEventDispatcher?
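The trade-off behind that suggestion can be illustrated with a toy dispatch loop. This is a sketch under stated assumptions, not YARN's `AsyncDispatcher`: the `dispatchAll` helper and its `exitOnError` parameter are hypothetical and only mimic the idea of an exit-on-error setting. With the flag on, one failing handler stops all later events; with it off, the failure is logged and dispatching continues.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical dispatcher model; not Hadoop code.
class TolerantDispatcherDemo {

  /** Dispatches events in order and returns how many were handled successfully. */
  static <E> int dispatchAll(List<E> events, Consumer<E> handler, boolean exitOnError) {
    int handled = 0;
    for (E event : events) {
      try {
        handler.accept(event);
        handled++;
      } catch (RuntimeException e) {
        if (exitOnError) {
          // Models a dispatcher that gives up on the first handler failure.
          break;
        }
        // Tolerant mode: record the failure and keep serving later events.
        System.err.println("Handler failed for event " + event + ": " + e);
      }
    }
    return handled;
  }

  /** Runs three events where the middle one makes the handler throw an NPE. */
  static int run(boolean exitOnError) {
    List<String> events = Arrays.asList("request-a", "late-localized", "request-b");
    Consumer<String> handler = event -> {
      if (event.equals("late-localized")) {
        throw new NullPointerException("resource already removed");
      }
    };
    return dispatchAll(events, handler, exitOnError);
  }

  public static void main(String[] args) {
    System.out.println("exit on error: handled " + run(true));
    System.out.println("tolerant:      handled " + run(false));
  }
}
```

Tolerating the exception keeps the process alive, but it can also hide state that is already inconsistent, so it complements the null check rather than replacing it.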

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8805 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8805/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          djp Junping Du added a comment -

          I have committed the 002 patch to trunk, branch-2 and branch-2.7. Thanks to Jason Lowe for the patch and to Varun, Brahma and Eric for the review!

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #1404 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1404/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #668 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/668/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #680 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/680/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2609 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2609/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #607 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/607/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2544 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2544/)
          YARN-4354. Public resource localization fails with NPE. Contributed by (junping_du: rev 855d52927b6115e2cfbd97a94d6c1a3ddf0e94bb)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
          jlowe Jason Lowe added a comment -

          To make the NM more robust, I think we should tolerate this kind of failure/exception in LocalResourcesTracker rather than letting the NM's dispatcher crash and exit. Maybe we can give LocalResourcesTracker a separate AsyncDispatcher that sets "DISPATCHER_EXIT_ON_ERROR_KEY" to false, like what we do in the RM for SchedulerEventDispatcher?

          I don't think there's anything magical about localization vs. the other things the NM is doing. The async dispatcher will only exit if an exception leaks up to the top, and when that happens it's a programming error, since the code didn't handle the exception properly. If we're willing for NPEs in localization to not take down the NM, why aren't we willing to do the same if it happens in another NM subsystem that also uses the AsyncDispatcher? IMHO we should be consistent about how unexpected exceptions are handled.

          jlowe Jason Lowe added a comment -

          The commit to branch-2.7 broke the build because the LocalResourcesTrackerImpl constructor arguments are different than in branch-2. Attached is the version of the patch I committed to branch-2.7.

          djp Junping Du added a comment -

          Sorry for the mistake. I was paying more attention to other conflicts than to this change...
          Thanks to Jason Lowe for fixing this.

          djp Junping Du added a comment -

          I don't think there's anything magical about localization vs. the other things the NM is doing. The async dispatcher will only exit if an exception leaks up to the top, and when it does that's a programming error since it doesn't handle an exception properly.

          I agree there is not much difference overall. However, back to this case: from a user's perspective, an occasional NPE during localization of a resource that is being cancelled would be better ignored (but logged) than allowed to crash the NM. The price of ignoring the exception here is potentially leaking a half-localized file (which could be removed later), but the gain is that the NM survives and keeps working. We should at least offer this trade-off to the user as a configurable choice, shouldn't we?

          If we're willing for NPEs in localization to not take down the NM, why aren't we willing to do the same if it happens in another NM subsystem that also uses the AsyncDispatcher? IMHO we should be consistent about how unexpected exceptions are handled.

          I am not against keeping localization event handling consistent with the other subsystems, but I am not sure whether ignoring other exceptional events could leave the NM in a bad state. I think that is the motivation for separating SchedulerEventDispatcher from the RM dispatcher for general events, with different settings/behavior. No?

          jlowe Jason Lowe added a comment -

          I committed the 2.7 patch to branch-2.7.2 as well, since it was missing from that release branch.

          I am not against keeping localization event handling consistent with the other subsystems, but I am not sure whether ignoring other exceptional events could leave the NM in a bad state.

          From my perspective, any exception that escapes to the AsyncDispatcher level is capable of leaving the NM in a bad state. Since it has escaped, we don't know where it occurred or what we were trying to do at the time. That's why I think it's a bit dangerous to assume that the decisions we make from that bad state are better than crashing. Anyway, if we want to do this, we should take up the discussion in a JIRA targeting that feature.
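
          To make the two positions in this thread concrete, here is a minimal sketch of the trade-off. It is not Hadoop's AsyncDispatcher: the class SketchDispatcher, its exitOnError flag and the string events are all hypothetical stand-ins, and the flag only loosely mirrors the idea behind "DISPATCHER_EXIT_ON_ERROR_KEY". With the flag on, an exception that escapes a handler stops the dispatcher; with it off, the error is logged and later events are still processed, possibly from an inconsistent state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, simplified dispatcher used only for illustration. */
class SketchDispatcher {
  // Assumed stand-in for an "exit on error" setting.
  private final boolean exitOnError;
  // Assumed stand-in for a registered event handler.
  private final Consumer<String> handler;
  private final List<String> errorLog = new ArrayList<>();
  private boolean stopped = false;

  SketchDispatcher(boolean exitOnError, Consumer<String> handler) {
    this.exitOnError = exitOnError;
    this.handler = handler;
  }

  /** Returns true if the event was handed to the handler, false if already stopped. */
  boolean dispatch(String event) {
    if (stopped) {
      return false;
    }
    try {
      handler.accept(event);
    } catch (RuntimeException e) {
      // The exception escaped the handler: always record it.
      errorLog.add("Error handling event " + event + ": " + e);
      if (exitOnError) {
        // A real daemon would shut down here; the sketch just stops dispatching.
        stopped = true;
      }
    }
    return true;
  }

  boolean isStopped() {
    return stopped;
  }

  List<String> getErrorLog() {
    return errorLog;
  }
}
```

          Note that in the lenient mode the dispatcher has no idea what the failing handler left half-done, which is the risk raised above; the sketch only shows the control flow, not a recommendation for either setting.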

          djp Junping Du added a comment -

          Given that YARN-2902 was just committed to branch-2.6, I have cherry-picked this patch into branch-2.6 as well.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9060 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9060/)
          Add YARN-2975, YARN-3893, YARN-2902 and YARN-4354 to Release 2.6.4 entry (junping_du: rev b6c9d3fab9c76b03abd664858f64a4ebf3c2bb20)

          • hadoop-yarn-project/CHANGES.txt

            People

            • Assignee: Jason Lowe (jlowe)
            • Reporter: Jason Lowe (jlowe)
            • Votes: 0
            • Watchers: 19
