Hadoop YARN / YARN-6837

Null LocalResource visibility or resource type can crash the nodemanager

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha4
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.8.2
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      While writing a YARN application, I created a LocalResource like this:

      LocalResource resource = Records.newRecord(LocalResource.class);

      Because I forgot to set its visibility, the job failed when I submitted it.
      At the same time, however, the NodeManagers shut down one by one, and there was a NullPointerException in the NodeManager's log:

      2017-07-18 17:54:09,289 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop IP=10.43.156.177 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1499221670783_0067 CONTAINERID=container_1499221670783_0067_02_000003
      2017-07-18 17:54:09,292 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.addResources(ResourceSet.java:84)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:868)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:819)
      at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
      at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1684)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1418)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1411)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      at java.lang.Thread.run(Thread.java:745)
      2017-07-18 17:54:09,292 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1499221670783_0067_02_000002 by user hadoop

      Then I changed my code but still set the visibility to null:

      LocalResource resource = LocalResource.newInstance(
      URL.fromURI(dst.toUri()),
      LocalResourceType.FILE, null,
      fileStatus.getLen(), fileStatus.getModificationTime());

      The error still happened.

      Finally, I set the visibility to a correct value, and the error no longer occurred.
      So I think a null LocalResource visibility causes the NodeManager to shut down.

      1. YARN-6837.patch
        2 kB
        Jinjiang Ling
      2. YARN-6837-1.patch
        12 kB
        Jinjiang Ling

          Activity

          lingjinjiang Jinjiang Ling added a comment -

          Attach a patch to avoid this error.

          jlowe Jason Lowe added a comment -

          Thanks for the report and the patch! Looking at the patch, I'm not a fan of letting an NPE occur then catching it and assuming we know where the NPE came from. It's error prone for maintenance since someone could accidentally introduce another NPE problem and then we are catching and suppressing for the wrong reason making things harder to debug.

          Speaking of suppressing exceptions, this simply logs a warning when we have no visibility, but then it just continues. What will happen to the resource after that? It doesn't look like we add it to any localizer list, and therefore I think the container will just hang waiting for a resource that will never be localized.

          A better way to handle this is to sanity-check the container launch request in ContainerManagerImpl#startContainerInternal and throw an exception if the request is malformed. This has the benefit of propagating the error back to the client who is making the bad request so they know both that the request was bad and the corresponding container will not be launched. This looks similar to YARN-6403, and the resource visibility was missed in that change.
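
The validate-up-front approach suggested above can be sketched as follows. This is only an illustrative, self-contained sketch and not the committed patch: `Resource`, `ResourceType`, `Visibility`, `InvalidRequestException`, `firstProblem`, and `validate` are hypothetical stand-ins for the YARN records and for whatever check ContainerManagerImpl performs. The idea is to reject a malformed request with a descriptive error before any event reaches the dispatcher, instead of catching a NullPointerException later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LaunchRequestValidation {
    // Hypothetical stand-ins for the YARN enums; these are not the real classes.
    public enum ResourceType { FILE, ARCHIVE, PATTERN }
    public enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    // Hypothetical stand-in for a LocalResource record. Either field may be
    // null, as with a record whose setters were never called.
    public static final class Resource {
        final ResourceType type;
        final Visibility visibility;

        public Resource(ResourceType type, Visibility visibility) {
            this.type = type;
            this.visibility = visibility;
        }
    }

    // Hypothetical exception standing in for the error returned to the client.
    public static final class InvalidRequestException extends RuntimeException {
        public InvalidRequestException(String message) {
            super(message);
        }
    }

    // Returns a description of the first malformed resource, or null if the
    // whole map is well-formed.
    public static String firstProblem(Map<String, Resource> localResources) {
        for (Map.Entry<String, Resource> entry : localResources.entrySet()) {
            Resource resource = entry.getValue();
            if (resource == null) {
                return "Null resource for key: " + entry.getKey();
            }
            if (resource.type == null) {
                return "Null resource type for key: " + entry.getKey();
            }
            if (resource.visibility == null) {
                return "Null resource visibility for key: " + entry.getKey();
            }
        }
        return null;
    }

    // Fails fast so the caller learns the request was bad and that no
    // container will be started for it.
    public static void validate(Map<String, Resource> localResources) {
        String problem = firstProblem(localResources);
        if (problem != null) {
            throw new InvalidRequestException(problem);
        }
    }

    public static void main(String[] args) {
        Map<String, Resource> request = new LinkedHashMap<>();
        request.put("job.jar", new Resource(ResourceType.FILE, Visibility.APPLICATION));
        request.put("data.txt", new Resource(ResourceType.FILE, null)); // visibility forgotten

        try {
            validate(request);
            System.out.println("request accepted");
        } catch (InvalidRequestException e) {
            System.out.println("request rejected: " + e.getMessage());
        }
    }
}
```

Because the check names the offending key and field, the client sees exactly which part of the request was malformed, and an unrelated NullPointerException elsewhere is not silently swallowed.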

          lingjinjiang Jinjiang Ling added a comment -

          Jason Lowe Thanks for your suggestion.
          It seems that YARN-6403 is similar to this error. I also found that a null resource type causes the same error, so I added checks for both the resource visibility and the resource type after the existing check.

          Update patch.

          jlowe Jason Lowe added a comment -

          Thanks for updating the patch! Looks good to me, moving this to Patch Available so Jenkins can comment on the patch.

          hadoopqa Hadoop QA added a comment -
          -1 overall

          Vote  Subsystem    Runtime   Comment
          0     reexec       0m 19s    Docker mode activated.
                Prechecks
          +1    @author      0m 0s     The patch does not contain any @author tags.
          +1    test4tests   0m 0s     The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          0     mvndep       0m 11s    Maven dependency ordering for branch
          +1    mvninstall   14m 38s   trunk passed
          +1    compile      9m 38s    trunk passed
          +1    checkstyle   0m 54s    trunk passed
          +1    mvnsite      1m 14s    trunk passed
          -1    findbugs     0m 48s    hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager in trunk has 5 extant Findbugs warnings.
          +1    javadoc      0m 57s    trunk passed
                Patch Compile Tests
          0     mvndep       0m 11s    Maven dependency ordering for patch
          +1    mvninstall   0m 52s    the patch passed
          +1    compile      5m 23s    the patch passed
          +1    javac        5m 23s    the patch passed
          -0    checkstyle   0m 52s    hadoop-yarn-project/hadoop-yarn: The patch generated 14 new + 172 unchanged - 0 fixed = 186 total (was 172)
          +1    mvnsite      1m 11s    the patch passed
          +1    whitespace   0m 0s     The patch has no whitespace issues.
          +1    findbugs     2m 12s    the patch passed
          +1    javadoc      1m 10s    the patch passed
                Other Tests
          +1    unit         2m 42s    hadoop-yarn-common in the patch passed.
          +1    unit         13m 44s   hadoop-yarn-server-nodemanager in the patch passed.
          +1    asflicense   0m 33s    The patch does not generate ASF License warnings.
                             66m 16s



          Subsystem        Report/Notes
          Docker           Image:yetus/hadoop:14b5c93
          JIRA Issue       YARN-6837
          JIRA Patch URL   https://issues.apache.org/jira/secure/attachment/12877952/YARN-6837-1.patch
          Optional Tests   asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname            Linux 9d74dee2fe19 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool       maven
          Personality      /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision     trunk / df18025
          Default Java     1.8.0_131
          findbugs         v3.1.0-RC1
          findbugs         https://builds.apache.org/job/PreCommit-YARN-Build/16487/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-warnings.html
          checkstyle       https://builds.apache.org/job/PreCommit-YARN-Build/16487/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn.txt
          Test Results     https://builds.apache.org/job/PreCommit-YARN-Build/16487/testReport/
          modules          C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn
          Console output   https://builds.apache.org/job/PreCommit-YARN-Build/16487/console
          Powered by       Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          The findbugs warnings are unrelated to the patch.

          +1 lgtm. Committing this.

          jlowe Jason Lowe added a comment -

          Thanks, Jinjiang Ling! I committed this to trunk, branch-2, branch-2.8, and branch-2.8.2.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12038 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12038/)
          YARN-6837. Null LocalResource visibility or resource type can crash the (jlowe: rev c8df3668ecc37c2d58cad35520a762eaec3c8539)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestApplicationClientProtocolRecords.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java
          lingjinjiang Jinjiang Ling added a comment -

          Jason Lowe, thanks for your review.


            People

            • Assignee:
              lingjinjiang Jinjiang Ling
              Reporter:
              lingjinjiang Jinjiang Ling
            • Votes:
              0
              Watchers:
              3
