  1. Hadoop YARN
  2. YARN-5195

RM intermittently crashed with NPE while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: resourcemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      While running Gridmix experiments, we once hit an incident where the RM went down with the following exception:

      2016-05-28 15:45:24,459 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
      java.lang.NullPointerException
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1282)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:860)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1319)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:704)
              at java.lang.Thread.run(Thread.java:745)
      2016-05-28 15:45:24,460 [ApplicationMasterLauncher #49] INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1464449118385_0006_000001
      2016-05-28 15:45:24,460 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
      
      1. YARN-5195.01.patch
        1 kB
        sandflee
      2. YARN-5195.02.patch
        4 kB
        sandflee
      3. YARN-5195.03.patch
        4 kB
        sandflee

          Activity

          leftnoteasy Wangda Tan added a comment -

          I investigated this issue. It only happens when async scheduling is enabled: a container gets allocated to a node after that node has been removed from the scheduler.

          Logs look like:

          2016-05-28 15:45:18,502 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1464449118385_0006_01_000324 of capacity <memory:2048, vCores:1> on host cn042-10.l42scl.hortonworks.com:49161, which currently has 0 containers, <memory:0, vCores:0> used and <memory:49152, vCores:12> available, release resources=true
          2016-05-28 15:45:18,503 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node node-1:49161 clusterResource: <memory:442368, vCores:108>
          2016-05-28 15:45:18,526 [Thread-12] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1464449118385_0006_01_000382 of capacity <memory:2048, vCores:1> on host node-1:49161, which has 1 containers, <memory:2048, vCores:1> used and <memory:47104, vCores:11> available after allocation
          

          Adding lock protection to the async scheduling thread could prevent this from happening.
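
          The ordering in the three log lines can be replayed deterministically in a toy model. The sketch below is not CapacityScheduler code: `ToyNode`, `ToyTracker`, and `replay` are hypothetical names, and the model only assumes that the async thread works from a copied node list while cleanup looks nodes up in the live tracker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these classes are hypothetical stand-ins, not YARN classes.
class StaleSnapshotDemo {
    static final class ToyNode {
        final String id;
        final List<String> containers = new ArrayList<>();
        ToyNode(String id) { this.id = id; }
    }

    // Plays the role of the scheduler's node tracker: node id -> live node.
    static final class ToyTracker {
        private final Map<String, ToyNode> nodes = new HashMap<>();
        void add(ToyNode n) { nodes.put(n.id, n); }
        void remove(String id) { nodes.remove(id); }
        ToyNode get(String id) { return nodes.get(id); }
        List<ToyNode> snapshot() { return new ArrayList<>(nodes.values()); }
    }

    /** Replays the logged order of events; returns what the tracker lookup yields at cleanup. */
    static ToyNode replay() {
        ToyTracker tracker = new ToyTracker();
        tracker.add(new ToyNode("node-1:49161"));

        // Async scheduling thread: copy the node list before scheduling.
        List<ToyNode> snapshot = tracker.snapshot();

        // Event processor thread: the node is removed from the scheduler.
        tracker.remove("node-1:49161");

        // Async scheduling thread: still holds the stale reference and assigns a container.
        ToyNode stale = snapshot.get(0);
        stale.containers.add("container_000382");

        // Later cleanup resolves the container's node through the live tracker.
        return tracker.get(stale.id);
    }

    public static void main(String[] args) {
        System.out.println("node at cleanup: " + replay()); // prints "node at cleanup: null"
    }
}
```

          Any cleanup code that dereferences the looked-up node without a null check would fail at this point, which is consistent with the NPE in the description.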

          leftnoteasy Wangda Tan added a comment -

          I don't have the bandwidth to do this now; please feel free to pick it up if you have time.

          sandflee sandflee added a comment -

          AsyncSchedulerThread copies all nodes from nodeTracker before calling attemptScheduling on each node, which opens a race condition:
          1. All nodes are copied from nodeTracker.
          2. nodeA is lost and removed from the scheduler; all of its launched containers are cleaned up.
          3. When the app attempt completes, a container that was allocated (or reserved) on nodeA refers to a node that no longer exists.
          The same problem was fixed for the FairScheduler in YARN-3675. I have attached an initial patch and will add a test later.
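
          One way to close the race described in the steps above is to re-validate the node against the live tracker at the start of allocation and skip it if it is gone. The following is a minimal sketch under that assumption, not the attached patch: `GuardedAllocationSketch` and its fields are hypothetical, and a plain map stands in for nodeTracker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy scheduler; names do not correspond to the real patch.
class GuardedAllocationSketch {
    // Stand-in for the node tracker: node id -> number of allocated containers.
    private final Map<String, Integer> containersByNode = new HashMap<>();

    synchronized void addNode(String nodeId) {
        containersByNode.put(nodeId, 0);
    }

    synchronized void removeNode(String nodeId) {
        containersByNode.remove(nodeId);
    }

    /**
     * Called with a node id taken from a possibly stale snapshot.
     * Returns true if a container was allocated, false if the node was skipped.
     */
    synchronized boolean allocateContainersToNode(String nodeId) {
        // Re-check under the lock: the node may have been removed since the snapshot.
        Integer current = containersByNode.get(nodeId);
        if (current == null) {
            return false; // skip the removed node instead of allocating on it
        }
        containersByNode.put(nodeId, current + 1);
        return true;
    }

    synchronized int containerCount(String nodeId) {
        return containersByNode.getOrDefault(nodeId, 0);
    }
}
```

          With this guard, an allocation attempt on a node removed after the snapshot was taken becomes a no-op, so no container is left pointing at a node the scheduler no longer tracks.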

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 9m 18s trunk passed
          +1 compile 0m 39s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 44s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 7s trunk passed
          +1 javadoc 0m 23s trunk passed
          +1 mvninstall 0m 36s the patch passed
          +1 compile 0m 36s the patch passed
          +1 javac 0m 36s the patch passed
          +1 checkstyle 0m 21s the patch passed
          +1 mvnsite 0m 41s the patch passed
          +1 mvneclipse 0m 16s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 14s the patch passed
          +1 javadoc 0m 21s the patch passed
          -1 unit 37m 31s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          55m 51s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819231/YARN-5195.01.patch
          JIRA Issue YARN-5195
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux bced71b4d78d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 521f343
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12431/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12431/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12431/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12431/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 8m 15s trunk passed
          +1 compile 0m 36s trunk passed
          +1 checkstyle 0m 26s trunk passed
          +1 mvnsite 0m 46s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 3s trunk passed
          +1 javadoc 0m 26s trunk passed
          +1 mvninstall 0m 37s the patch passed
          +1 compile 0m 34s the patch passed
          +1 javac 0m 34s the patch passed
          -1 checkstyle 0m 25s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 291 unchanged - 0 fixed = 292 total (was 291)
          +1 mvnsite 0m 42s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 11s the patch passed
          +1 javadoc 0m 22s the patch passed
          -1 unit 33m 58s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          51m 19s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819430/YARN-5195.02.patch
          JIRA Issue YARN-5195
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 998c78924c17 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ecff7d0
          Default Java 1.8.0_91
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12446/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12446/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12446/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          sandflee sandflee added a comment -

          Updated the patch to fix the checkstyle warning. The failed test passes locally and seems unrelated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 50s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 38s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 57s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 33s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          +1 checkstyle 0m 21s the patch passed
          +1 mvnsite 0m 37s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 2s the patch passed
          +1 javadoc 0m 19s the patch passed
          -1 unit 36m 9s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          52m 4s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819522/YARN-5195.03.patch
          JIRA Issue YARN-5195
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ea77b5fb38ab 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 132deb4
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12453/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12453/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12453/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12453/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment -

          Hi sandflee,
          Thanks for the patch. I have a question.

          1. all nodes copied from nodeTracker

          Since we copy all nodes from nodeTracker, we could lose a node at any time during the allocation process. Currently the null check is added only at the start of allocateContainersToNode, so is it possible to lose the node after that step too? Do we need a lock here to avoid the problem, such as a lock on the node being operated on? Please correct me if I have misunderstood the problem.

          sandflee sandflee added a comment -

          Thanks Sunil G. nodeTracker#remove is invoked from Scheduler#removeNode and Scheduler#updateNodeResource, and both are synchronized with Scheduler#allocateContainersToNode, so it is safe for now.
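
          The locking argument can be illustrated with a small stress sketch. It assumes, as stated above, that node removal and allocation take the same lock; `SharedLockSketch` and its members are hypothetical names, and `synchronized` methods on one object stand in for the scheduler lock.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: removal and allocation share one monitor (this object).
class SharedLockSketch {
    private final Set<String> trackedNodes = new HashSet<>();
    private final AtomicInteger violations = new AtomicInteger();

    synchronized void addNode(String id) { trackedNodes.add(id); }

    synchronized void removeNode(String id) { trackedNodes.remove(id); }

    synchronized void allocateContainersToNode(String id) {
        if (!trackedNodes.contains(id)) {
            return; // node was removed before this thread took the lock
        }
        Thread.yield(); // invite the remover to run; it blocks on the lock instead
        // removeNode cannot interleave here, so the check made at entry still holds.
        if (!trackedNodes.contains(id)) {
            violations.incrementAndGet();
        }
    }

    /** Runs an allocator and a remover concurrently; returns how often the node vanished mid-allocation. */
    static int run(int rounds) {
        SharedLockSketch s = new SharedLockSketch();
        s.addNode("nodeA");
        Thread allocator = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                s.allocateContainersToNode("nodeA");
            }
        });
        Thread remover = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                s.removeNode("nodeA");
                s.addNode("nodeA");
            }
        });
        allocator.start();
        remover.start();
        try {
            allocator.join();
            remover.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return s.violations.get();
    }
}
```

          Because both paths hold the same monitor, a single check at the start of the allocation method covers the whole method body. If removal were moved outside that lock, the second check could start to fail and a per-node lock of the kind asked about earlier would be needed.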

          sunilg Sunil G added a comment -

          Yes sandflee, that makes sense.

          sunilg Sunil G added a comment -

          Patch looks fine to me. Thanks sandflee.

          leftnoteasy Wangda Tan added a comment -

          +1, thanks sandflee, will commit soon if no objections.

          leftnoteasy Wangda Tan added a comment -

          Committed to trunk and branch-2. Thanks to sandflee for the patch and to Sunil G for the reviews.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10161 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10161/)
          YARN-5195. RM intermittently crashed with NPE while handling (wangda: rev d62e121ffc0239e7feccc1e23ece92c5fac685f6)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          sandflee sandflee added a comment -

          Thanks Wangda Tan and Sunil G for reviewing and committing.


            People

            • Assignee:
              sandflee sandflee
              Reporter:
              karams Karam Singh
            • Votes:
              0
              Watchers:
              14
