Hadoop Map/Reduce / MAPREDUCE-4144

ResourceManager NPE while handling NODE_UPDATE

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.3, 2.0.2-alpha
    • Component/s: mrv2
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      The RM on one of our clusters has exited twice in the past few days because of an NPE while trying to handle a NODE_UPDATE:

      2012-04-12 02:09:01,672 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
       [ResourceManager Event Processor]java.lang.NullPointerException
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:261)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:223)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.allocate(SchedulerApp.java:246)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1229)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignNodeLocalContainers(LeafQueue.java:1078)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1048)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:859)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:756)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:573)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:622)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:78)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:302)
              at java.lang.Thread.run(Thread.java:619)
      

      This is very similar to the failure reported in MAPREDUCE-3005.

      1. MAPREDUCE-4144-testcase.patch
        5 kB
        Jason Lowe
      2. MAPREDUCE-4144.patch
        7 kB
        Jason Lowe

        Activity

        Jason Lowe added a comment -

        Is the concern that with this change we won't remove the reservation or NODE_LOCAL request? This could still have happened in the case where the node doesn't free up sufficient resources before the application ends up finishing with containers on other nodes. Assuming the app doesn't complete first, I think the reservation will be cleaned up in assignReservedContainer() either because there are no more outstanding requests at the same priority or it will fill the reservation with an ANY request (since we know there aren't any more RACK_LOCAL requests in this scenario).

        But I might be misreading the code. If it's critical to allocate the reserved container as NODE_LOCAL once the node has enough free resources, we can undo this fix and put the rackLocal null check in AppSchedulingInfo.allocateNodeLocal.

        Jason Lowe added a comment -

        Do we want to allocate a node local container if there's no rack local to go with it? I thought the change in MAPREDUCE-3005 was specifically trying to avoid allocating in that case. Does the fact that the container is reserved change that somehow?

        Arun C Murthy added a comment -

        Sorry for coming late to this...

        I'm not sure this is the right fix. Couldn't we just handle the NPE with a null check in AppSchedulingInfo.allocateNodeLocal?
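
        As a rough illustration of the null-check alternative, here is a minimal sketch under stated assumptions. It is not the actual AppSchedulingInfo code: the class NullCheckSketch, the map of resource name to outstanding count, and the names "nodeA" and "rack1" are all hypothetical stand-ins for the per-priority request table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: resource name -> outstanding container count at one
// priority. This is NOT the actual AppSchedulingInfo code.
final class NullCheckSketch {
    static final String ANY = "*";

    // Node-local allocation updates the host, rack and ANY requests.
    static void allocateNodeLocal(Map<String, Integer> requests, String host, String rack) {
        decrement(requests, host);
        decrement(requests, rack); // may already be gone; tolerated below
        decrement(requests, ANY);
    }

    private static void decrement(Map<String, Integer> requests, String name) {
        Integer outstanding = requests.get(name);
        if (outstanding == null) {
            return; // the null check: nothing left to update for this name
        }
        int remaining = outstanding - 1;
        if (remaining == 0 && !ANY.equals(name)) {
            requests.remove(name); // host and rack entries are dropped at zero
        } else {
            requests.put(name, remaining); // ANY stays in the table
        }
    }

    // Starts from a table whose rack-local request was already satisfied
    // elsewhere, then performs the node-local allocation.
    static Map<String, Integer> demo() {
        Map<String, Integer> requests = new HashMap<>();
        requests.put("nodeA", 1);
        requests.put(ANY, 1);
        allocateNodeLocal(requests, "nodeA", "rack1");
        return requests;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

        In this model the node-local allocation goes ahead even though no rack-local request remains, so the guard trades the crash for an allocation that the rack-level bookkeeping no longer expects.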

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1049 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1049/)
        MAPREDUCE-4144. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325991)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325991
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1014 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1014/)
        MAPREDUCE-4144. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325991)

        Result = FAILURE
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325991
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #227 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/227/)
        Merge MAPREDUCE-4144 from trunk. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325994)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325994
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #2086 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/2086/)
        MAPREDUCE-4144. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325991)

        Result = FAILURE
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325991
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #2072 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/2072/)
        MAPREDUCE-4144. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325991)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325991
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #2145 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/2145/)
        MAPREDUCE-4144. Fix a NPE in the ResourceManager when handling node updates. (Contributed by Jason Lowe) (Revision 1325991)

        Result = SUCCESS
        sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1325991
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
        Siddharth Seth added a comment -

        Committed to trunk, branch-2 and branch-0.23

        Siddharth Seth added a comment -

        +1. lgtm. Thanks Jason.
        The same won't occur with off-switch requests since "*" seems to be a mandatory request - and is never removed from the request table.

        Jason Lowe added a comment -

        All of these test failures are unrelated to the patch. All the tests except TestAMAuthorization are failing due to socket bind errors (runaway processes on the test machine?). TestAMAuthorization is testing unrelated code, and the test passes for me when I run it locally.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12522497/MAPREDUCE-4144.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests:
        org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
        org.apache.hadoop.yarn.server.TestDiskFailures
        org.apache.hadoop.yarn.server.TestContainerManagerSecurity
        org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
        org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
        org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
        org.apache.hadoop.mapred.TestMiniMRClasspath
        org.apache.hadoop.mapreduce.v2.TestMRJobs
        org.apache.hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers
        org.apache.hadoop.mapred.TestMiniMRBringup
        org.apache.hadoop.mapred.TestMiniMRChildTask
        org.apache.hadoop.mapred.TestReduceFetch
        org.apache.hadoop.mapred.TestClusterMRNotification
        org.apache.hadoop.mapred.TestReduceFetchFromPartialMem
        org.apache.hadoop.mapred.TestJobCounters
        org.apache.hadoop.mapreduce.TestChild
        org.apache.hadoop.mapred.TestMiniMRClientCluster
        org.apache.hadoop.ipc.TestSocketFactory
        org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService
        org.apache.hadoop.mapreduce.v2.TestMROldApiJobs
        org.apache.hadoop.mapreduce.v2.TestSpeculativeExecution
        org.apache.hadoop.mapreduce.lib.output.TestJobOutputCommitter
        org.apache.hadoop.mapred.TestClientRedirect
        org.apache.hadoop.mapred.TestLazyOutput
        org.apache.hadoop.mapred.TestJobCleanup
        org.apache.hadoop.mapreduce.TestMapReduceLazyOutput
        org.apache.hadoop.mapred.TestSpecialCharactersInOutputPath
        org.apache.hadoop.mapreduce.v2.TestMRAppWithCombiner
        org.apache.hadoop.conf.TestNoDefaultsJobConf
        org.apache.hadoop.mapreduce.v2.TestRMNMInfo
        org.apache.hadoop.mapred.TestClusterMapReduceTestCase
        org.apache.hadoop.mapreduce.v2.TestNonExistentJob
        org.apache.hadoop.mapred.TestJobSysDirWithDFS
        org.apache.hadoop.mapreduce.v2.TestUberAM
        org.apache.hadoop.mapreduce.v2.TestMiniMRProxyUser
        org.apache.hadoop.mapred.TestJobName
        org.apache.hadoop.mapreduce.security.TestJHSSecurity

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2216//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2216//console

        This message is automatically generated.

        Jason Lowe added a comment -

        Patch that avoids the early-out in canAssign() for a reserved container if we're not doing OFF-SWITCH.

        Not sure this is the best fix, as we could still have a reserved container on this node that has already been handled by another node. But at least we won't crash the RM when that happens.
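
        The idea can be sketched roughly as follows. This is a simplified, hypothetical model rather than the LeafQueue code or the attached patch: the class CanAssignSketch, the map of resource name to outstanding count, the hasReservation flag, and the reduction of the off-switch check to "some ANY request is outstanding" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the canAssign() decision. NOT the LeafQueue code.
final class CanAssignSketch {
    enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    static final String ANY = "*";

    static boolean canAssign(Map<String, Integer> requests, String host, String rack,
                             NodeType type, boolean hasReservation) {
        if (type == NodeType.OFF_SWITCH) {
            // A reservation short-circuits only the off-switch case.
            return hasReservation || outstanding(requests, ANY) > 0;
        }
        // For NODE_LOCAL and RACK_LOCAL the rack request must still exist,
        // whether or not a container is reserved on this node.
        if (outstanding(requests, rack) <= 0) {
            return false;
        }
        if (type == NodeType.RACK_LOCAL) {
            return true;
        }
        return outstanding(requests, host) > 0;
    }

    private static int outstanding(Map<String, Integer> requests, String name) {
        return requests.getOrDefault(name, 0);
    }

    // Table state once the rack-local request has been satisfied on another
    // node: the host and ANY entries remain, the rack entry is gone.
    static Map<String, Integer> tableAfterRackLocalFulfilled() {
        Map<String, Integer> requests = new HashMap<>();
        requests.put("nodeA", 1);
        requests.put(ANY, 1);
        return requests;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = tableAfterRackLocalFulfilled();
        System.out.println(canAssign(table, "nodeA", "rack1", NodeType.NODE_LOCAL, true));
        System.out.println(canAssign(table, "nodeA", "rack1", NodeType.OFF_SWITCH, true));
    }
}
```

        With the rack entry gone, the node-local check returns false even for a reserved container, so the allocation path that dereferences the missing rack request is never entered, while the off-switch path is still open to the reservation.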

        Jason Lowe added a comment -

        I think the fix for MAPREDUCE-3005 is being skipped due to reserved containers. Here's the scenario:

        1. Application has some requests that are NODE_LOCAL and some others that are ANY.
        2. Node A heartbeats in and we try to schedule the NODE_LOCAL request on it, but there are no available containers and instead we make a reservation.
        3. Node B heartbeats in and it's on the same rack as Node A, so we fulfill the corresponding RACK_LOCAL request that went with Node A's NODE_LOCAL request.
        4. Node A heartbeats in with some spare containers, and we skip the MAPREDUCE-3005 fix in canAssign() because there is a reserved container on this node. Since the RACK_LOCAL request was removed when we assigned it to Node B, we crash because we assume all NODE_LOCAL requests will have a corresponding RACK_LOCAL request.
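
        The four steps above can be replayed against a toy request table. This is a minimal sketch under stated assumptions, not the scheduler code or the attached testcase: the class ReservationNpeSketch, its Request type, and the names "nodeA" and "rack1" are hypothetical, and the reservation in step 2 is modeled simply as "nothing is consumed".

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of one application's request table at a
// single priority. This is NOT the Hadoop scheduler code.
final class ReservationNpeSketch {
    static final String ANY = "*";

    // Mutable request entry: how many containers are still wanted.
    static final class Request {
        int numContainers;
        Request(int numContainers) { this.numContainers = numContainers; }
    }

    private final Map<String, Request> requests = new HashMap<>();

    void ask(String resourceName, int count) {
        requests.put(resourceName, new Request(count));
    }

    // Decrement a request; host and rack entries are dropped at zero, ANY is
    // kept. Mirrors the assumption from step 4 that the entry always exists.
    private void consume(String resourceName) {
        Request r = requests.get(resourceName);
        r.numContainers--; // throws NullPointerException if the entry is gone
        if (r.numContainers == 0 && !ANY.equals(resourceName)) {
            requests.remove(resourceName);
        }
    }

    void allocateRackLocal(String rack) {
        consume(rack);
        consume(ANY);
    }

    void allocateNodeLocal(String host, String rack) {
        consume(host);
        consume(rack);
        consume(ANY);
    }

    // Replays steps 1-4 and reports whether the NPE occurred.
    static boolean reproduces() {
        ReservationNpeSketch app = new ReservationNpeSketch();
        // Step 1: one node-local ask (with its rack entry) plus one extra ANY.
        app.ask("nodeA", 1);
        app.ask("rack1", 1);
        app.ask(ANY, 2);
        // Step 2: nodeA has no room, so a reservation is made; nothing is consumed.
        // Step 3: another node on rack1 satisfies the rack-local request.
        app.allocateRackLocal("rack1");
        // Step 4: nodeA frees up and the reserved node-local allocation runs.
        try {
            app.allocateNodeLocal("nodeA", "rack1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + reproduces());
    }
}
```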

        I checked the RM log above the crash, and I did find indications of container reservations being in play. For example:

         [ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1334157153376_0281 on node: xxx:8041
         [ResourceManager Event Processor]2012-04-12 02:09:01,671 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp: Application application_1334157153376_0281 unreserved  on node host: xxx:8041 #containers=3 available=7680 used=13824, currently has 70 at priority org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@33; currentReservation memory: 322560
        

        Attached is a testcase that reproduces the NPE crash with the same backtrace.


          People

          • Assignee: Jason Lowe
          • Reporter: Jason Lowe
          • Votes: 0
          • Watchers: 6
