Hadoop YARN / YARN-3790

usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container

    Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The failure trace is as follows:

      Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
      testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)  Time elapsed: 6.502 sec  <<< FAILURE!
      java.lang.AssertionError: expected:<6144> but was:<8192>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.junit.Assert.assertEquals(Assert.java:555)
      	at org.junit.Assert.assertEquals(Assert.java:542)
      	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
      	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
      	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
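      For context (the numbers come from the scheduler logs quoted in the comments below, not from this trace alone): the assertion checks the root queue's available memory after recovery. With an <memory:8192, vCores:8> node and two recovered 1024 MB containers, the expected available memory is 8192 - 2048 = 6144 MB, while the stale metric still reports the full 8192 MB.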
      

        Issue Links

          Activity

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2185 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2185/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          • hadoop-yarn-project/CHANGES.txt
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #237 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/237/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          • hadoop-yarn-project/CHANGES.txt
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #228 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/228/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2167 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2167/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          • hadoop-yarn-project/CHANGES.txt
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #969 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/969/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #239 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/239/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          • hadoop-yarn-project/CHANGES.txt
            Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)
          • hadoop-yarn-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8060 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8060/)
          Move YARN-3790 from 2.7.1 to 2.8 in CHANGES.txt (rohithsharmaks: rev 2df00d53d13d16628b6bde5e05133d239f138f52)

          • hadoop-yarn-project/CHANGES.txt
          zhihai xu added a comment -

          Thanks Jian He and Masatake Iwasaki for the review! Thanks Rohith Sharma K S for the review and for committing the patch!

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8059 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8059/)
          YARN-3790. usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (Zhihai Xu via rohithsharmaks) (rohithsharmaks: rev dd4b387d96abc66ddebb569b3775b18b19aed027)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
          Rohith Sharma K S added a comment -

          Committed to trunk/branch-2. Thanks zhihai xu for your contributions and Jian He and Masatake Iwasaki for review.

          Masatake Iwasaki added a comment -

          I'm +1(non-binding) too. Thanks for working on this. I saw the test failure 2 times on YARN-3705 and would like this to come in.

          Jian He added a comment -

          lgtm, thanks zhihai xu and Rohith Sharma K S

          Rohith Sharma K S added a comment -

          Jian He, do you have any comments on the patch?

          zhihai xu added a comment -

          Rohith Sharma K S, thanks for the review. Yes, I just updated the component to FairScheduler.

          Rohith Sharma K S added a comment -

          Thanks @zhihai for your detailed explanation. I understand the problem now.
          Overall the patch looks good to me. I think we should change this JIRA's component to the scheduler, since the code change is in FairScheduler.

          Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 3s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 tests included 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 javac 7m 33s There were no new javac warning messages.
          +1 javadoc 9m 36s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 45s There were no new checkstyle issues.
          +1 whitespace 0m 0s The patch has no lines that end in whitespace.
          +1 install 1m 33s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 26s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 50m 55s Tests passed in hadoop-yarn-server-resourcemanager.
              88m 51s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12738905/YARN-3790.000.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / c7729ef
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8235/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8235/testReport/
          Java 1.7.0_55
          uname Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8235/console

          This message was automatically generated.

          zhihai xu added a comment -

          I uploaded a patch, YARN-3790.000.patch, which moves updateRootQueueMetrics after recoverContainersOnNode in addNode.
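          For reference, a minimal sketch of the reordering described above; the method signature and surrounding statements are illustrative assumptions, not the actual FairScheduler source:

          // Illustrative sketch only - not the real FairScheduler.addNode implementation.
          private synchronized void addNode(List<NMContainerStatus> containerReports, RMNode node) {
            FSSchedulerNode schedulerNode = new FSSchedulerNode(node, usePortForNodeName);
            nodes.put(node.getNodeID(), schedulerNode);
            Resources.addTo(clusterResource, node.getTotalCapability());

            // Count the recovered containers as used first ...
            recoverContainersOnNode(containerReports, node);
            // ... and only then recompute the root queue's available resource
            // (capacity - used), so the metric is not computed from a stale used value.
            updateRootQueueMetrics();
          }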

          zhihai xu added a comment -

          Rohith Sharma K S, thanks for updating the title.
          The containers are recovered and rootMetrics's used resource is updated, but rootMetrics's available resource is not.
          The following logs from the failed test prove it:

          2015-06-09 22:55:42,964 INFO  [ResourceManager Event Processor] fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
          2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433915736884_0001 of type NODE_UPDATE
          2015-06-09 22:55:42,964 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(820)) - Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default is <memory:0, vCores:0>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.default <memory:0, vCores:0>; Total resource consumption for root now <memory:0, vCores:0>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from application_1433915736884_0001 <memory:0, vCores:0>; Total resource consumption for root.zxu now <memory:0, vCores:0>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is <memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is <memory:0, vCores:0>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu <memory:0, vCores:0>; Total resource consumption for root now <memory:0, vCores:0>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(177)) - The updated demand for root is <memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 22:55:42,964 DEBUG [FairSchedulerUpdateThread] fair.FSQueue (FSQueue.java:setFairShare(196)) - The updated fairShare for root is <memory:8192, vCores:8>
          2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1 } id: 1 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 22:55:42,965 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000001 of type RECOVER
          2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType: APP_RUNNING_ON_NODE
          2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000001 Container Transitioned from NEW to RUNNING
          2015-06-09 22:55:42,965 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433915736884_0001 of type APP_RUNNING_ON_NODE
          2015-06-09 22:55:42,965 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_1433915736884_0001_01_000001 of capacity <memory:1024, vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
          2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433915736884_0001_000001 is recovering container container_1433915736884_0001_01_000001
          2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1 } id: 2 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 22:55:42,966 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000002 of type RECOVER
          2015-06-09 22:55:42,966 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType: APP_RUNNING_ON_NODE
          2015-06-09 22:55:42,966 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000002 Container Transitioned from NEW to RUNNING
          2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433915736884_0001 of type APP_RUNNING_ON_NODE
          2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_1433915736884_0001_01_000002 of capacity <memory:1024, vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048, vCores:2> used and <memory:6144, vCores:6> available after allocation
          2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433915736884_0001_000001 is recovering container container_1433915736884_0001_01_000002
          2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433915736884 } attemptId: 1 } id: 3 } container_state: C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 22:55:42,967 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433915736884_0001_01_000003 of type RECOVER
          2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType: CONTAINER_FINISHED
          2015-06-09 22:55:42,967 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433915736884_0001_01_000003 Container Transitioned from NEW to COMPLETED
          2015-06-09 22:55:42,967 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event for appattempt_1433915736884_0001_000001 of type CONTAINER_FINISHED
          Container State is : RUNNING
          Container State is : RUNNING
          Received completed containers [ContainerStatus: [ContainerId: container_1433915736884_0001_01_000003, State: COMPLETE, Diagnostics: recover container, ExitStatus: 0, ]]
          2015-06-09 22:55:43,266 DEBUG [Thread-645] fair.FSAppAttempt (FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for application_1433915736884_0001:Min((queueFairShare=<memory:8192, vCores:8> - queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144, vCores:6>Headroom=<memory:6144, vCores:6>
          

          The Headroom calculation log shows queueUsage=<memory:2048, vCores:2>; the code that updates rootMetrics's used resource is in AppSchedulingInfo#recoverContainer:

          metrics.allocateResources(user, 1, rmContainer.getAllocatedResource(), false);
          

          And we can see that update/updateDemand is called by the UpdateThread before recoverContainersOnNode.

          The following is the log from a successful run of the test.

          2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] fair.FairScheduler (FairScheduler.java:addNode(855)) - Added node 127.0.0.1:1234 cluster capacity: <memory:8192, vCores:8>
          2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433918447351_0001 of type NODE_UPDATE
          2015-06-09 23:40:53,432 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:processNodeUpdate(820)) - Received node update event:NODE_USABLE for node:127.0.0.1:1234 with state:RUNNING
          2015-06-09 23:40:53,432 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1 } id: 1 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 23:40:53,433 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000001 of type RECOVER
          2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000001 Container Transitioned from NEW to RUNNING
          2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType: APP_RUNNING_ON_NODE
          2015-06-09 23:40:53,433 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_1433918447351_0001_01_000001 of capacity <memory:1024, vCores:1> on host 127.0.0.1:1234, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
          2015-06-09 23:40:53,433 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433918447351_0001 of type APP_RUNNING_ON_NODE
          2015-06-09 23:40:53,434 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433918447351_0001_000001 is recovering container container_1433918447351_0001_01_000001
          2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1 } id: 2 } container_state: C_RUNNING resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 23:40:53,435 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000002 of type RECOVER
          2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000002 Container Transitioned from NEW to RUNNING
          2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppRunningOnNodeEvent.EventType: APP_RUNNING_ON_NODE
          2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_1433918447351_0001_01_000002 of capacity <memory:1024, vCores:1> on host 127.0.0.1:1234, which has 2 containers, <memory:2048, vCores:2> used and <memory:6144, vCores:6> available after allocation
          2015-06-09 23:40:53,435 DEBUG [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(756)) - Processing event for application_1433918447351_0001 of type APP_RUNNING_ON_NODE
          2015-06-09 23:40:53,435 INFO  [ResourceManager Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:recoverContainer(651)) - SchedulerAttempt appattempt_1433918447351_0001_000001 is recovering container container_1433918447351_0001_01_000002
          2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:recoverContainersOnNode(349)) - Recovering container container_id { app_attempt_id { application_id { id: 1 cluster_timestamp: 1433918447351 } attemptId: 1 } id: 3 } container_state: C_COMPLETE resource { memory: 1024 virtual_cores: 1 } priority { priority: 0 } diagnostics: "recover container" container_exit_status: 0 creation_time: 0 nodeLabelExpression: ""
          2015-06-09 23:40:53,436 DEBUG [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(382)) - Processing container_1433918447351_0001_01_000003 of type RECOVER
          2015-06-09 23:40:53,436 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(394)) - container_1433918447351_0001_01_000003 Container Transitioned from NEW to COMPLETED
          2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(167)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.event.RMAppAttemptContainerFinishedEvent.EventType: CONTAINER_FINISHED
          2015-06-09 23:40:53,436 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(781)) - Processing event for appattempt_1433918447351_0001_000001 of type CONTAINER_FINISHED
          2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.default is <memory:0, vCores:0>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.default is <memory:0, vCores:0>
          2015-06-09 23:40:53,436 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.default <memory:0, vCores:0>; Total resource consumption for root now <memory:0, vCores:0>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemandForApp(298)) - Counting resource from application_1433918447351_0001 <memory:2048, vCores:2>; Total resource consumption for root.zxu now <memory:0, vCores:0>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(287)) - The updated demand for root.zxu is <memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSLeafQueue (FSLeafQueue.java:updateDemand(289)) - The updated fairshare for root.zxu is <memory:0, vCores:0>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(163)) - Counting resource from root.zxu <memory:2048, vCores:2>; Total resource consumption for root now <memory:0, vCores:0>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSParentQueue (FSParentQueue.java:updateDemand(177)) - The updated demand for root is <memory:2048, vCores:2>; the max is <memory:2147483647, vCores:2147483647>
          2015-06-09 23:40:53,437 DEBUG [FairSchedulerUpdateThread] fair.FSQueue (FSQueue.java:setFairShare(196)) - The updated fairShare for root is <memory:8192, vCores:8>
          Container State is : RUNNING
          Container State is : RUNNING
          Received completed containers [ContainerStatus: [ContainerId: container_1433918447351_0001_01_000003, State: COMPLETE, Diagnostics: recover container, ExitStatus: 0, ]]
          2015-06-09 23:40:53,632 INFO  [Thread-643] resourcemanager.TestWorkPreservingRMRestart (TestWorkPreservingRMRestart.java:testSchedulerRecovery(240)) - testSchedulerRecovery end
          2015-06-09 23:40:53,733 DEBUG [Thread-643] fair.FSAppAttempt (FSAppAttempt.java:getHeadroom(200)) - Headroom calculation for application_1433918447351_0001:Min((queueFairShare=<memory:8192, vCores:8> - queueUsage=<memory:2048, vCores:2>), maxAvailableResource=<memory:6144, vCores:6>Headroom=<memory:6144, vCores:6>
          

          And we can see that update/updateDemand is called by the UpdateThread after recoverContainersOnNode. The test passed because update set rootMetrics's available resource correctly after recoverContainersOnNode had updated rootMetrics's used resource.

          Rohith Sharma K S added a comment -

          Thanks for looking into this issue.

          If UpdateThread call update after recoverContainersOnNode, the test will succeed

          In the test, I see the code below, which verifies that the containers are recovered, right?

              // Wait for RM to settle down on recovering containers;
              waitForNumContainersToRecover(2, rm2, am1.getApplicationAttemptId());
              Set<ContainerId> launchedContainers =
                  ((RMNodeImpl) rm2.getRMContext().getRMNodes().get(nm1.getNodeId()))
                    .getLaunchedContainers();
              assertTrue(launchedContainers.contains(amContainer.getContainerId()));
              assertTrue(launchedContainers.contains(runningContainer.getContainerId()));
          

          Am I missing anything?

          Rohith Sharma K S added a comment -

          I think this test fails intermittently.

          Yes, it is failing intermittently. Maybe the issue summary can be updated.

          zhihai xu added a comment -

          Hi Rohith Sharma K S, thanks for reporting this issue. I think this test fails intermittently.
          The following is the stack trace for the test failure:

          java.lang.AssertionError: expected:<6144> but was:<8192>
          	at org.junit.Assert.fail(Assert.java:88)
          	at org.junit.Assert.failNotEquals(Assert.java:743)
          	at org.junit.Assert.assertEquals(Assert.java:118)
          	at org.junit.Assert.assertEquals(Assert.java:555)
          	at org.junit.Assert.assertEquals(Assert.java:542)
          	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:852)
          	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:341)
          	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:240)
          

          The failure is that rootMetrics's available resource is not correct for the FairScheduler, and I know what causes it.
          For the FairScheduler, updateRootQueueMetrics is used to update rootMetrics's available resource.
          But updateRootQueueMetrics is not called in or after recoverContainersOnNode, so in this case we can only depend on the UpdateThread to update rootMetrics's available resource. Currently the UpdateThread is triggered in addNode, so its timing decides the outcome: if the UpdateThread calls update after recoverContainersOnNode, the test succeeds; if it calls update before recoverContainersOnNode, the test fails.
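          To make the ordering concrete, here is a self-contained toy simulation of the race; it uses no YARN classes, and every name in it is made up for illustration:

          // Toy model of "available = capacity - used" going stale when computed too early.
          public class StaleAvailableDemo {
            static int capacityMB = 0;   // cluster capacity tracked by the scheduler
            static int usedMB = 0;       // resources counted as used by the root queue metrics
            static int availableMB = 0;  // the value assertMetrics reads in the test

            // Stands in for updateRootQueueMetrics: available = capacity - used.
            static void updateRootQueueMetrics() {
              availableMB = capacityMB - usedMB;
            }

            public static void main(String[] args) {
              capacityMB = 8192;          // addNode: an <memory:8192, vCores:8> node registers

              // Failing ordering: update runs before container recovery, so it sees used == 0.
              updateRootQueueMetrics();
              usedMB += 2048;             // recoverContainersOnNode counts two 1024 MB containers
              System.out.println("stale available = " + availableMB);  // 8192, the failing value

              // Ordering after the fix: recompute available once recovery has been counted.
              updateRootQueueMetrics();
              System.out.println("fresh available = " + availableMB);  // 6144, the expected value
            }
          }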


            People

            • Assignee: zhihai xu
            • Reporter: Rohith Sharma K S
            • Votes: 0
            • Watchers: 11