Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
3.0.0-alpha1
-
None
-
Reviewed
Description
1. Decommission one NodeManager by adding its IP to the exclude hosts file.
2. Remove the IP from the exclude hosts file.
3. Run the -refreshNodes command and restart the decommissioned NodeManager.
Observe that the RM UI shows a negative value for the "Decommissioned Nodes" field.
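The sequence above can be modelled with a toy counter. This is only a sketch, not ResourceManager code: the class DecommissionGauge and all of its methods are hypothetical names, and the host address is made up. It assumes, following the analysis in the comments of this issue, that a refresh overwrites the metric with the size of the exclude list and that a rejoining node which was previously decommissioned decrements the metric unconditionally.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the RM's "Decommissioned Nodes" metric; not Hadoop code.
class DecommissionGauge {
    private int decommissioned = 0;
    private final Set<String> excludeList = new HashSet<>();

    void addToExcludeFile(String host) { excludeList.add(host); }
    void removeFromExcludeFile(String host) { excludeList.remove(host); }

    // Assumed pre-fix behaviour: a refresh overwrites the metric with the exclude list size.
    void refreshNodes() { decommissioned = excludeList.size(); }

    // Assumed pre-fix behaviour: a rejoining, previously decommissioned node
    // decrements the metric without checking its current value.
    void rejoinDecommissionedNode(String host) { decommissioned--; }

    int value() { return decommissioned; }

    // Replays the three steps from the description and returns the final metric.
    static int reproduce() {
        DecommissionGauge gauge = new DecommissionGauge();
        gauge.addToExcludeFile("10.0.0.1");         // step 1: exclude the node
        gauge.refreshNodes();                       // metric becomes 1
        gauge.removeFromExcludeFile("10.0.0.1");    // step 2: remove it again
        gauge.refreshNodes();                       // step 3: metric is overwritten with 0
        gauge.rejoinDecommissionedNode("10.0.0.1"); // NodeManager restart: metric drops below 0
        return gauge.value();
    }
}
```

Under these assumptions, DecommissionGauge.reproduce() ends one below zero, which matches the negative value seen in the UI.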
Attachments
- YARN-2523.patch
- 4 kB
- Rohith Sharma K S
- YARN-2523.patch
- 4 kB
- Rohith Sharma K S
- YARN-2523.1.patch
- 8 kB
- Rohith Sharma K S
- YARN-2523.2.patch
- 9 kB
- Rohith Sharma K S
Activity
Uploaded a patch to fix this issue.
Test details:
1. Reproduced the issue with a test, then applied the patch.
2. Removed a duplicate assertion line.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12668568/YARN-2523.patch
against trunk revision 98588cf.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4953//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4953//console
This message is automatically generated.
Verified the fix again. The decommissioned node count should not be decremented again in RMNodeImpl#updateMetricsForRejoinedNode() based on the previous state, because the latest decommissioned node count has already been updated by AdminService#refreshNodes().
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12668575/YARN-2523.patch
against trunk revision 98588cf.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4954//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4954//console
This message is automatically generated.
Checked the test failure; it is not related to the fix. I raised a new ticket, YARN-2550, to track it.
Thanks for the patch, Rohith!
It seems inconsistent that we call incrDecommissionedNMs but never call decrDecommissionedNMs, and I think it can cause problems. For example, if we only have the include list specified but not an exclude list, nodes that attempt to join that are not in the include list will go into the DECOMMISSIONED state and increment the corresponding metric, but if we later refresh the nodes then I think we'll set the metric for decommissioned nodes to zero (because the exclude list is still empty) but there could be a non-zero number of decommissioned nodes.
Thank you, Jason Lowe, for your suggestion.
Considering your point, I ran some more tests without my patch.
- 1 Add hosts to the include list only and refresh nodes. The decommissioned node count is 1. If refreshNodes is called again, the count is 0.
- 2 Add hosts to the include list only and refresh nodes. The decommissioned node count is 1. If the RM is restarted, the count is 0. Here, though, the RM cannot recover the old value unless it is stored in ZooKeeper.
- 3 Add hosts to the exclude list only and refresh nodes. The decommissioned node count is 1. Remove the hosts from the exclude list and refresh nodes. Start the NodeManager. The count is -1.
Setting the decommissioned node count in refreshNodes causes the problem. Having RMNodeImpl set it while deactivating a node would be fine, however. For RM restart, setting it in serviceInit works.
Updated the patch to handle the tests mentioned in my previous comment. Please review.
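One way to read the approach described in this comment is that the counter should change only when a node actually changes state, and never be overwritten by a refresh. The sketch below illustrates that idea under stated assumptions; TransitionCounter, its State enum, and its methods are hypothetical names and do not reproduce the committed patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: the counter moves only on node state transitions.
class TransitionCounter {
    enum State { RUNNING, DECOMMISSIONED }

    private final Map<String, State> nodes = new HashMap<>();
    private int decommissioned = 0;

    // Called when a node is deactivated into the DECOMMISSIONED state.
    void decommission(String host) {
        State previous = nodes.put(host, State.DECOMMISSIONED);
        if (previous != State.DECOMMISSIONED) {
            decommissioned++; // count each node at most once
        }
    }

    // Called when a node registers again and becomes RUNNING.
    void rejoin(String host) {
        State previous = nodes.put(host, State.RUNNING);
        if (previous == State.DECOMMISSIONED) {
            decommissioned--; // only undo an increment that really happened
        }
    }

    // A refresh re-reads the host lists but deliberately leaves the counter alone.
    void refreshNodes() { }

    int value() { return decommissioned; }
}
```

Because every decrement is paired with an earlier increment for the same host, the counter in this model cannot go below zero, whatever order refreshes and rejoins arrive in.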
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12670954/YARN-2523.1.patch
against trunk revision ef784a2.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5095//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5095//console
This message is automatically generated.
Thanks for updating the patch. I think it looks good overall.
jianhe could you take a look? This patch undoes a chunk of YARN-1071, and I want to make sure we don't accidentally regress something there.
Patch looks good to me too.
YARN-1071 was done to make sure the decommissioned count is consistent across RM restart, but it is apparently causing more problems.
Not related to this jira, just to improve the previous testDecomissionedNMsMetricsOnRMRestart. Maybe we can add the following before restarting the 2nd RM so that the decommissioned count is reset to 0 before the 2nd RM restart, as I think in the test both RMs share the same ClusterMetrics instance.
// make sure decommissioned count is 0 before 2nd RM restart.
ClusterMetrics.getMetrics().decrDecommisionedNMs();
ClusterMetrics.getMetrics().decrDecommisionedNMs();
Assert.assertEquals(0, ClusterMetrics.getMetrics().getNumDecommisionedNMs());
// restart RM.
MockRM rm2 = new MockRM(conf);
Thanks, Jian He, for looking into the patch. I will take another look at the test and update it.
Not related to this jira, just to improve the previous testDecomissionedNMsMetricsOnRMRestart
Done. Instead of decrementing twice, I stopped rm1 before starting rm2, which destroys ClusterMetrics.
I updated the patch accordingly. Please review.
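The test concern discussed here, two ResourceManagers in one JVM sharing a single metrics object, can be illustrated with a small stand-alone model. Everything in it is hypothetical (SharedMetrics, FakeRM, and their methods); it only assumes that the metrics object is a process-wide singleton and that stopping an RM discards it, as the comment above describes for the real test.

```java
// Hypothetical process-wide singleton, loosely mimicking a shared metrics registry.
class SharedMetrics {
    private static SharedMetrics instance;
    private int decommissioned;

    static synchronized SharedMetrics get() {
        if (instance == null) {
            instance = new SharedMetrics();
        }
        return instance;
    }

    // Discards the singleton so the next get() starts from a fresh object.
    static synchronized void destroy() { instance = null; }

    void incrDecommissioned() { decommissioned++; }
    int decommissioned() { return decommissioned; }
}

// Hypothetical stand-in for a test ResourceManager.
class FakeRM {
    void decommissionNode() { SharedMetrics.get().incrDecommissioned(); }

    // Assumption taken from the comment above: stopping the RM destroys the metrics.
    void stop() { SharedMetrics.destroy(); }
}
```

In this model, if the first FakeRM is not stopped, a second one sees whatever the first left in the counter; calling stop() first gives the second a clean zero without manual decrements.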
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12671175/YARN-2523.2.patch
against trunk revision dff95f7.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5119//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5119//console
This message is automatically generated.
FAILURE: Integrated in Hadoop-trunk-Commit #6113 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6113/)
YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416)
- hadoop-yarn-project/CHANGES.txt
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
Thanks to Rohith for the contribution and to Jian for additional review! I committed this to trunk and branch-2.
FAILURE: Integrated in Hadoop-Yarn-trunk #692 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/692/)
YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416)
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
- hadoop-yarn-project/CHANGES.txt
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
SUCCESS: Integrated in Hadoop-Hdfs-trunk #1883 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1883/)
YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416)
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
- hadoop-yarn-project/CHANGES.txt
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1908 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1908/)
YARN-2523. ResourceManager UI showing negative value for "Decommissioned Nodes" field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416)
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
- hadoop-yarn-project/CHANGES.txt
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
Decommissioned node metrics are set by NodesListManager. If a decommissioned node rejoins, RMNodeImpl#updateMetricsForRejoinedNode() decrements the metric by 1 again, which causes the negative value.
There should be a check in RMNodeImpl#updateMetricsForRejoinedNode() for the decommissioned state.
if (!excludedHosts.contains(hostName)
    && !excludedHosts.contains(NetUtils.normalizeHostName(hostName))) {
  metrics.decrDecommisionedNMs();
}
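The guard proposed above can be exercised in isolation. The following is a sketch only: RejoinGuard and shouldDecrement are hypothetical names, and normalize is a simplified placeholder that merely trims and lower-cases the name. It is not the behaviour of Hadoop's NetUtils.normalizeHostName, which is not reproduced here.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper isolating the proposed "is this host still excluded?" check.
class RejoinGuard {
    // Placeholder normalisation (assumption): trim and lower-case only.
    static String normalize(String hostName) {
        return hostName.trim().toLowerCase(Locale.ROOT);
    }

    // True when neither the raw nor the normalised host name is on the exclude list,
    // i.e. when the rejoining node would be allowed to decrement the metric.
    static boolean shouldDecrement(Set<String> excludedHosts, String hostName) {
        return !excludedHosts.contains(hostName)
            && !excludedHosts.contains(normalize(hostName));
    }
}
```

A caller would decrement the decommissioned metric only when shouldDecrement returns true, so a host that is still excluded keeps its place in the count.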