Hadoop HDFS / HDFS-7725

Incorrect "nodes in service" metrics caused all writes to fail

    Details

      Description

      One of our clusters sometimes couldn't allocate blocks on any DataNode (DN). BlockPlacementPolicyDefault complained with the following message for every DN:

      the node is too busy (load:x > y)
      
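A rough sketch of how a wrong in-service node count could make every node look "too busy". It assumes the placement policy rejects a node whose load exceeds twice the average load per in-service node; that factor, the zero-count guard, and the names `LoadCheckSketch` and `isTooBusy` are illustrative assumptions, not the actual HDFS code.

```java
public class LoadCheckSketch {
  /**
   * Hypothetical stand-in for the load check: a node is rejected when its own
   * load exceeds twice the average load per in-service node (an assumption).
   */
  static boolean isTooBusy(int nodeLoad, int totalLoad, int nodesInService) {
    // Assumed guard: with a zero count the average is treated as 0.
    double avgLoad = nodesInService != 0 ? (double) totalLoad / nodesInService : 0.0;
    double maxLoad = 2.0 * avgLoad;
    return nodeLoad > maxLoad;
  }

  public static void main(String[] args) {
    // Three live nodes with a load of 10 each, so the total load is 30.
    int nodeLoad = 10;
    int totalLoad = 30;
    // Correct count: threshold is 20, the node is accepted.
    System.out.println("count=3:  tooBusy=" + isTooBusy(nodeLoad, totalLoad, 3));
    // Inflated count: threshold drops below 10, the node is rejected.
    System.out.println("count=7:  tooBusy=" + isTooBusy(nodeLoad, totalLoad, 7));
    // Negative count: threshold is negative, every node is rejected.
    System.out.println("count=-1: tooBusy=" + isTooBusy(nodeLoad, totalLoad, -1));
  }
}
```

Under these assumptions, any count that is too high or negative pushes the threshold below the load of evenly loaded nodes, so all of them are rejected at once.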

      It turns out that HeartbeatManager's nodesInService count was computed incorrectly when admins decommissioned or recommissioned dead nodes. Here are two scenarios.

      • Decommissioning a dead node. It turns out HDFS-7374 has fixed this; not sure whether that was intentional. cc Zhe Zhang, Andrew Wang, Aaron T. Myers. Here is the sequence of events without HDFS-7374.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decommission the node. nodesInService == -1
      • However, HDFS-7374 introduces another inconsistency when recommissioning is involved.
        • Cluster has one live node. nodesInService == 1
        • The node becomes dead. nodesInService == 0
        • Decommission the node. nodesInService == 0
        • Recommission the node. nodesInService == 1, even though the node is still dead
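The two sequences above can be replayed with a toy counter. This is only a model of the bookkeeping described in this issue: `NodesInServiceSketch`, `Node`, and the `skipStartForDeadNode` flag are hypothetical names, and the real HeartbeatManager tracks more state than a single integer.

```java
import java.util.Arrays;

public class NodesInServiceSketch {
  /** Simplified stand-in for a DataNode's state; not the real HDFS class. */
  static final class Node {
    boolean alive = true;
  }

  /**
   * Replays: a live node registers, dies, is decommissioned, is recommissioned.
   * Returns the counter value after each of the four steps.
   *
   * @param skipStartForDeadNode models the behavior attributed to HDFS-7374,
   *        where decommissioning a dead node no longer touches the counter
   */
  static int[] run(boolean skipStartForDeadNode) {
    Node node = new Node();
    int[] trace = new int[4];
    int nodesInService = 0;

    nodesInService++;               // a live node registers
    trace[0] = nodesInService;

    node.alive = false;             // the node is declared dead
    nodesInService--;
    trace[1] = nodesInService;

    // Start decommission: decrement unless the dead-node shortcut applies.
    boolean shortcut = skipStartForDeadNode && !node.alive;
    if (!shortcut) {
      nodesInService--;
    }
    trace[2] = nodesInService;

    // Stop decommission (recommission): always increments in this model.
    nodesInService++;
    trace[3] = nodesInService;
    return trace;
  }

  public static void main(String[] args) {
    System.out.println("without HDFS-7374: " + Arrays.toString(run(false)));
    System.out.println("with HDFS-7374:    " + Arrays.toString(run(true)));
  }
}
```

In this model the first run yields 1, 0, -1, 0 and the second yields 1, 0, 0, 1. Both diverge from the expected value of 0 for a node that is still dead, just at different steps.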
      1. HDFS-7725.patch
        3 kB
        Ming Ma
      2. HDFS-7725-2.patch
        3 kB
        Ming Ma
      3. HDFS-7725-3.patch
        7 kB
        Ming Ma

          Activity

          mingma Ming Ma added a comment -

          The patch makes sure the nodesInService count won't be updated when a dead node is recommissioned.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12695976/HDFS-7725.patch
          against trunk revision f33c99b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.balancer.TestBalancer

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9396//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9396//console

          This message is automatically generated.

          zhz Zhe Zhang added a comment -

          However, HDFS-7374 introduces another inconsistency when recomm is involved.

          The second sequence in the JIRA description looks correct?

          By reading the HDFS-7374 patch I see the potential issue is that HeartbeatManager is bypassed when decommissioning a dead node:

          +    if (!node.isDecommissionInProgress()) {
          +      if (!node.isAlive) {
          +        LOG.info("Dead node " + node + " is decommissioned immediately.");
          +        node.setDecommissioned();
          +      } else if (!node.isDecommissioned()) {
          +        for (DatanodeStorageInfo storage : node.getStorageInfos()) {
          +          LOG.info("Start Decommissioning " + node + " " + storage
          +              + " with " + storage.numBlocks() + " blocks");
          +        }
          +        heartbeatManager.startDecommission(node);
          

          It seems DatanodeManager should still route the call to HeartbeatManager, and HeartbeatManager#startDecommission should handle the dead node logic.

          Maybe we should wait for HDFS-7411 to be committed and revisit the change?

          mingma Ming Ma added a comment -

          Thanks, Zhe Zhang. Yes, the logic of "don't modify nn stats if the node is dead" can be moved to HeartbeatManager.

          For the trunk version, we can wait for HDFS-7411. If HDFS-7411 isn't going to be in branch-2 anytime soon, then we will need some quick fix. Overall, can anyone find any correctness issue with the current patch?

          zhz Zhe Zhang added a comment -

          Thanks Ming. The patch looks good to me. It basically bypasses HeartbeatManager in stop decomm since start decomm is already bypassing it.

          mingma Ming Ma added a comment -

          Thanks, Zhe. Here is the rebase of the patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708277/HDFS-7725-2.patch
          against trunk revision 1a495fb.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestNameNodeResourceChecker
          org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
          org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

          The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.web.TestWebHDFSAcl

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10121//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10121//console

          This message is automatically generated.

          zhz Zhe Zhang added a comment -

          Thanks Ming. I didn't manually verify the failed tests, but the patch looks good to me. Non-binding +1.

          andrew.wang Andrew Wang added a comment -

          Thanks for working on this Ming. Nice find, patch looks basically good. Just a few comments:

          I agree with Zhe's original review comment above; I think we should move the liveness check for both start and stop into the heartbeat manager. This way the caller doesn't have to worry about it.

          It would also be good to add "alive" or "dead" to the first log in stopDecommission too, just to give admins some more information about node state.

          Do we also need assert checks in the test after recommissioning the dead node?

          mingma Ming Ma added a comment -

          Thanks, Andrew and Zhe. The latest patch moves the liveness check to HeartbeatManager with some minor changes to DecommissionManager's startDecommission. The patch also has the other suggestions you mentioned.
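A minimal sketch of what keeping the liveness check inside the heartbeat-tracking component might look like, so that callers can start or stop decommission without checking node state themselves. `LivenessCheckSketch`, `HeartbeatTracker`, and `Node` are hypothetical stand-ins, not the classes touched by the patch.

```java
public class LivenessCheckSketch {
  /** Simplified stand-in for a DataNode descriptor; not the real HDFS class. */
  static final class Node {
    boolean alive = true;
    boolean decommissioned = false;
  }

  /** Hypothetical stand-in for the component that owns the in-service count. */
  static final class HeartbeatTracker {
    private int nodesInService = 0;

    synchronized void register(Node node) {
      node.alive = true;
      if (!node.decommissioned) {
        nodesInService++;
      }
    }

    synchronized void markDead(Node node) {
      if (node.alive && !node.decommissioned) {
        nodesInService--;
      }
      node.alive = false;
    }

    synchronized void startDecommission(Node node) {
      if (node.decommissioned) {
        return;
      }
      // A dead node already left the count when it died, so only a live
      // node is subtracted here.
      if (node.alive) {
        nodesInService--;
      }
      node.decommissioned = true;
    }

    synchronized void stopDecommission(Node node) {
      if (!node.decommissioned) {
        return;
      }
      // Only a live node rejoins the in-service count.
      if (node.alive) {
        nodesInService++;
      }
      node.decommissioned = false;
    }

    synchronized int getNodesInService() {
      return nodesInService;
    }
  }

  /** Returns the count after decommissioning and after recommissioning. */
  static int[] roundTrip(boolean killNodeFirst) {
    HeartbeatTracker tracker = new HeartbeatTracker();
    Node node = new Node();
    tracker.register(node);
    if (killNodeFirst) {
      tracker.markDead(node);
    }
    tracker.startDecommission(node);
    int afterStart = tracker.getNodesInService();
    tracker.stopDecommission(node);
    return new int[] {afterStart, tracker.getNodesInService()};
  }

  static int[] deadNodeRoundTrip() {
    return roundTrip(true);
  }

  static int[] liveNodeRoundTrip() {
    return roundTrip(false);
  }

  public static void main(String[] args) {
    System.out.println(java.util.Arrays.toString(deadNodeRoundTrip()));
    System.out.println(java.util.Arrays.toString(liveNodeRoundTrip()));
  }
}
```

With the check in one place, a dead node stays at 0 through both calls, while a live node goes to 0 on decommission and back to 1 on recommission.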

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12709323/HDFS-7725-3.patch
          against trunk revision 023133c.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestReplication

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10177//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10177//console

          This message is automatically generated.

          kihwal Kihwal Lee added a comment -

          Will it also fix HDFS-5114?

          andrew.wang Andrew Wang added a comment -

          Hi Ming, the latest looks good to me. +1 will commit shortly.

          Kihwal, I looked quickly and I don't think this will change getMaxNodesPerRack behavior. This fixes up the live/dead node counts kept in the heartbeat manager, and I don't think those counts are used by getMaxNodesPerRack.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7539 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7539/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          andrew.wang Andrew Wang added a comment -

          Committed to trunk and branch-2. Thanks for the patch, Ming, and thanks to Zhe for reviewing.

          The failed test also failed twice for me without the patch applied, so it seems independently broken.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #158 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/158/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2090 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2090/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #149 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/149/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #892 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/892/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #159 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/159/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2108 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2108/)
          HDFS-7725. Incorrect 'nodes in service' metrics caused all writes to fail. Contributed by Ming Ma. (wang: rev 6af0d74a75f0f58d5e92e2e91e87735b9a62bb12)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestNamenodeCapacityReport.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DecommissionManager.java
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
          mingma Ming Ma added a comment -

          Thanks Zhe and Andrew.

          kshukla Kuhu Shukla added a comment -

          This issue manifested on 2.6 as it did prior to HDFS-7374 (with the count going to -1), and partially on 2.7 during recommissioning.
          The unit test from this patch (testXceiverCount) fails on 2.7 at the recommission assert:

          // Verify recommission of dead node won't impact nodesInService metrics.
          dnm.stopDecommission(dnd);
          assertEquals(expectedInServiceNodes, getNumDNInService(namesystem));
          

          It would be nice to have this patch ported to 2.7. Ming Ma, any suggestions/comments would be helpful.

          kihwal Kihwal Lee added a comment -

          Cherry-picked it to branch-2.7.


            People

            • Assignee:
              mingma Ming Ma
              Reporter:
              mingma Ming Ma
            • Votes:
              0
              Watchers:
              9
