Hadoop HDFS / HDFS-6180

dead node count / listing is very broken in JMX and old GUI

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.5.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      After bringing up a 578-node cluster with 13 dead nodes, 0 dead nodes were reported in the new GUI, though they showed up properly in the datanodes tab. Some nodes are also being double-reported in the deadnode and inservice sections (22 show up dead, 565 show up alive, 9 nodes duplicated).

      From /jmx (confirmed that it's the same in jconsole):

      {
          "name" : "Hadoop:service=NameNode,name=FSNamesystemState",
          "modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
          "CapacityTotal" : 5477748687372288,
          "CapacityUsed" : 24825720407,
          "CapacityRemaining" : 5477723861651881,
          "TotalLoad" : 565,
          "SnapshotStats" : "{\"SnapshottableDirectories\":0,\"Snapshots\":0}",
          "BlocksTotal" : 21065,
          "MaxObjects" : 0,
          "FilesTotal" : 25454,
          "PendingReplicationBlocks" : 0,
          "UnderReplicatedBlocks" : 0,
          "ScheduledReplicationBlocks" : 0,
          "FSState" : "Operational",
          "NumLiveDataNodes" : 565,
          "NumDeadDataNodes" : 0,
          "NumDecomLiveDataNodes" : 0,
          "NumDecomDeadDataNodes" : 0,
          "NumDecommissioningDataNodes" : 0,
          "NumStaleDataNodes" : 1
        },
      

      I'm not going to include deadnode/livenodes because the list is huge, but I've confirmed there are 9 nodes showing up in both deadnodes and livenodes.

      1. dn.log
        38 kB
        Travis Thompson
      2. HDFS-6180.000.patch
        35 kB
        Haohui Mai
      3. HDFS-6180.001.patch
        40 kB
        Haohui Mai
      4. HDFS-6180.002.patch
        40 kB
        Haohui Mai
      5. HDFS-6180.003.patch
        43 kB
        Haohui Mai
      6. HDFS-6180.004.patch
        43 kB
        Haohui Mai


          Activity

          Travis Thompson added a comment -

          Also, if a live node is shut down, the dead node counter will go to 1.

          Haohui Mai added a comment - - edited

          Just took a quick look, and it seems the issue is inside the datanode manager. Can you post some logs about the 9 duplicated DNs?

          Travis Thompson added a comment -

          DN log of a node that is duplicated. Also note, if I start just the Namenode, it reports 0 dead nodes even though none of the datanodes are running.

          Travis Thompson added a comment -

          Also notice that duplication starts around 200 nodes. Every time the cluster is started, it always reports 22 dead nodes (9 duplicated), but they're different nodes every time.

          Travis Thompson added a comment -

          Looks like I lied; it's the same 9 nodes that are getting duplicated every time.

          Haohui Mai added a comment -

          Sorry, I meant the NN logs. It would be interesting to see what happened in the DatanodeManager in the NN.

          Travis Thompson added a comment -

          After looking at it for a while, it seems like a bigger issue than initially thought. The last digit of the last octet is being truncated.

          Example:

          LIVE NODE                        LIVE & DEAD NODES
          datanode5464 (xxx.xxx.138.244)   datanode5391 (xxx.xxx.138.24)
          datanode5486 (xxx.xxx.138.222)   datanode5392 (xxx.xxx.138.22)
          datanode5477 (xxx.xxx.138.233)   datanode5393 (xxx.xxx.138.23)
          datanode5601 (xxx.xxx.139.244)   datanode5526 (xxx.xxx.139.24)

          Starting datanode5464 will force datanode5391 to become both live and dead if it's running. The order in which they pair up doesn't matter; the same effect happens in the end.

          Haohui Mai added a comment -

          Travis Thompson, do you think it is related to the following link? https://issues.apache.org/jira/browse/HDFS-3224?focusedCommentId=13472817&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13472817
          Travis Thompson added a comment -

          I don't think so; these nodes were added with fresh installs, their IPs have never changed, and they've never been part of another HDFS instance.

          Haohui Mai added a comment -

          Travis Thompson, can you turn on debug logging and post the logs?

          Tsz Wo Nicholas Sze added a comment -

          Mohammad, I understand that you might be interested in fixing this. Thank you! Haohui actually started working on this earlier but he forgot to assign this to himself. He is going to post a patch soon. Assigning to him.

          Haohui Mai added a comment -

          The duplicated entry is added by the following code in the DatanodeManager

              if (listDeadNodes) {
                final EntrySet includedNodes = hostFileManager.getIncludes();
                final EntrySet excludedNodes = hostFileManager.getExcludes();
                for (Entry entry : includedNodes) {
                  if ((foundNodes.find(entry) == null) &&
                      (excludedNodes.find(entry) == null)) {
                    // ... the included entry is reported as a dead node

          Note that entry does not contain the port since it comes from the include file, but all entries in foundNodes do. When passed an entry without a port, the find function should be able to match it with the one that has port information.

          Internally find is implemented on top of a TreeMap, which uses ip or ip:port as the key. Since in lexicographic order the entry with a port comes after the one without a port, find implements the port-matching rule by checking whether the next entry has the same IP. The problem is that this heuristic is unreliable; it returns buggy results for entries like the following:

          172.18.146.3:1019
          172.18.146.30:1019
          

          Calling find(172.18.146.3) checks 172.18.146.30:1019 instead of 172.18.146.3:1019, resulting in the bug.
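
          A minimal sketch (not the actual HostFileManager code) of this ordering problem; the class name and stored values are illustrative:

              import java.util.TreeMap;

              public class FindHeuristicDemo {
                public static void main(String[] args) {
                  TreeMap<String, String> map = new TreeMap<>();
                  map.put("172.18.146.3:1019", "dnA");
                  map.put("172.18.146.30:1019", "dnB");

                  // The heuristic inspects the entry immediately after the
                  // port-less key, but '0' (0x30) sorts before ':' (0x3A), so
                  // the neighbor of "172.18.146.3" is "172.18.146.30:1019",
                  // not "172.18.146.3:1019".
                  System.out.println(map.ceilingKey("172.18.146.3"));
                  // Prints 172.18.146.30:1019 -- the wrong node, so the live
                  // DN is never matched and also gets listed as dead.
                }
              }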

          The bug can be quite confusing from the end user's perspective, and I'd like to move forward as quickly as possible.

          Mohammad Kamrul Islam, are you working on it? If not I can work on a patch later today.

          Mohammad Kamrul Islam added a comment -

          Yes. We (Anil Alluri and I) also found the same code to be buggy.
          We are working on a solution.
          I have a couple of possible options that I will propose soon for comments.

          Mohammad Kamrul Islam added a comment -

          This is the right Apache way!
          I'm fine with Haohui Mai working on this.
          I will stop working on it.

          Looking forward to the patch!

          Travis Thompson added a comment -

          On a side note, the JMX mbean for deadNodeCount is also broken; I can open another JIRA for this issue if desired.

          Travis Thompson added a comment -

          Sorry, meant NumDeadDataNodes

          Haohui Mai added a comment -

          Raising the priority because it directly causes confusion for operators.

          Haohui Mai added a comment -

          The v0 patch makes the following changes:

          1. It canonicalizes all entries in the include / exclude lists into IP addresses. It ignores entries that DNS fails to resolve. This is okay because by default the NN refuses to register DNs whose IPs and hostnames it fails to resolve (see dfs.namenode.datanode.registration.ip-hostname-check for more details). The patch maintains the old behavior of performing DNS lookups only when loading the lists, so that the DNS overhead is minimized.
          2. The patch continues to assume that an entry without a port in the list matches all endpoints (i.e., addr:port) whose IP matches the entry's. This defines a partial order between endpoints that have the same IP: if two endpoints A and B have the same IP, then A <= B iff A.port == B.port or B.port == 0. That way, checking the include and exclude lists becomes finding the meet or join elements in the lattice (a sketch of the wildcard rule follows).
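
          A minimal sketch of that wildcard-matching rule, assuming port 0 encodes "no port given"; the class and method names are illustrative, not the patch's actual API:

              import java.net.InetSocketAddress;

              public class MatchRule {
                // An include/exclude entry matches an endpoint when the IPs are
                // equal and the entry either names the same port or no port at all.
                static boolean matches(InetSocketAddress entry, InetSocketAddress endpoint) {
                  if (!entry.getAddress().equals(endpoint.getAddress())) {
                    return false;               // different IPs never match
                  }
                  return entry.getPort() == 0   // port 0 acts as the wildcard
                      || entry.getPort() == endpoint.getPort();
                }
              }
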
          Haohui Mai added a comment -

          > On a side note, the JMX mbean for deadNodeCount is also broken; I can open another JIRA for this issue if desired.

          Travis Thompson, please go ahead. If this is indeed a problem, we might need to put it into 2.4 as well since it breaks the front page of the UI.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638622/HDFS-6180.000.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
          org.apache.hadoop.hdfs.TestDatanodeRegistration
          org.apache.hadoop.hdfs.TestDecommission
          org.apache.hadoop.hdfs.server.namenode.TestCheckpoint
          org.apache.hadoop.hdfs.server.namenode.TestStartup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6584//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6584//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638737/HDFS-6180.001.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestStartup

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6589//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6589//console

          This message is automatically generated.

          Haohui Mai added a comment -

          The changes turn out to be a lot bigger than I anticipated. It might be risky to put them in at the very last moment. Moving this to a blocker for 2.5.0.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638769/HDFS-6180.002.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6591//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6591//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -
          • In getDatanodeListForReport(..), the excludedNodes check should remain inside listDeadNodes. Adding nodes to the excluded list means starting decommissioning.
          • Using a lattice to model the problem is correct but unnecessarily general. The problem is only a two-level lattice; it is simply a wildcard problem. I suggest renaming the containsMeet and containsJoin methods to isMatched and matches respectively, so that a wildcard pattern matches a host.
          • parseEntry should pass the file type and use it in the log messages. The log messages should be WARN instead of INFO.
          • Instead of having an entries() method, it is better to have HostSet implement Iterable since it is already a set (see the sketch after this list). The code will be shorter. We also need a size() for the unit tests.
          • Should "unresolvedAddressFromDatanodeID" be "resolvedAddressFromDatanodeID"?
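
          A minimal sketch of that suggestion, with illustrative names (not the actual patch code):

              import java.net.InetSocketAddress;
              import java.util.HashSet;
              import java.util.Iterator;
              import java.util.Set;

              // HostSet is conceptually a set of endpoints, so iterate over it
              // directly instead of exposing an entries() accessor; size() is
              // there for assertions in the unit tests.
              class HostSet implements Iterable<InetSocketAddress> {
                private final Set<InetSocketAddress> endpoints = new HashSet<>();

                void add(InetSocketAddress addr) { endpoints.add(addr); }

                @Override
                public Iterator<InetSocketAddress> iterator() {
                  return endpoints.iterator();
                }

                int size() { return endpoints.size(); }
              }
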
          Tsz Wo Nicholas Sze added a comment -

          > ... I'm fine with Haohui Mai working on this. ...

          Mohammad Kamrul Islam, thanks. Let us know if you want to work on other issues. We can help reviewing them.

          Haohui Mai added a comment - - edited

          The v4 patch addresses Tsz Wo Nicholas Sze's comments.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12638926/HDFS-6180.004.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6596//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6596//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good. Thanks, Haohui.

          Haohui Mai added a comment -

          I've committed the patch to trunk and branch-2. Thanks Nicholas for the review.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5466 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5466/)
          HDFS-6180. Dead node count / listing is very broken in JMX and old GUI. Contributed by Haohui Mai. (wheat9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1585625)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
          Haohui Mai added a comment -

          Sorry, uploaded the wrong patch.

          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #533 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/533/)
          HDFS-6180. Dead node count / listing is very broken in JMX and old GUI. Contributed by Haohui Mai. (wheat9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1585625)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1751 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1751/)
          HDFS-6180. Dead node count / listing is very broken in JMX and old GUI. Contributed by Haohui Mai. (wheat9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1585625)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1725 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1725/)
          HDFS-6180. Dead node count / listing is very broken in JMX and old GUI. Contributed by Haohui Mai. (wheat9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1585625)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/HostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDatanodeRegistration.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHostFileManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/NNThroughputBenchmark.java
          Mohammad Kamrul Islam added a comment -

          > The changes turn out to be a lot bigger than I anticipated. It might be risky to put them in at the very last moment. Moving this to a blocker for 2.5.0.

          What about release 2.4.1? It could be coming soon.

          Colin Patrick McCabe added a comment -

          Thanks for fixing this, Haohui.

          -  @Test(timeout=360000)
          -  public void testIncludeByRegistrationName() throws IOException,
          -      InterruptedException {
          -    Configuration hdfsConf = new Configuration(conf);
          -    final String registrationName = "--registration-name--";
          -    final String nonExistentDn = "127.0.0.40";
          -    hdfsConf.set(DFSConfigKeys.DFS_DATANODE_HOST_NAME_KEY, registrationName);
          

          This test got removed, but I don't see any corresponding test for registration names (DFS_DATANODE_HOST_NAME_KEY) added. Was this an oversight?

          Haohui Mai added a comment -

          The reason I removed this test is that --registration-name-- is not a valid DNS name. Things like GetBlockLocations, WebHDFS, and the DatanodeManager assume that the host name is a valid DNS name. For example, WebHDFS uses the hostname to generate the URI pointing to the DN, so the test might be invalid in the first place. What we probably should have done is make this assumption explicit.

          To test the case where the user puts a DNS name in the include list, the code in testIncludeExcludeLists adds localhost (which is usually the only valid DNS name in a test environment) to the include list and verifies that the DatanodeManager recognizes it.

          Colin Patrick McCabe added a comment -

          Hi Haohui,

          See the discussion at HDFS-5237 for some background. Basically, there is this configuration called dfs.datanode.hostname which specifies a datanode's "registration name." This may be different from the first hostname you get by doing a reverse lookup on the DataNode's IP address.

          That's why DatanodeID has three fields instead of two:

          public class DatanodeID implements Comparable<DatanodeID> {
            public static final DatanodeID[] EMPTY_ARRAY = {};
          
            private String ipAddr;     // IP address
            private String hostName;   // hostname claimed by datanode
            private String peerHostName; // hostname from the actual connection
          

          The field named hostName is actually not the hostname, but the "registration name," which is what the datanode was configured to say its name was, via dfs.datanode.hostname. peerHostName is the hostname you get by doing a reverse DNS lookup on ipAddr.

          Part of the use for registration names is in unit tests, where creating a new hostname is not practical. Another use is in dealing with multi-homing setups.

          > The reason I removed this test is that --registration-name-- is not a valid DNS name.

          The point of the test was to ensure that we could specify registration names in the exclude and include files and have them work. We should make sure that this functionality is still working.

          This is a real problem for some people. For example, consider if you have an AWS instance with an external and internal hostname. You might configure your DNs to use dn1.internal.host.name (or whatever) rather than dn1.external.host.name. This avoids the issue where the NN does a reverse DNS lookup on the IP, and comes up with dn1.external.host.name, and starts sending traffic over the wrong interface. This sort of thing is very important on AWS, because people are actually charged money for sending traffic to the external hostname (rather than internal).
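
          For instance, a multi-homed DN could be pinned to its internal name in hdfs-site.xml; the hostname below is the illustrative example from this comment, not a real host:

              <property>
                <name>dfs.datanode.hostname</name>
                <value>dn1.internal.host.name</value>
              </property>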

          If you like, the test could be configured to use a valid but non-default loopback IP (such as 127.0.5.1) rather than an invalid string. But in any case, I think we need a JIRA to restore it. Will file one shortly.

          Colin Patrick McCabe added a comment -

          I filed a JIRA for re-adding testIncludeByRegistrationName so we keep this test coverage. I will avoid using a registration name which is not a valid DNS name this time.

          Haohui Mai added a comment -

          Thanks for fixing this, Colin Patrick McCabe.


            People

            • Assignee: Haohui Mai
            • Reporter: Travis Thompson
            • Votes: 0
            • Watchers: 16