Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Summary of changes to the decommissioning process:
      1. After nodes are decommissioned, they are not shut down. Decommissioned nodes are not used for writes. For reads, decommissioned nodes are given as the last location to read from.
      2. The numbers of live and dead decommissioned nodes are displayed in the namenode web UI.
      3. Decommissioned nodes' free capacity is not counted towards the cluster's free capacity.

      Description

      The current decommission mechanism, driven using the exclude file, has several issues. This issue proposes changes to the mechanism for better manageability. See the proposal in the comments for more details.

      Attachments

      1. HDFS-1547.4.patch
        52 kB
        Suresh Srinivas
      2. HDFS-1547.3.patch
        52 kB
        Suresh Srinivas
      3. HDFS-1547.2.patch
        51 kB
        Suresh Srinivas
      4. show-stats-broken.txt
        26 kB
        Todd Lipcon
      5. HDFS-1547.1.patch
        25 kB
        Suresh Srinivas
      6. HDFS-1547.patch
        25 kB
        Suresh Srinivas

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Suresh Srinivas added a comment -

          I have created HDFS-1586 to reflect this. Will also add Pig to the list.

          Jakob Homan added a comment -

          Pig also makes extensive use of the MiniDFSCluster...

          Suresh Srinivas added a comment -

          We should annotate MiniDFSCluster as @InterfaceAudience.LimitedPrivate({"HDFS", "MapReduce"}) to reflect this dependency.
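
          For illustration, a minimal sketch of what that annotation could look like on the class (assuming the standard Hadoop classification annotations; this is not the actual patch):

          import org.apache.hadoop.classification.InterfaceAudience;

          // Sketch only: marks MiniDFSCluster as intended for the named projects rather than fully public.
          @InterfaceAudience.LimitedPrivate({"HDFS", "MapReduce"})
          public class MiniDFSCluster {
            // ... existing test-cluster implementation ...
          }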

          Todd Lipcon added a comment -

          Looks like this commit broke the trunk MR build. It changed the parameters of startDataNodes, which is used by some MR tests. See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/564/console for the build errors.

          Suresh Srinivas added a comment -

          Hudson is sick. Very little changed between the last test report and the new patch, and only in code related to TestDecommission.

          The failed test, TestStorageRestore, is a known failure.

          I am going to commit this patch.

          Suresh Srinivas added a comment -

          > thanks for taking a look at all my annoying comments
          Not a problem. Thanks for the review, Todd.

          Todd Lipcon added a comment -

          +1, thanks for taking a look at all my annoying comments

          Suresh Srinivas added a comment -

          New patch to address comments from Todd.

          Suresh Srinivas added a comment -

          Thanks for the review. I took care of the comments, other than the ones below:

          > I thought this patch wanted to make decommissioning nodes sort lower for block locations also?
          If you follow the discussion, this patch no longer shuts down datanodes when they are decommissioned, which is the changed behavior. To prevent a decommissioned datanode from being used (which was the reason the datanode used to be shut down), it is listed as the last location. Decommissioning nodes continue to be handled the same way as today.

          > # in MiniDFSCluster.setupDatanodeAddress, you can use conf.getTrimmed instead of manually calling trim()
          I need getTrimmed(name, defaultValue). I cannot pass a default value with the existing API. I do not think calling trim() outside is such a big deal.

          > Maybe a better name would be DECOMMISSIONED_AT_END_COMPARATOR or something?
          I think the name DECOM_COMPARATOR seems sufficient. Wouldn't adding javadoc that explains the intent, instead of a long name that describes the functionality, suffice?

          > the getFreeSocketPort() trick seems like it's not likely to work repeatably - isn't there a high likelihood that two datanodes would pick the same free port, since you don't track "claimed" ports anywhere? Or that one of these ports might later get claimed by one of the many other daemons running on ephemeral ports in a mini cluster?

          The code ensures the port is free and starts one datanode at a time. Each datanode is set up with its own port and will use it. The port is tracked as part of the datanode address config set up for that datanode. I am not sure how another datanode could take the free port.
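
          As a side note, a minimal sketch of how a free-port helper of this kind is typically written (hypothetical code, not necessarily the actual MiniDFSCluster.getFreeSocketPort()):

          import java.io.IOException;
          import java.net.ServerSocket;

          // Bind to port 0 so the OS picks a currently free port, record it, then release it
          // so the datanode can be configured to bind to that port next.
          static int getFreeSocketPort() throws IOException {
            ServerSocket socket = new ServerSocket(0);
            try {
              return socket.getLocalPort();
            } finally {
              socket.close();
            }
          }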

          > when the MiniDFS cluster is constructed, shouldn't you clear out the dfs.hosts file? Otherwise you're relying on the test case itself to clean itself up between runs (which differs from the rest of minidfs's storage handling)
          Having the tests clean this up is the approach taken. MiniDFSCluster storage is retained or cleaned up based on the format flag; this is a configuration file outside of that kind of state. Most tests do not care about this configuration, but the ones that do work with it and do the appropriate deletions.

          > is there a test case anywhere that covers what happens when a decom node connects to the namenode? eg after a NN restart when a node is in both include and decom?
          There is confusion today where people think there is a relationship between include and exclude. They are independent and unrelated. The include file defines the datanodes that are part of the cluster. The exclude file lists the datanodes that need to be decommissioned.

          testDecommission() tests the case you are mentioning. It restarts the cluster to ensure a decommissioned node can connect to the NN.

          Todd Lipcon added a comment -
          • DECOM_COMPARATOR should probably have some javadoc; it's not obvious from the name what it does. (Why is there no sort order distinction for decommission_ing_ nodes, but just decommissioned ones? I thought this patch wanted to make decommissioning nodes sort lower for block locations also?)

          Maybe a better name would be DECOMMISSIONED_AT_END_COMPARATOR or something? It's a bit long, but it's not often used and it's clearer about what it does.

          • spurious whitespace change on setDatanodeDead() function and javadoc for handleHeartbeat
          • in generateNodesList, the word decommissioned is misspelled at one point with too few 's'es
          • in MiniDFSCluster.setupDatanodeAddress, you can use conf.getTrimmed instead of manually calling trim()
          • the getFreeSocketPort() trick seems like it's not likely to work repeatably - isn't there a high likelihood that two datanodes would pick the same free port, since you don't track "claimed" ports anywhere? Or that one of these ports might later get claimed by one of the many other daemons running on ephemeral ports in a mini cluster?
          • when the MiniDFS cluster is constructed, shouldn't you clear out the dfs.hosts file? Otherwise you're relying on the test case itself to clean itself up between runs (which differs from the rest of minidfs's storage handling)
          • in the test case verifyStats method, it seems we should sleep for at least some number of millis, or write a function which will wait for heartbeats (eg like TestDatanodeRegistration.java:62; a sketch of such a helper follows this list). Otherwise the 10 quick iterations might run before any heartbeats actually came in.
          • is there a test case anywhere that covers what happens when a decom node connects to the namenode? eg after a NN restart when a node is in both include and decom?
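
          For the wait-for-heartbeats point above, here is a minimal sketch of such a helper (hypothetical names; not the code in TestDatanodeRegistration):

          // Poll a supplier of the live-datanode count until it reaches the expected value
          // or a timeout expires, instead of looping a fixed number of quick iterations.
          static void waitForLiveNodes(java.util.concurrent.Callable<Integer> liveNodeCount,
                                       int expected, long timeoutMs) throws Exception {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (liveNodeCount.call() < expected) {
              if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("Timed out waiting for " + expected + " heartbeats");
              }
              Thread.sleep(100); // give datanodes time to heartbeat
            }
          }
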
          Tsz Wo Nicholas Sze added a comment -

          +1 the new patch looks good.

          Todd, nice catch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468302/HDFS-1547.2.patch
          against trunk revision 1058402.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 13 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/107//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/107//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/107//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          Additional change to test cluster stats when decommissioning of a datanode is stopped.

          Suresh Srinivas added a comment -

          Todd and Nicholas, can you please review the new patch?

          Suresh Srinivas added a comment -

          Attached patch fixes the following:

          1. Cluster stats updates had a bug. Thanks, Todd, for pointing it out. I have fixed it and added tests to check this.
          2. A node removed from the include file was not being shut down. I added this functionality back and added a test for it.
          3. Decommissioning/decommissioned nodes update the stats as given below:
            • Node used capacity is counted towards cluster used capacity.
            • Node capacity is not counted towards the cluster capacity; only the node's used capacity is counted towards cluster capacity.
            • Node capacity remaining is not counted towards cluster capacity remaining.
          4. Cleaned up TestDecommission and moved it to JUnit 4.
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468182/show-stats-broken.txt
          against trunk revision 1057414.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/99//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/99//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/99//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          The suggestion you provided for including the transceiver count for *decommissioned* nodes will not work. Given that decommissioned nodes are provided as the last location for reads, they will very rarely be used for reading and hence will not affect the average load. However, I do not mind adding it, even if it only marginally improves the situation.

          You are correct, the stats update in the current patch has a bug. I will fix it and add a unit test to cover it.

          Todd Lipcon added a comment -

          I think my concern can be alleviated by simply moving the "totalLoad += node.getXceiverCount" change outside the condition for decommissioning.

          On a separate note in the same function - the modified accounting isn't correct. Here's what happens:

          • we have a DN in normal state, so it's represented in the stats
          • we call refreshNodes to put it in decom state
          • next heartbeat:
            • we call updateStats(node, false) to remove it -> no longer represented in stats
            • we call updateStats(node, true) to re-add it, but it's in decom state, so it doesn't get added (good)
          • next heartbeat:
            • we call updateStats(node, false) again, and stats get decremented again.
            • on updateStats(node, true) it doesn't increment because it's in DECOM state

          We don't appear to have any test cases that look at this, but I just added some logging (see attached patch) and can see that as soon as the node enters the DECOMMISSIONING state, the cluster stats drop into the negatives and keep dropping.
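
          To make the double-decrement concrete, here is a minimal, self-contained sketch of the accounting pattern described above (hypothetical names, not the actual FSNamesystem code):

          /** Sketch of the stats bug: the node's contribution is always subtracted on a
           *  heartbeat but only re-added for non-decommissioning nodes, so totals go negative. */
          class StatsSketch {
            static class Node {
              long capacity = 100, remaining = 60;
              int xceiverCount = 2;
              boolean decommissioning = false;
            }

            long capacityTotal, capacityRemaining;
            int totalLoad;

            void updateStats(Node n, boolean add) {
              int sign = add ? 1 : -1;
              if (!add || !n.decommissioning) {   // add path skipped while decommissioning
                capacityTotal += sign * n.capacity;
                capacityRemaining += sign * n.remaining;
                totalLoad += sign * n.xceiverCount;
              }
            }

            void heartbeat(Node n) {
              updateStats(n, false); // subtract previous contribution
              updateStats(n, true);  // re-add it (skipped while decommissioning -> net decrement)
            }

            public static void main(String[] args) {
              StatsSketch s = new StatsSketch();
              Node n = new Node();
              s.updateStats(n, true);   // node registers: 100 / 60 / 2
              n.decommissioning = true; // refreshNodes marks it for decommission
              s.heartbeat(n);           // totals drop to 0
              s.heartbeat(n);           // totals go negative and keep dropping
              System.out.println(s.capacityTotal + " " + s.capacityRemaining + " " + s.totalLoad);
            }
          }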

          Suresh Srinivas added a comment -

          > But will the decommissioning itself actually be able to proceed?
          From my previous comment:
          "It reduces cluster available free storage for writes. Writes could simply fail because of no free storage. The decommissioning may not complete, because of lack of free storage."

          I am not sure what you mean by the deadlock situation. Removing nodes from exclude stops decommissioning and the cluster should get back to normal state.

          > will the NN be able to pick new locations for the blocks previously stored on the decomissioning nodes
          I assume you mean decommissioned nodes (the patch introduces no change to decommissioning nodes and how they are handled). Decommissioned replicas are chosen as the last location for reads. If all the replicas of a block are decommissioned, then a decommissioned node will be used for reading it.

          Todd Lipcon added a comment -

          I agree the cluster probably won't be in use for such a case. But will the decommissioning itself actually be able to proceed? i.e. will the NN be able to pick new locations for the blocks previously stored on the decommissioning nodes? Or will we end up in a sort of deadlock situation.

          Tsz Wo Nicholas Sze added a comment -
          • 10 nodes, each of which has 2 transceivers active (total load = 20)
          • decommission 6 of them

          Hi Todd, this is a good observation, but I think we should never decommission 60% of the nodes at once, or the cluster should not be in use while 60% of the nodes are being decommissioned. Just as we don't expect the cluster to still work fine when many nodes, say > 50%, are down. So I doubt this is a valid use case.

          Suresh Srinivas added a comment -

          Just to clarify my previous comment:
          > Note this problem also exists when decommissioning is in progress for large number of nodes.
          This is the current behavior for decommissioning nodes (without the patch) and has not caused real issues. Decommissioning more than half the nodes and expecting the cluster to run smoothly does not seem like the right thing to do, because of the several issues I have enumerated above.

          Suresh Srinivas added a comment -

          Thinking a bit more about the problem, I think there could be issues in some cases:
          Consider a cluster with N nodes, L live and D decommissioned, with transceiver loads {X1, X2, ... XN} on the datanodes, and let X be the total load.

          A datanode is not a good target for writes when Xi > 2 * X / (L + D).

          That means when D > L, a lot of the nodes will not be eligible for writes. The remaining ones that are good will have to take the write load and will push X higher. Read traffic, which is not subject to the above condition, will also push X higher. In the worst-case scenario, if the load on every node is equal and the write load dominates reads, then very few or no nodes are good for writes!

          Some observations:

          1. This problem becomes severe as D gets close to and exceeds N/2.
          2. Decommissioning such a large number of datanodes has several issues:
            • It reduces the cluster's available free storage for writes. Writes could simply fail because of no free storage. The decommissioning may not complete because of the lack of free storage.
            • Further, when this happens, the number of nodes available for writes is significantly reduced (as writes are not done to the D nodes).
            • Note this problem also exists when decommissioning is in progress for a large number of nodes.

          Given this I am leaning towards not handling this case.

          Todd Lipcon added a comment -

          The issue with isGoodTarget() is that it doesn't just prioritize low-load nodes; it actively excludes any node with load > 2*avgLoad. Consider this situation:

          • 10 nodes, each of which has 2 transceivers active (total load = 20)
          • decommission 6 of them
          • total load is now only 8 (from the remaining ones) but "size" is still 10
          • avgLoad = 0.8
          • any node with load >1.6 will not be chosen
          • thus all nodes are excluded and we cannot allocate a block

          That is to say, decommissioning nodes now contribute to the denominator of the total/numNodes fraction but not the numerator, which drives the average down.

          Does this make sense?
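
          For reference, a minimal sketch of that load check with the numbers above (hypothetical code, not the actual BlockPlacementPolicyDefault):

          /** Sketch of the 2x-average-load check using the example numbers. */
          class LoadCheckSketch {
            public static void main(String[] args) {
              int numNodes = 10;        // decommissioning nodes still counted in the node count
              int totalLoad = 8;        // but their transceivers no longer counted in the load
              double avgLoad = (double) totalLoad / numNodes;  // 0.8
              int nodeLoad = 2;         // every remaining live node has 2 active transceivers

              // isGoodTarget-style check: reject any node whose load exceeds twice the average.
              boolean goodTarget = nodeLoad <= 2.0 * avgLoad;  // 2 <= 1.6 -> false for every node
              System.out.println("avgLoad=" + avgLoad + ", goodTarget=" + goodTarget);
            }
          }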

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12468084/HDFS-1547.1.patch
          against trunk revision 1057414.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.namenode.TestStorageRestore

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/98//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/98//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/98//console

          This message is automatically generated.

          Suresh Srinivas added a comment -

          I do not think this is a problem. Assuming you are talking about the isGoodTarget() method, the intent is to keep the transceiver count below 2 * (average load of the cluster). This only prevents machines loaded beyond this count from being chosen, distributing load across the nodes. The average load keeps growing as more transceivers are needed.

          To put it another way, when the cluster is very lightly loaded, the average load is approximately zero. This results in a node with no xceivers being chosen over any node with 1 xceiver.

          Todd Lipcon added a comment -

          One potential issue, not sure if it's realistically a problem: in BlockPlacementPolicyDefault, the average load on the cluster is calculated by dividing FSN.totalLoad by the number of nodes. Since totalLoad no longer includes decommissioning nodes, I wonder whether this could create a situation where all nodes are considered "overloaded" and we can't allocate new blocks.

          Tsz Wo Nicholas Sze added a comment -

          +1

          Suresh Srinivas added a comment -

          Thanks for the suggestion. I made the change in the patch.

          Tsz Wo Nicholas Sze added a comment -

          Patch looks good. Just one comment:

          You may define a comparator in DatanodeInfo.

            public static final Comparator<DatanodeInfo> DECOMMISSION_CMP = new Comparator<DatanodeInfo>() {
              @Override
              public int compare(DatanodeInfo a, DatanodeInfo b) {
                return a.isDecommissioned() == b.isDecommissioned() ? 0
                    : a.isDecommissioned()? 1 : -1;
              }
            };
          

          Then, sorting is very simple.

              Arrays.sort(locations, DatanodeInfo.DECOMMISSION_CMP);
          

          Of course, we should also use it in JspHelper

          +        case FIELD_DECOMMISSIONED:
          +          ret = DatanodeInfo.DECOMMISSION_CMP.compare(d1, d2);
          +          break;
          
          Suresh Srinivas added a comment -

          I am not using Review Board as I could not register with it. I am having a hard time with the reCAPTCHA (not sure, something may be wrong with the backend).

          Suresh Srinivas added a comment -

          In addition to the proposed changes, I made the following changes:
          > In Live Nodes web page, in addition to live nodes listed today, a separate table (not column) will list decommissioned nodes.
          I noticed that there is already a column in the live nodes table that lists the admin state. I made it sortable to find the decommissioned nodes, instead of adding a new table.

          > In Dead Nodes web page, in addition to dead nodes listed today, a separate table (not column) will list decommissioned nodes.
          I added a column that indicates whether the node is decommissioned (true/false). It is sortable.

          The reason for not adding a separate table is that it makes sorting the tables confusing and the existing JSP page more complicated.

          Other changes:

          1. Added the decommissioned state to the JMX methods FSNameSystem#getLiveNodes() and FSNameSystem#getDeadNodes().
          2. Changed the DFSAdmin command to fix the incorrect explanation of the include/exclude files.
          3. Updated the DFSAdmin -refreshNodes documentation to describe the new behavior.
          Sanjay Radia added a comment -

          +1
          Good analysis, good summary. Thanks, Suresh.

          Suresh Srinivas added a comment -

          Thanks guys for the comments.

          Summary of proposed changes:

          1. When datanodes are added to the exclude file:
            • Currently registered datanodes will be decommissioned. (same as today)
            • Datanodes that registered previously after namenode start will be decommissioned when they register back. (same as today)
            • New datanodes will be allowed to register and will be decommissioned. (changed)
            • After a node is decommissioned, it is allowed to communicate with the namenode. (changed)
            • Decommissioned and decommissioning nodes' free storage capacity does not count towards the free storage capacity of the cluster. (changed)
          2. NameNode WebUI changes:
            • In the cluster summary, the live node count will show an additional decommissioned node count: "Live Nodes : <Live node count> (Decommissioned <Decommissioned count>)"
            • In the cluster summary, the dead node count will show an additional decommissioned node count: "Dead Nodes : <Dead node count> (Decommissioned <Decommissioned count>)"
            • In Live Nodes web page, in addition to live nodes listed today, a separate table (not column) will list decommissioned nodes.
            • In Dead Nodes web page, in addition to dead nodes listed today, a separate table (not column) will list decommissioned nodes.
          3. I will rename the configuration parameter from "dfs.hosts.exclude" to "dfs.hosts.decom" and use the key deprecation mechanism to support the older config param "dfs.hosts.exclude" for backward compatibility (see the sketch after this list).
          4. The documentation associated with the include/exclude files is confusing and incorrect in some places. Will update the doc as well.
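
          For item 3, a minimal sketch of the key-deprecation idea (assuming the Configuration.addDeprecation API of this era; the exact mechanism in the patch may differ):

          import org.apache.hadoop.conf.Configuration;

          // Sketch only: registers the old exclude key as a deprecated alias of the proposed new key,
          // so existing configurations that set dfs.hosts.exclude keep working.
          public class DecomKeyDeprecation {
            public static void register() {
              Configuration.addDeprecation("dfs.hosts.exclude", new String[] { "dfs.hosts.decom" });
            }
          }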

          There are other enhancements that came out of our discussions. I will open jiras to track them.

          Suresh Srinivas added a comment -

          > include file should rarely change (only when new namenodes are added)
          I meant datanodes

          Suresh Srinivas added a comment -

          Scott, if you follow the entire discussion, the idea of adding a third file has been dropped. We will retain only two files:
          include - cluster configuration file which lists the datanodes in the cluster
          exclude (rename it to the decom file?) - file to decommission nodes. These nodes are listed (thanks Dhruba) as the last location for a block, to satisfy the intent of HADOOP-442.

          The reason I think this is better is: the include file should rarely change (only when new namenodes are added). Compared to that, the exclude file will change more frequently.

          The current documentation alludes to a relationship between include and exclude, describing the behavior when a datanode is in include and not exclude, in exclude and not in include, and in both files. This is not necessary any more. The include file lists the datanodes that make up the cluster; exclude (or decom) is for decommissioning. I will update the document to reflect this.

          Scott Carey added a comment -

          I like Todd's proposal to have only one file that lists each node at most once, and I do not see any explanation of why it won't work.

          A node has only one state from the administrator's POV, and what should be shown in the UI (dead, decommission in progress, etc.) can be derived from that.

          Why have 3 files when one will do? It's only more confusing.

          Yes, the current two-file format has issues because the meaning is overloaded or the names are bad. But a single file with a format like Todd suggests seems like it would work. Possible format:

          node1=active
          node2=decommission
          node3=exclude
          

          When an administrator wants to decommission a node, the part after the = in the file for that node is changed from active to decommission. Nodes in the decommission state are allowed to talk to the NN and register with it, but will shut down after successful decommission. Nodes marked exclude are not allowed to talk to the NN. Nodes marked active are tracked and compared to what is registered (along with decommission-marked nodes) to identify dead nodes.

          In short, all three files in this proposal could be combined into one.

          dhruba borthakur added a comment -

          > It is not clear if the "decommissioned" state will have active heartbeats.

          I agree. I am suggesting that this JIRA not mess with heartbeat/block report frequencies for decommissioned datanodes. That is mostly an optimization and I am not sure whether it is worth the complexity; of course, we can discuss it in another jira.

          > I am concerned not shutting down datanode could result in namenode using decommissioned nodes in unintended ways.

          This is a valid concern. What if the list of datanodes returned via getBlockLocations is such that the datanodes that are in the "decommissioned" state are always listed last? This will decrease the probability that clients will ever use the decommissioned nodes to read data. That should help alleviate this concern, shouldn't it?

          Otherwise, +1 to Suresh's proposal.

          Suresh Srinivas added a comment -

          > Does the dfsadmin -refreshNodes command upload the excludes and includes and the decom file (all three?) into the namenode's in memory state?
          It is good to persist this at the namenode. That way, after a namenode restart, the datanodes that were intended to be out of service will not come back into service.

          > when a node is being-decommissioned state, why do you propose that it reduces the frequency of block reports and heartbeats? Is this really needed?...
          Sanjay's comment addresses this.

          > I really like the fact that you are proposing the decommissioned nodes are not auto shutdown.
          This was my original proposal. After thinking a bit, I see the following issues:

          • Not shutting down datanodes changes the intent of HADOOP-442; shutting down the datanode ensures problematic datanodes cannot be used any more.
          • Currently the shutdown ensures datanodes are not used by the namenode. I am concerned that not shutting down the datanode could result in the namenode using decommissioned nodes in unintended ways.
          • My concern earlier was that there is no way to figure out whether a datanode is dead because decommission is complete or for other reasons. However, the namenode has the state that the datanode is decommissioned. We could improve the current dead node list to show two lists, decommissioned and dead, in the namenode WebUI.
          • The free storage capacity of the decommissioned datanodes should not be counted towards the available storage capacity of the cluster. Only used capacity should count towards the cluster used capacity.

          Current behavior is:

          1. A currently registered datanode is decommissioned and then disallowed from communicating with the NN.
          2. A datanode that had registered previously with the NN (after the NN is restarted) and is not currently registered is decommissioned if it registers with the NN.
          3. A datanode that had not registered (after NN restart) is disallowed from registering and is never decommissioned.

          By changing behavior (3), most of what I had proposed for the decom file can be achieved. This also avoids having two config files, exclude and decom, with a very small and subtle semantic difference.

          Sanjay Radia added a comment -

          > when a node is in the being-decommissioned state, why do you propose that it reduces the frequency ..
          The reduced frequency is not for the "being-decommissioned" state but perhaps for the "decommissioned" state.
          I think that longer term (perhaps in another jira) the NN should return a state to the DN:

          • "active" - the DN is registered with the NN as active.
          • "being-decommissioned" - the DN is registered and is being decommissioned. This state will be useful to break pipelines so that decommissioning can continue and finish in spite of a long-running, slow pipeline.
          • "decommissioned" - the DN has been successfully decommissioned.

          The heartbeat and BR intervals for "active" and "being-decommissioned" are the same.
          It is not clear whether the "decommissioned" state will have active heartbeats (perhaps at a slower interval) or any BRs. This is a longer discussion (Suresh and I have debated this for a long time and haven't come to an agreement yet). This state change and handshake protocol change is probably best done in a separate jira.

          Sanjay Radia added a comment -

          > 4. I really like the fact that you are proposing the decommissioned nodes are not automatically shut down.
          Suresh thinks I paid you to say this (I will send you the $10 shortly).
          +++++1
          One cannot infer that a DN completed its decommissioning from the fact that it shut down. A DN may die for an unknown reason. One has to check the NN state to see whether a DN was successfully decommissioned.

          I also agree that, since we don't need to support the original motivation for the exclude file, there is one file and it is used for decommissioning.
          I could live with using the name "excludes" but would prefer to give it the name "decommissioned"; yes, it breaks compatibility, but going forward it will have a clean, simple meaning and no one will need to read documentation that says "excludes" used to mean XXX + decommission and now it means decommission. Let's name this file "decommission" and say that "excludes" is not supported.

          dhruba borthakur added a comment -

          Thanks, Suresh, for addressing this longstanding problem.

          1. Does the dfsadmin -refreshNodes command upload the excludes and includes and the decom file (all three?) into the namenode's in memory state?

          2. When a node is in the being-decommissioned state, why do you propose that it reduces the frequency of block reports and heartbeats? Is this really needed? This looks like an optimization to me. A decommissioned node can continue to serve read requests, so we should continue to handle it just like any other live node as far as heartbeat and block report periodicity are concerned, shouldn't we?

          3. I am fine if you propose that we stop supporting decommissioning via the excludes file. It is an incompatible change, but not a catastrophic one, and the release notes can clearly spell this out.

          4. I really like the fact that you are proposing the decommissioned nodes are not automatically shut down.

          Sanjay Radia added a comment -

          >> Have you considered adding RPCs (and CLI wrappers for them) for explicitly updating data node states...
          What you are proposing is doable (in a separate jira), but realize that currently the NN only persists the namespace and leases; this would be a new category of persistent state on the NN side.

          Suresh Srinivas added a comment -

          The way the files are named is confusing. With names such as include and exclude, one would think they are mutually exclusive.

          The include file is a list of the datanodes in the cluster. Nodes that have problems (e.g., access over ssh is not possible) are excluded by adding them to the exclude list. See HADOOP-442 for details. Giving the include file a different name, say "datanodes", rather than a name that reads as the opposite of exclude, would have reduced the confusion.

          Todd Lipcon added a comment -

          Indeed, that's how it works now. The point is that this confuses new users (whenever I have taught a class of new Hadoop admins there have been questions on this front, and adding yet one more list of nodes is going to confuse them even more).

          Suresh Srinivas added a comment -

          The current mechanism requires nodes to be listed in both the include and exclude files.

          The purpose of the include file is to list the datanodes that need to be tracked by the namenode. Without it, the namenode does not know which datanodes could connect to it and hence cannot show a datanode in the dead node list if that datanode never registered with it.

          Similarly, the nodes in the decom list are also in the include list.

          Todd Lipcon added a comment -

          One thing I've thought about a few times that I think should be considered: rather than having separate include/exclude/decom files, why not have one file with a very simple .properties style format?

          dn1 = active
          dn2 = active
          dn3 = decommission
          dn4 = exclude
          

          This way we don't have to enumerate all the combinations like "what if it's in both include and exclude? what if it's in decommission but not include?"
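          As a rough illustration of this idea, such a file could be read with java.util.Properties; the file path and state names below are assumptions, not an existing format:

          // Minimal sketch of parsing a single properties-style node status file.
          import java.io.FileInputStream;
          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;
          import java.util.Properties;

          enum NodeAdminState { ACTIVE, DECOMMISSION, EXCLUDE }

          class NodeStatusFile {
              static Map<String, NodeAdminState> load(String path) throws IOException {
                  Properties props = new Properties();
                  FileInputStream in = new FileInputStream(path);
                  try {
                      props.load(in);
                  } finally {
                      in.close();
                  }
                  Map<String, NodeAdminState> states = new HashMap<String, NodeAdminState>();
                  for (String host : props.stringPropertyNames()) {
                      // One entry per host, e.g. "dn3 = decommission"; unknown values fail fast.
                      String value = props.getProperty(host).trim().toUpperCase();
                      states.put(host, NodeAdminState.valueOf(value));
                  }
                  return states;
              }
          }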

          Suresh Srinivas added a comment -

          > Have you considered adding RPCs (and CLI wrappers for them) for explicitly updating data node states...
          This is something we could do in another jira. The only requirement is that an absolute list is sent to the namenode (instead of incremental adds/deletes, etc.).
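          For illustration only, such an RPC could take the absolute list in one call, roughly like the interface sketch below; the names are hypothetical and this is not an existing Hadoop protocol:

          // Hypothetical admin protocol sketch for setting node states as one absolute list.
          import java.io.IOException;
          import java.util.Map;

          interface NodeAdminProtocolSketch {
              enum State { ACTIVE, DECOMMISSION, EXCLUDE }

              // Replaces the namenode's entire view of administratively set node states.
              // Sending the full list keeps the call idempotent and avoids add/delete ordering issues.
              void setNodeStates(Map<String, State> hostToState) throws IOException;
          }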

          Suresh Srinivas added a comment -

          > decommission-in-progress: decommissioning has started at the datanode. The namenode sends this state to the datanodes. Datanodes reduce the frequency of heartbeats to Max(heartbeat time, datanode expiry time/2)?

          Forgot to make this change in the earlier comment: the reduced frequency of heartbeats will only apply in the decommissioned state. These optimizations are optional.

          Philip Zeyliger added a comment -

          Hi Suresh,

          For decommissioning, it's always seemed a bit odd that interacting with the system requires editing a file local to the namenode as well as sending an RPC. Have you considered adding RPCs (and CLI wrappers for them) for explicitly updating data node states, without going through a file?

          Thanks,

          – Philip

          Suresh Srinivas added a comment -

          Background

          1. To decommission datanodes, the datanodes are added to the exclude file on the namenode, followed by the "refreshNodes" command.
            • The NN starts decommissioning registered datanodes.
            • If a datanode is not registered, the NN excludes it from registering.
            • Removing a datanode from the exclude file stops decommissioning.
          2. Decommissioning is complete when all the replicas on the datanode have been replicated to other nodes.
          3. After a node is decommissioned, it can no longer talk to the NN. This results in datanode shutdown.

          Problems

          1. The exclude file has overloaded semantics - it is used both to exclude a datanode from registering and for decommissioning. This is confusing and results in the following issues:
            • If a datanode is not registered, then after it is added to the exclude list it is not allowed to register and hence is never decommissioned.
            • After adding a datanode to the exclude list, restarting the NN results in decommissioning not completing, as the datanodes are excluded from registration.
            • If the datanode holds the only replicas of a block in the above two scenarios, the disallowed registration causes missing blocks and corrupt files.
          2. When decommissioning is done, the datanode shuts down. There is no way to figure out whether a datanode died before decommissioning was complete or died because decommissioning was complete.

          Proposed changes

          New decom file

          1. A new file will be used for listing the datanodes to be decommissioned. Adding a datanode to this file starts decommissioning. Removing a datanode from this file stops decommissioning.
          2. The namenode allows registration from nodes in the decom list and continues decommissioning if it is not complete.
          3. Decommissioned datanodes are not automatically shut down. They have to be shut down by the administrator.
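          For illustration, a refreshNodes-style reload of the proposed decom file could look like the sketch below; the class and method names are assumptions, not the actual namenode implementation:

          // Hypothetical sketch: reload the proposed decom file on a refresh request.
          import java.io.BufferedReader;
          import java.io.FileReader;
          import java.io.IOException;
          import java.util.HashSet;
          import java.util.Set;

          class DecomFileRefresher {
              private final Set<String> decomHosts = new HashSet<String>();

              // Re-reads the decom file; one hostname per line, blank lines and '#' comments ignored.
              synchronized void refresh(String decomFilePath) throws IOException {
                  Set<String> newList = new HashSet<String>();
                  BufferedReader reader = new BufferedReader(new FileReader(decomFilePath));
                  try {
                      String line;
                      while ((line = reader.readLine()) != null) {
                          line = line.trim();
                          if (!line.isEmpty() && !line.startsWith("#")) {
                              newList.add(line);
                          }
                      }
                  } finally {
                      reader.close();
                  }
                  // Hosts added to the file start decommissioning; hosts removed from it stop.
                  decomHosts.clear();
                  decomHosts.addAll(newList);
              }

              synchronized boolean shouldDecommission(String host) {
                  return decomHosts.contains(host);
              }
          }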

          Datanode will have the following new states:

          1. inservice - the datanode is registered with the namenode and is providing storage service.
          2. decommission-in-progress - decommissioning has started at the datanode. The namenode sends this state to the datanodes. Datanodes reduce the frequency of heartbeats to Max(heartbeat time, datanode expiry time/2)?
          3. decommissioned - decommissioning is complete. The namenode sends this state to the datanode. In this state the datanode no longer sends block reports to the namenode and also reduces the frequency of heartbeats.
          4. disconnected - the datanode is not able to communicate with the namenode.
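          A rough sketch of these proposed states and the heartbeat-interval idea follows. It is illustrative only; the interval and expiry values are assumed typical defaults (3-second heartbeat, roughly 10.5-minute expiry window), not settled numbers.

          // Illustrative sketch of the proposed datanode states and heartbeat rule.
          enum DatanodeAdminState {
              IN_SERVICE,
              DECOMMISSION_IN_PROGRESS,
              DECOMMISSIONED,
              DISCONNECTED
          }

          class HeartbeatPolicySketch {
              static final long HEARTBEAT_INTERVAL_MS = 3000L;   // assumed default: 3 seconds
              static final long DATANODE_EXPIRY_MS = 630000L;    // assumed expiry window: ~10.5 minutes

              // Proposed rule: slow heartbeats down only once the node is fully decommissioned.
              static long heartbeatIntervalFor(DatanodeAdminState state) {
                  if (state == DatanodeAdminState.DECOMMISSIONED) {
                      // Max(heartbeat time, expiry/2) => max(3s, 315s) = 315s with these defaults.
                      return Math.max(HEARTBEAT_INTERVAL_MS, DATANODE_EXPIRY_MS / 2);
                  }
                  return HEARTBEAT_INTERVAL_MS;
              }
          }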

          I need feedback on a choice we need to make:

          1. For backward compatibility, continue to support using the exclude file for decommissioning, with the current semantics. This is in addition to the new way of decommissioning a node by adding it to a separate decom file. When a node is in both files, the decom file takes precedence.
          2. Deprecate support for decommissioning using the exclude file. This will not be backward compatible.

            People

            • Assignee:
              Suresh Srinivas
              Reporter:
              Suresh Srinivas
            • Votes:
              0
              Watchers:
              10
