Hadoop HDFS / HDFS-7877

Support maintenance state for datanodes

    Details

    • Hadoop Flags:
      Reviewed

      Description

      This requirement came up during the design for HDFS-7541. Given that this feature is mostly independent of the upgrade domain feature, it is better to track it under a separate jira. The design and draft patch will be available soon.

      Attachments

      1. HDFS-7877.patch
        115 kB
        Ming Ma
      2. HDFS-7877-2.patch
        158 kB
        Ming Ma
      3. Supportmaintenancestatefordatanodes.pdf
        282 kB
        Ming Ma
      4. Supportmaintenancestatefordatanodes-2.pdf
        257 kB
        Ming Ma

        Issue Links

          Activity

          mingma Ming Ma added a comment -

          Here are the initial design document and draft patch. Appreciate any input others might have.

          To support maintenance state, we need to provide an admin interface, manage the datanode state transitions, and handle block-related operations.

          After we agree on the design, we can break the feature into subtasks.

          aw Allen Wittenauer added a comment -

          Isn't this effectively a dupe of HDFS-6729?

          mingma Ming Ma added a comment -

          Thanks Allen for pointing that out. We didn't know about HDFS-6729 at all. Let me check out the approach in that jira so we can combine the efforts.

          eddyxu Lei (Eddy) Xu added a comment -

          Hey Ming Ma, thanks a lot for working on this. I am glad that this issue is being picked up!

          Please allow me some time to go through your docs and patch. I will post comments shortly.

          eddyxu Lei (Eddy) Xu added a comment -

          Hi, Ming Ma. This work looks great and is more comprehensive than HDFS-6729. I especially like the design where the NN checks for blocks with a single replica before setting a DN to maintenance mode: it is safer than HDFS-6729.

          I have a few questions regarding the rest of your design.

          • Why is the node state a combination of <live|dead> and In service|Decommissioned|In maintenance? Do we need to keep a DN in maintenance mode if it is dead? It makes the state machine very complex.
          • Is the DN state (e.g., enter_maintenance or in_maintenance) kept in the NN's memory? After the NN restarts, I think the NN could not find out whether a DN is in enter_maintenance or in_maintenance mode. Is there any default mode you will assume for a DN? Or is there a way for the NN to decide which state the DN is in?
          • Moreover, after the NN restarts, if a DN is actually in maintenance mode (the DN is shut down for maintenance), the NN could not receive block reports from this DN. If this is the case, would the NN miscalculate the blockMap?
          • put the dead node into maintenance mode

            Would it be necessary? As you mentioned, when a DN is dead, its blocks are already replicated to other nodes. In my understanding, the maintenance mode is a way to tell the NN not to move data while the DN is actually offline. The logic that brings back a dead IN_MAINTENANCE DN and removes replicas from the block map looks very similar to restarting a (dead) DN. Could it simply reuse that logic?

          • In HDFS-6729, I considered maintenance mode a temporary soft state, because my understanding is that putting a DN into maintenance mode risks the availability of data. It essentially asks the NN to ignore one "dead" (in maintenance) replica. As a result, I did not put DNs into a persistent configuration file, and instead let the user specify a timeout for the DN to be in maintenance mode. When the timeout expires (e.g., a 1-hour maintenance window), the NN considers this DN dead and re-replicates its blocks somewhere else. Does that make sense to you? Could you address this concern in your design?

          Looking forward to hearing from you, Ming Ma. Thanks again for this great work!

          mingma Ming Ma added a comment -

          Thanks Eddy for the review and suggestions. Please find my response below. Chris might have more to add.

          Why is the node state the combination of <live|dead> and In service|Decommissioned|In maintenance..?

          There are two state machines for a datanode. One is the liveness state; the other is the admin state. HDFS-7521 has some discussion around that. So a datanode can be in any combination of these two states. That is why, if a node becomes dead while it is being decommissioned, it remains in the DECOMMISSION_IN_PROGRESS state until all the blocks are properly replicated.
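          As a rough sketch of these two orthogonal state machines (the enum and field names below are illustrative, not the actual HDFS classes):

            // Illustrative sketch only: in the real code the admin state lives on the
            // NN's datanode descriptor and liveness is derived from heartbeats.
            enum Liveness { LIVE, DEAD }

            enum AdminState {
              IN_SERVICE,
              DECOMMISSION_IN_PROGRESS,
              DECOMMISSIONED,
              ENTERING_MAINTENANCE,
              IN_MAINTENANCE
            }

            class DatanodeState {
              Liveness liveness;   // updated from heartbeats
              AdminState admin;    // updated from the exclude/maintenance files
            }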

          After NN re-starts, I think NN could not find out whether DN is in enter_maintenance or in_maintenance mode?

          The design handles the datanode state management for ENTERING_MAINTENANCE and IN_MAINTENANCE similarly to DECOMMISSION_IN_PROGRESS and DECOMMISSIONED, in the following ways.

          1. When a node registers with the NN (which could be a datanode restart or an NN restart), it first transitions to DECOMMISSION_IN_PROGRESS if it is in the exclude file, or ENTERING_MAINTENANCE if it is in the maintenance file.
          2. Only after the target replication has been reached is it transitioned to the final state, DECOMMISSIONED or IN_MAINTENANCE (see the sketch below).
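          A minimal sketch of that two-step transition, reusing the illustrative types above; the helper methods are hypothetical placeholders, not real HDFS APIs:

            abstract class AdminStateTransitionSketch {
              // Hypothetical helpers for the sketch.
              abstract boolean isInExcludeFile(DatanodeState dn);
              abstract boolean isInMaintenanceFile(DatanodeState dn);
              abstract boolean allBlocksSufficientlyReplicated(DatanodeState dn);

              // Step 1: admin state assigned when the node registers with the NN
              // (after a datanode restart or an NN restart).
              void onRegister(DatanodeState dn) {
                if (isInExcludeFile(dn)) {
                  dn.admin = AdminState.DECOMMISSION_IN_PROGRESS;
                } else if (isInMaintenanceFile(dn)) {
                  dn.admin = AdminState.ENTERING_MAINTENANCE;
                } else {
                  dn.admin = AdminState.IN_SERVICE;
                }
              }

              // Step 2: periodic check; move to the terminal state only once the
              // target replication has been reached for all of the node's blocks.
              void onReplicationCheck(DatanodeState dn) {
                if (!allBlocksSufficientlyReplicated(dn)) {
                  return;
                }
                if (dn.admin == AdminState.DECOMMISSION_IN_PROGRESS) {
                  dn.admin = AdminState.DECOMMISSIONED;
                } else if (dn.admin == AdminState.ENTERING_MAINTENANCE) {
                  dn.admin = AdminState.IN_MAINTENANCE;
                }
              }
            }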

          Moreover, after NN restarts, if a DN is actually in the maintenance mode (DN is shutting down for maintenance), NN could not receive block reports from this DN.

          After the NN restarts, if a DN in the maintenance file doesn't register with the NN, it won't be in DatanodeManager's datanodeMap and thus its state won't be tracked. So it should be similar to how decommission is handled.

          If the DN does register with the NN, there is a bug in the current patch: it doesn't check whether the NN has received a block report from the DN, so it may prematurely transition the DN to the in_maintenance state.

          Is "put the dead node into maintenance mode" necessary?

          Good question: is it ok to keep the node in the (dead, normal) state when admins add the node to the maintenance file?

          The intention is to keep it consistent with the actual content of the maintenance file. It is similar to how decommission is handled; if you add a dead node to the exclude file, the node goes directly into the DECOMMISSIONED state. For replica processing, the transition (dead, in_maintenance) -> (live, in_maintenance) won't trigger excess block removal, while (live, in_maintenance) -> (live, normal) will (see the sketch below).
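          A hedged sketch of that replica-processing rule, again with the illustrative types above rather than the actual BlockManager logic:

            // Excess-replica removal is triggered only when a live node leaves
            // maintenance, not when a dead IN_MAINTENANCE node comes back live.
            boolean triggersExcessBlockRemoval(DatanodeState before, DatanodeState after) {
              boolean wasLiveInMaintenance = before.liveness == Liveness.LIVE
                  && before.admin == AdminState.IN_MAINTENANCE;
              boolean becomesLiveInService = after.liveness == Liveness.LIVE
                  && after.admin == AdminState.IN_SERVICE;
              return wasLiveInMaintenance && becomesLiveInService;
            }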

          Timeout support

          Good suggestion. We discussed this topic during the design discussion. We feel the admin script can handle that outside HDFS; upon timeout, the admin script can remove the node from the maintenance file and thus trigger replication. If we support timeouts in HDFS, nodes in the maintenance file won't necessarily be in maintenance states. Alternatively, we could add another state called maintenance_timeout, but that might be too complicated. I can understand the benefit of having a timeout here, so we would like to hear others' suggestions.

          There are two new topics we want to bring up.

          • The original design doc uses the cluster's default minimal replication factor to decide if a node can exit the ENTERING_MAINTENANCE state. We might want to use a new config value so that we can set it to two. For a scenario like a Hadoop software upgrade, if used together with upgrade domains, "two replicas" will be met right away for most blocks. For a scenario like rack repair, "two replicas" gives us better data availability. At the least, we can test out different values independently of the cluster's minimal replication factor.
          • Whether reads are allowed on a node in the ENTERING_MAINTENANCE state. Perhaps we should support that; it will handle the case where that is the only replica available. We can put such replicas at the end of LocatedBlock (see the sketch below).
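          A minimal sketch of that read ordering, using the illustrative DatanodeState/AdminState types above rather than the real LocatedBlock construction code:

            import java.util.ArrayList;
            import java.util.List;

            class ReadOrderingSketch {
              // 0 = preferred (fully in service), 1 = try last (entering maintenance).
              static int readPriority(DatanodeState dn) {
                return dn.admin == AdminState.ENTERING_MAINTENANCE ? 1 : 0;
              }

              // Keep live ENTERING_MAINTENANCE replicas readable, but order them last
              // so clients only fall back to them when no other replica is available.
              static List<DatanodeState> orderForRead(List<DatanodeState> locations) {
                List<DatanodeState> ordered = new ArrayList<>(locations);
                ordered.sort((a, b) -> Integer.compare(readPriority(a), readPriority(b)));
                return ordered;
              }
            }
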
          mingma Ming Ma added a comment -

          Here is the updated design, which adds support for a configurable minimal maintenance replication factor and for read operations on "live entering_maintenance" nodes. The patch has been updated accordingly with bug fixes and more unit tests. Appreciate any input others might have.

          kihwal Kihwal Lee added a comment -

          Rajiv Chittajallu, it would be nice if we could get your perspective on this.

          rajive Rajiv Chittajallu added a comment -
          • It would be preferable to have a timeout for the maintenance state, which would be higher than dfs.namenode.heartbeat.recheck-interval.
          • Instead of specifying hosts in a file, dfs.hosts.maintenance, can this be done via dfsadmin? Maintenance mode is a temporary, transient state and it would be simpler not to track it via files.

          That is why, if a node becomes dead while it is being decommissioned, it remains in the DECOMMISSION_IN_PROGRESS state until all the blocks are properly replicated.

          If a datanode goes offline while decommissioning, it should be treated as dead and should not remain in the DECOMMISSION_IN_PROGRESS state. Re-replicating blocks for nodes in the dead state should be treated with higher priority.

          mingma Ming Ma added a comment -

          Thanks Rajiv Chittajallu for your input! I also discussed with Dan Romike.

          • Support for timeout. Sounds like folks prefer to have HDFS support that, which makes sense. A value of -1 could mean no timeout. In addition, based on current scenarios it seems we don't need to support a per-host timeout; instead we can use a global timeout value (a sketch follows after this list).
          • Support for persistence. If we don't persist the maintenance nodes in some file, the state will be lost after an NN restart. In other words, the node will be transitioned out of the maintenance state upon NN restart. So from the admin's point of view, the node could be transitioned out of the maintenance state prior to the timeout. Are we ok with such possible inconsistency?
          • Whether the node should be taken out of DECOMMISSIONING when it becomes dead. The admin state is separate from the liveness state. The reason the node is kept in the DECOMMISSIONING state is to address data reliability; HDFS-6791 has more details.
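          A small sketch of the global-timeout idea above; the method and parameter names are illustrative, not an actual HDFS API:

            // -1 disables the timeout; otherwise the node's maintenance state expires
            // after the configured number of minutes.
            static boolean maintenanceExpired(long enteredMaintenanceTimeMs,
                long timeoutMinutes, long nowMs) {
              if (timeoutMinutes < 0) {
                return false; // no timeout: stay in maintenance until the admin removes it
              }
              return nowMs - enteredMaintenanceTimeMs >= timeoutMinutes * 60_000L;
            }
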
          jrottinghuis Joep Rottinghuis added a comment -

          What do we need to do to get this going (again) in OSS? Just FYI, we're moving forward with this at Twitter on production clusters.

          mingma Ming Ma added a comment -

          For the open issues around timeout and persistence, Chris Trezzo, Lei (Eddy) Xu, and I had some offline discussion. We also discussed them with our admins. Appreciate input from others.

          • Timeout support. We should support it.
          • Persistence vs. soft state. Persistence is desirable in some cases, but soft state is acceptable. From an application's point of view, if it asks HDFS to time out the maintenance state, it would ideally like HDFS to honor the request (applications don't care about failover and restart as long as HDFS is up). Soft state means HDFS wouldn't honor the timeout value if there is an NN failover/restart. For some scenarios admins would prefer that HDFS honor the request across NN failover/restart, but they can also accept the soft-state approach.
          mingma Ming Ma added a comment -

          Maybe we should try to support persistence for the timeout. We can persist the maintenance expiration UTC time via some new mechanism discussed in HDFS-9005. The clocks can be out of sync among NNs, but we can accept that given the maintenance timeout precision is on the order of minutes. Chris Trezzo, Lei (Eddy) Xu, thoughts?

          manojg Manoj Govindassamy added a comment -

          Ming Ma,

          Dilaver brought up a good point regarding the restriction on the allowed range for the configuration dfs.namenode.maintenance.replication.min. Currently the allowed range for Maintenance Min Replication is 0 to dfs.namenode.replication.min (default=1). Users who don't want to affect the performance of the cluster may wish to set the Maintenance Min Replication greater than 1, say 2. In the current design, that configuration is possible, but only after changing the NameNode-level Block Min Replication to 2, which could increase the overall latency of client writes.

          Technically speaking, we should allow Maintenance Min Replication to be in the range 0 to dfs.replication.max. A value of 0 is still available for users who don't want any availability/performance guarantees during maintenance, and performance-centric workloads can still get maintenance done without major disruption by using a bigger Maintenance Min Replication. So, is there any reason you wanted the Maintenance Min Replication range to be restricted to less than or equal to dfs.namenode.replication.min? Maybe I am overlooking something here. Please clarify.

              if (minMaintenanceR < 0) {
                throw new IOException("Unexpected configuration parameters: "
                    + DFSConfigKeys.DFS_NAMENODE_MAINTENANCE_REPLICATION_MIN_KEY
                    + " = " + minMaintenanceR + " < 0");
              }
              if (minMaintenanceR > minR) {
                throw new IOException("Unexpected configuration parameters: "
                    + DFSConfigKeys.DFS_NAMENODE_MAINTENANCE_REPLICATION_MIN_KEY
                    + " = " + minMaintenanceR + " > "
                    + DFSConfigKeys.DFS_NAMENODE_REPLICATION_MIN_KEY
                    + " = " + minR);
              }
          mingma Ming Ma added a comment - - edited

          Thanks Manoj Govindassamy and Dilaver for the good point. What you suggested makes sense. The reason we haven't had this requirement so far is probably that when we put nodes into maintenance, we often do it one upgrade domain at a time, so no two replicas of a block are put into maintenance at the same time.

          To confirm: given that we still allow applications to create blocks with a smaller replication factor than dfs.namenode.maintenance.replication.min, the transition policy from ENTERING_MAINTENANCE to IN_MAINTENANCE becomes: # of live replicas >= min(dfs.namenode.maintenance.replication.min, replication factor).
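          A small sketch of that policy check for a single block (illustrative names, not the actual DecommissionManager code):

            // A block on an ENTERING_MAINTENANCE node is considered safe once
            // liveReplicas >= min(dfs.namenode.maintenance.replication.min, replication factor).
            static boolean hasEnoughLiveReplicasForMaintenance(int liveReplicas,
                int blockReplicationFactor, int minMaintenanceReplication) {
              int required = Math.min(minMaintenanceReplication, blockReplicationFactor);
              return liveReplicas >= required;
            }

          For example, with dfs.namenode.maintenance.replication.min = 2, a block created with replication factor 1 only needs one live replica before its node can transition to IN_MAINTENANCE.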

          manojg Manoj Govindassamy added a comment -

          Thanks Ming Ma. Got it; when you combine this with Upgrade Domains, the impact is not that severe.

          I will make the following change for the Maintenance Min Replication range validation check.

          --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          @@ -484,12 +484,12 @@ public BlockManager(final Namesystem namesystem, boolean haEnabled,
                     + DFSConfigKeys.DFS_NAMENODE_MAINTENANCE_REPLICATION_MIN_KEY
                     + " = " + minMaintenanceR + " < 0");
               }
          -    if (minMaintenanceR > minR) {
          +    if (minMaintenanceR > defaultReplication) {
                 throw new IOException("Unexpected configuration parameters: "
                     + DFSConfigKeys.DFS_NAMENODE_MAINTENANCE_REPLICATION_MIN_KEY
                     + " = " + minMaintenanceR + " > "
          -          + DFSConfigKeys.DFS_NAMENODE_REPLICATION_MIN_KEY
          -          + " = " + minR);
          +          + DFSConfigKeys.DFS_REPLICATION_DEFAULT
          +          + " = " + defaultReplication);
               }
          
          

          the transition policy from ENTERING_MAINTENANCE to IN_MAINTENANCE will become the # of live replicas >= min(dfs.namenode.maintenance.replication.min, replication factor).

          But the transition from ENTERING_MM to IN_MM, which happens in DecommissionManager#Monitor#check and in turn calls DecommissionManager#isSufficient, looks ok to me. Because we allow files to be created with a custom block replication count, say 1, which can be less than the default dfs.replication=3, and since we should not count the maintenance replicas, the formula, as it exists currently, is:

          expectedRedundancy = file_block_replication_count (e.g., 1) or default_replication_count (3)
          Math.max(
                  expectedRedundancy - numberReplicas.maintenanceReplicas(),
                  getMinMaintenanceStorageNum(block));
          

          Let me know if I am missing something. Thanks.

          ---- related code snippets ----

            /**
             * Checks whether a block is sufficiently replicated/stored for
             * decommissioning. For replicated blocks or striped blocks, full-strength
             * replication or storage is not always necessary, hence "sufficient".
             * @return true if sufficient, else false.
             */
            private boolean isSufficient(BlockInfo block, BlockCollection bc,
                NumberReplicas numberReplicas, boolean isDecommission) {
              if (blockManager.hasEnoughEffectiveReplicas(block, numberReplicas, 0)) {
                // Block has enough replica, skip
                LOG.trace("Block {} does not need replication.", block);
                return true;
              }
          ..
          ..
          ..
          
          
          
            // Check if the number of live + pending replicas satisfies
            // the expected redundancy.
            boolean hasEnoughEffectiveReplicas(BlockInfo block,
                NumberReplicas numReplicas, int pendingReplicaNum) {
              int required = getExpectedLiveRedundancyNum(block, numReplicas);
              int numEffectiveReplicas = numReplicas.liveReplicas() + pendingReplicaNum;
              return (numEffectiveReplicas >= required) &&
                  (pendingReplicaNum > 0 || isPlacementPolicySatisfied(block));
            }
          
          
            // Exclude maintenance, but make sure it has minimal live replicas
            // to satisfy the maintenance requirement.
            public short getExpectedLiveRedundancyNum(BlockInfo block,
                NumberReplicas numberReplicas) {
              final short expectedRedundancy = getExpectedRedundancyNum(block);
              return (short) Math.max(expectedRedundancy -
                  numberReplicas.maintenanceReplicas(),
                  getMinMaintenanceStorageNum(block));
            }
          
          mingma Ming Ma added a comment -

          ok. Will follow up the discussion in HDFS-11412.

          mingma Ming Ma added a comment -

          All sub-tasks have been resolved. Thanks to Chris Trezzo, Lei (Eddy) Xu, Manoj Govindassamy, Elek Marton, Yiqun Lin, and others for the contributions and discussion.


            People

            • Assignee:
              mingma Ming Ma
              Reporter:
              mingma Ming Ma
            • Votes:
              1
              Watchers:
              30

              Dates

              • Created:
                Updated:
                Resolved:

                Development