HADOOP-4061

Large number of decommission freezes the Namenode

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.2
    • Fix Version/s: 0.18.3, 0.19.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Added a new conf property, dfs.namenode.decommission.nodes.per.interval, so that the NameNode checks the decommission status of x nodes every y seconds, where x is the value of dfs.namenode.decommission.nodes.per.interval and y is the value of dfs.namenode.decommission.interval.
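      As a rough illustration (not from the patch; the values below are just the defaults mentioned in the comments further down), the two properties can be set through the Hadoop Configuration API:

        // Illustrative only: check 5 nodes every 30 seconds. The property
        // names come from the release note above; the values are examples.
        org.apache.hadoop.conf.Configuration conf =
            new org.apache.hadoop.conf.Configuration();
        conf.setInt("dfs.namenode.decommission.nodes.per.interval", 5);
        conf.setInt("dfs.namenode.decommission.interval", 30); // seconds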

      Description

      On a 1900-node cluster, we tried decommissioning 400 nodes with 30k blocks each. The other 1500 nodes were almost empty.

      When decommissioning started, the namenode's queue overflowed every 6 minutes.

      Looking at the CPU usage, we saw that every 5 minutes the org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread took 100% of the CPU for 1 minute, causing the queue to overflow.

        // The check in question: runs periodically while holding the
        // namesystem lock and walks every datanode in the cluster on each pass.
        public synchronized void decommissionedDatanodeCheck() {
          for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
               it.hasNext();) {
            DatanodeDescriptor node = it.next();
            checkDecommissionStateInternal(node);
          }
        }
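
      For contrast, here is an illustrative sketch of the throttled approach discussed in the comments below: check at most a fixed number of decommission-in-progress nodes per pass. It is written in the same assumed FSNamesystem context as the snippet above (datanodeMap, checkDecommissionStateInternal, and an assumed isDecommissionInProgress() accessor); it is not the committed patch.

        // Illustrative sketch only: check at most numNodesPerCheck
        // decommission-in-progress nodes per pass instead of every datanode.
        // A real fix also has to remember where the previous pass stopped;
        // that is what the cyclic iterator discussed below provides.
        public synchronized void throttledDecommissionCheck(int numNodesPerCheck) {
          int count = 0;
          for (DatanodeDescriptor node : datanodeMap.values()) {
            if (node.isDecommissionInProgress()) {
              checkDecommissionStateInternal(node);
              if (++count >= numNodesPerCheck) {
                return;  // remaining nodes are handled on later passes
              }
            }
          }
        }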
      
      Attachments

      1. 4061_20081124c_0.18.patch
        18 kB
        Tsz Wo Nicholas Sze
      2. 4061_20081124c.patch
        20 kB
        Tsz Wo Nicholas Sze
      3. HADOOP-4061.patch
        13 kB
        Raghu Angadi
      4. 4061_20081124b.patch
        19 kB
        Tsz Wo Nicholas Sze
      5. 4061_20081124.patch
        20 kB
        Tsz Wo Nicholas Sze
      6. 4061_20081123.patch
        17 kB
        Tsz Wo Nicholas Sze
      7. 4061_20081120b.patch
        13 kB
        Tsz Wo Nicholas Sze
      8. 4061_20081120.patch
        13 kB
        Tsz Wo Nicholas Sze
      9. 4061_20081119.patch
        12 kB
        Tsz Wo Nicholas Sze


          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #671 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/671/)
          HADOOP-4061. Throttle Datanode decommission monitoring in Namenode. (szetszwo)

          Tsz Wo Nicholas Sze added a comment -

          I just committed this.

          Tsz Wo Nicholas Sze added a comment -

          4061_20081124c_0.18.patch: for 0.18

          The patch passed all tests locally. I am going to commit this to 0.18 and above.

          Tsz Wo Nicholas Sze added a comment -
               [exec] +1 overall.  
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
          
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          
          Tsz Wo Nicholas Sze added a comment -

          > It is efficient, I think it is opposite? Every next() operation is a tree search..
          This is a good point.

          > there is a patch that does pretty much the same and 150+ lines smaller.
          Your patch does pretty much the same for the default conf values, i.e. 30 seconds and 5 nodes per interval. But what if we set them to 1 hour and 100 nodes?

          Thanks for your comments, Raghu!

          4061_20081124c.patch: use an Iterator instead of higherEntry(...) for linear time iteration.
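
          As a rough illustration of the cost difference being discussed (generic java.util.NavigableMap scans, not code from the patch; the loop bodies are placeholders):

            // Repeated higherEntry(...) calls restart a tree search from the
            // root, costing O(log n) per step.
            static <K, V> int scanWithHigherEntry(java.util.NavigableMap<K, V> map) {
              int visited = 0;
              for (java.util.Map.Entry<K, V> e = map.firstEntry();
                   e != null; e = map.higherEntry(e.getKey())) {
                visited++;  // visit e.getValue() here
              }
              return visited;
            }

            // An entrySet() iterator walks the tree in order, amortized O(1)
            // per step, which is what "linear time iteration" refers to above.
            static <K, V> int scanWithIterator(java.util.NavigableMap<K, V> map) {
              int visited = 0;
              for (java.util.Map.Entry<K, V> e : map.entrySet()) {
                visited++;  // visit e.getValue() here
              }
              return visited;
            }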

          Raghu Angadi added a comment -

          A non-binding -1 from me. Moving extra code to a new file does not make it free (of maintenance).

          Raghu Angadi added a comment -

          It is efficient, I think it is opposite? Every next() operation is a tree search.. Assuming that is ok, there is a patch that does pretty much the same and 150+ lines smaller. So once in a while one iteration might scan less than 5, that is ok.. that's why we have the shorter 30-second sleep.

          Nicholas, please attach your patch again so that Hudson picks up yours.

          Tsz Wo Nicholas Sze added a comment -

          Suresh, thanks for taking a look at my patch!

          > In CyclicIterable.java, navigablemap != null check not required
          You are right.

          > In DecommissionManager.java, move setting the firstkey in to the return condition.
          We cannot move setting the firstkey since the iteration may end before ++count == numNodesPerCheck.

          4061_20081124b.patch: removed navigablemap != null check and changed some log messages.

          Suresh Srinivas added a comment -

          Minor nits:
          In CyclicIterable.java, navigablemap != null check not required:

          hasnext = navigablemap != null && !next.equals(first);
          

          In DecommissionManager.java, move setting the firstkey into the return condition:

          +          if (++count == numNodesPerCheck) {
          +            firstkey = entry.getKey();   // >> add here
          +            return;
          +          }
          
          Konstantin Shvachko added a comment -

          I think this code is actually simpler than the original version and, importantly, more efficient.
          There is a clear iterator that hides all iterating logic inside.
          If there is a way to avoid storing references, then why store them?
          The way Nicholas did this is by using standard Java library classes and methods.

          Raghu Angadi added a comment -

          I think there is too much code that is not really required, probably complicating it much more than necessary.

          • Storing a reference should be fine. This is Java; we should just check whether the DN has already been deleted.
          • Alternately, in each iteration the monitor can:
              • list all the decommissioned nodes, and
              • then just pick five of them randomly to check.

          Just a suggestion.

          Konstantin Shvachko added a comment -

          +1 I like the approach of using CyclicIterator.

          Tsz Wo Nicholas Sze added a comment -

          4061_20081124.patch: fixed a bug and added a test for CyclicIterator

          Tsz Wo Nicholas Sze added a comment -

          > I propose to store last checked DatanodeDescriptor lastDN in DatanodeManager and get a datanodeMap.tail(lastDN, nonInclusive) on each iteration. The tail map can be used instead of the queue, since it has analogous methods getFirstEntry() and pollFirstEntry().

          We can't poll elements from the tail map, since the tail map is backed by the original map: polling elements from the tail map changes the original map.

          We don't really need a tail map. We need a cyclic iterator that starts an iteration from any point of the map and follows the map's ordering. If the iterator hits the last entry of the map, it will then continue from the first entry.

          4061_20081123.patch: implement cyclic iterator
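
          As a minimal sketch of that cyclic-iteration idea (not the attached CyclicIterator implementation; the visit step is a placeholder):

            // Start just after 'startKey', follow the map's ordering to the
            // end, then wrap around to the head of the map. The tailMap and
            // headMap views are only iterated, never polled, so the backing
            // map is not modified.
            static <K, V> void cyclicScan(java.util.NavigableMap<K, V> map,
                K startKey, int limit) {
              int count = 0;
              for (java.util.Map.Entry<K, V> e : map.tailMap(startKey, false).entrySet()) {
                if (count++ >= limit) return;
                // check decommission state of e.getValue() here
              }
              for (java.util.Map.Entry<K, V> e : map.headMap(startKey, true).entrySet()) {
                if (count++ >= limit) return;
                // check decommission state of e.getValue() here
              }
            }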

          Konstantin Shvachko added a comment -

          This should work, but maybe it would be even better not to create any references at all.
          I propose to store last checked DatanodeDescriptor lastDN in DatanodeManager and get a datanodeMap.tail(lastDN, nonInclusive) on each iteration. The tail map can be used instead of the queue, since it has analogous methods getFirstEntry() and pollFirstEntry().

          Tsz Wo Nicholas Sze added a comment -

          4061_20081120b.patch: add only decommission-in-progress nodes to the DecommissionManager queue.

          Konstantin Shvachko added a comment -

          Nicholas, so what happens if a data-node dies and is removed from datanodeMap?
          It will still be in the DecommissionManager queue.
          Can we implement some kind of iterator over nodes in the decommission-in-progress state by just scanning datanodeMap, without storing extra references to data-nodes?

          Raghu Angadi added a comment - edited

          (edit: correct jira number)
          > I don't think we should redesign decommission feature here. Removing blocks will be a redesign.

          Hmm, this is not done in the patch anyway. It does not look like a redesign in any way to me. It is fairly simple to me: if an excess replica is on a normal DN, it will be deleted (right?), so there is no reason to keep it if that happens to be a decommissioned node. Alright, I noticed HADOOP-4701.

          Tsz Wo Nicholas Sze added a comment -

          Created HADOOP-4701. Please continue the decommission performance improvement discussion there. Thanks.

          Konstantin Shvachko added a comment -

          Raghu, I don't think we should redesign decommission feature here. Removing blocks will be a redesign. The next question would be why these empty nodes are hanging around? Let's shut them all down. Please check HADOOP-681. We should optimize the process while keeping the same behavior and semantics.
          In the current design we treat decommissioned nodes as read-only.

          Tsz Wo Nicholas Sze added a comment -

          4061_20081120.patch:

          Thanks, Konstantin and Raghu!

          @Konstantin

          • This is not true. dfs.namenode.decommission.interval is always in seconds, not minutes. Only the default value is changed. Could you check it again?
          • For code refactoring, I am going to do it in the next step. As mentioned before, I am going to improve decommission performance. For now, I will change the class name to DecommissionManager and have Monitor as an inner class.
          • I would rather rename it to i for iterator and declare it inside the for-loop header.

          Let me create a new issue for improving decommission performance and discuss it there. I believe we need more thought.

          @Raghu> the loop should count 5 decommissioned nodes

          Yes, we should count nodes with decommission in progress.

          Raghu Angadi added a comment -

          > I would rather go with the approach, which counts down decommissioned blocks as they are replicated. Then there is no need to scan all blocks to verify the node is decommissioned, just check the counter.

          The above is also useful and should help a lot here. Note these unnecessary blocks have overhead in other operations too (not just when checking the decommissioned nodes). Essentially, there is no need to treat these separately; we just need to give preference to deleting blocks from decommissioned nodes when there is excess replication.

          Raghu Angadi added a comment -

          > Deleting already decommissioned blocks as Raghu proposes is also not very good. Until the node is shut down its blocks can be accessed for read. We don't want to change that.

          The proposal is to delete it after the block is properly replicated. That seems like the right thing to do and a must for scalability. The main CPU problem here comes from keeping all these excess blocks around.

          I don't see any use in keeping over-replicated blocks. If it is useful, then we should not delete any over-replicated block unless there is no space left.

          Regarding the patch, the loop should count 5 decommissioned nodes rather than any 5 nodes. Otherwise, if you have 2000 nodes then each decommissioned node would be checked only once in 16 hours or so (even if you decommission just one).

          Konstantin Shvachko added a comment -
          • You changed the time units in dfs.namenode.decommission.interval from minutes to seconds. Is it going to be a problem for those who use this config variable? If they set it to 5 (now in minutes) then it is going to be every 5 seconds after your patch.
          • Do we want to introduce a DecommissionMonitor decommissionManager member in FSNamesystem? Then we will be able to move all decommissioning logic into DecommissionMonitor or the manager, which is probably partly a goal of this patch.
            • FSNamesystem.checkDecommissionStateInternal() should be moved into DecommissionMonitor;
            • same as startDecommission() and stopDecommission().
          • In isReplicationInProgress() could you please rename decommissionBlocks to nodeBlocks. It has nothing to do with decommission and is confusing.

          I think this throttling approach will solve the problem for now, but it is not ideal. Say, if you have 500,000 blocks rather than 30,000, then you will have to reconfigure the throttler to scan even fewer nodes.
          Deleting already decommissioned blocks as Raghu proposes is also not very good. Until the node is shut down its blocks can be accessed for read. We don't want to change that.
          I would rather go with the approach, which counts down decommissioned blocks as they are replicated. Then there is no need to scan all blocks to verify the node is decommissioned, just check the counter. We can add the total block scan as a sanity check in stopDecommission(). The counter can also be a good indicator of how much decommissioning progress has been done at every moment.
          We should create a separate jira for these changes.
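
          As a minimal sketch of that counter idea (a proposal only; the field and method names below are hypothetical, not existing NameNode APIs, and java.util / java.util.concurrent.atomic imports are assumed):

            // Track how many blocks still need replication per decommissioning
            // node and decrement as replications complete, so no full block
            // scan is needed to decide when decommissioning is done.
            private final Map<DatanodeDescriptor, AtomicInteger> pendingDecommissionBlocks =
                new HashMap<DatanodeDescriptor, AtomicInteger>();

            void onBlockSafelyReplicated(DatanodeDescriptor decommissioningNode) {
              AtomicInteger remaining = pendingDecommissionBlocks.get(decommissioningNode);
              if (remaining != null && remaining.decrementAndGet() == 0) {
                // All of this node's blocks are replicated elsewhere; the node
                // can be marked decommissioned without scanning its blocks.
              }
            }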

          Tsz Wo Nicholas Sze added a comment -

          4061_20081119.patch: throttling decommission monitor.

          I think it is better to work on the performance improvement and benchmark in separate issues.

          Tsz Wo Nicholas Sze added a comment -

          Instead of checking all blocks in all decommissioning nodes every 5 minutes, we should check a limited number of blocks in a shorter period. This would throttle decommission monitoring in the Namenode. A drawback is that the decommission status update may be delayed.

          > Deleting A from Dx is correct because that's how the NN handles excess replicas in general.

          This is probably a good idea to improve the performance.

          Nigel Daley added a comment -

          This will need some repeatable testing at scale. A decommission benchmark?

          Raghu Angadi added a comment -

          Deleting A from Dx is correct because that's how the NN handles excess replicas in general.

          Raghu Angadi added a comment -

          A related fix that could reduce this load is that the NameNode should delete a block from the decommissioned node once it is replicated, i.e.:

          If block A belongs to decommissioned node Dx, the NameNode replicates the block to another node Dy. Once Dy reports that it has the block, the NN should remove A from Dx. This should be handled in addStoreBlock, I guess.

          Konstantin Shvachko added a comment -

          decommissionedDatanodeCheck() is very inefficient.
          Every 5 minutes it scans all blocks belonging to decommissioning nodes and checks their replication.
          With 400 nodes on a 2000-node cluster, that is one fifth of all blocks checked under a single lock.


            People

            • Assignee: Tsz Wo Nicholas Sze
            • Reporter: Koji Noguchi
            • Votes: 0
            • Watchers: 1
