Hadoop Common · HADOOP-1652

Rebalance data blocks when new data nodes added or data nodes become full

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None

      Description

      When a new data node joins an HDFS cluster, it holds little data, so any map task assigned to that machine most likely reads no local data, increasing network bandwidth use. Conversely, when some data nodes become full, new data blocks are placed only on the non-full data nodes, reducing their read parallelism.

      This jira aims to find an approach to redistribute data blocks when imbalance occurs in the cluster. A solution should meet the following requirements:
      1. It maintains data availability guarantees, in the sense that rebalancing reduces neither the number of replicas a block has nor the number of racks the block resides on.
      2. An administrator should be able to invoke and interrupt rebalancing from a command line.
      3. Rebalancing should be throttled so that it neither makes a namenode too busy to serve incoming requests nor saturates the network.
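Requirement 3 amounts to capping the bytes per second that balancing transfers may consume. A minimal sketch of such a throttle, modeled on the idea only (the class name and interface are illustrative, not Hadoop's actual API):

```python
import time

class Throttler:
    """Cap the average bytes/sec a balancing transfer may use by sleeping
    whenever the transfer runs ahead of its allowed rate."""
    def __init__(self, bytes_per_sec):
        self.bytes_per_sec = bytes_per_sec
        self._start = time.monotonic()
        self._sent = 0

    def throttle(self, nbytes):
        # Record the bytes just sent, then sleep if we are ahead of schedule.
        self._sent += nbytes
        expected_elapsed = self._sent / self.bytes_per_sec
        actual_elapsed = time.monotonic() - self._start
        if expected_elapsed > actual_elapsed:
            time.sleep(expected_elapsed - actual_elapsed)
```

A data node would call `throttle(n)` after each chunk of a balancing transfer, leaving regular traffic unthrottled.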

      1. Balancer.html
        20 kB
        Ravi Phulari
      2. balancer.patch
        51 kB
        Hairong Kuang
      3. balancer1.patch
        65 kB
        Hairong Kuang
      4. balancer2.patch
        71 kB
        Hairong Kuang
      5. balancer3.patch
        75 kB
        Hairong Kuang
      6. balancer4.patch
        71 kB
        Hairong Kuang
      7. balancer5.patch
        71 kB
        Hairong Kuang
      8. balancer6.patch
        71 kB
        Hairong Kuang
      9. balancer7.patch
        72 kB
        Hairong Kuang
      10. balancer8.patch
        72 kB
        Hairong Kuang
      11. BalancerAdminGuide.pdf
        13 kB
        Hairong Kuang
      12. BalancerAdminGuide1.pdf
        14 kB
        Hairong Kuang
      13. BalancerUserGuide2.pdf
        14 kB
        Hairong Kuang
      14. RebalanceDesign4.pdf
        47 kB
        Hairong Kuang
      15. RebalanceDesign5.pdf
        45 kB
        Hairong Kuang
      16. RebalanceDesign6.pdf
        50 kB
        Hairong Kuang

          Activity

          Hairong Kuang added a comment - - edited

          Here are some of my initial thoughts. Please comment.

          1. What's balance?
          A cluster is balanced iff there is no under-capacity or over-capacity data node in the cluster.
          An under-capacity data node is one whose %used space is less than avg_%used_space - threshold.
          An over-capacity data node is one whose %used space is greater than avg_%used_space + threshold.
          The threshold is user-configurable. A default value could be 20% of %used space.
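The classification above can be written out concretely. A minimal sketch (an illustrative model, not Hadoop code; the `classify` function and data layout are assumptions for the example):

```python
def classify(datanodes, threshold=0.20):
    """Classify datanodes relative to the cluster-average utilization.

    datanodes: dict of name -> (used_bytes, capacity_bytes).
    Returns (under_capacity, over_capacity) lists of names."""
    avg = (sum(u for u, _ in datanodes.values())
           / sum(c for _, c in datanodes.values()))
    under, over = [], []
    for name, (used, cap) in datanodes.items():
        pct = used / cap
        if pct < avg - threshold:
            under.append(name)       # candidate target for block moves
        elif pct > avg + threshold:
            over.append(name)        # candidate source for block moves
    return under, over
```

Nodes within the threshold band around the average are left alone, which is what lets balancing terminate.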

          2. When to rebalance?
          Rebalancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing automatically shuts off once the cluster is balanced and can also be interrupted by an administrator. The following commands are to be supported:
          hadoop dfsadmin balance <start|stop|get>
          -----Start/stop data block rebalancing, or query its status.

          3. How to balance?
          (a) Upon receiving a data block rebalancing request, a name node creates a Balancing thread.
          (b) The thread performs rebalancing iteratively.

          1. At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations;
          2. When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to that data node. The source data node is chosen randomly from the over-capacity data nodes, or from the non-under-capacity data nodes if no over-capacity data node exists. The source block is chosen randomly from the source data node, as long as the move does not violate requirement (1).
          3. If the thread finds an over-capacity data node, it schedules moving blocks from that data node to other data nodes. It chooses a target data node randomly from the under-capacity data nodes, or from the non-over-capacity data nodes when there is no under-capacity data node; it then randomly chooses a source block that does not violate requirement (1).
          4. The scheduled tasks are put into a queue on the source data node. The task queue has a limited length, 4 by default, and is configurable.
          5. The scheduled tasks are sent to data nodes for execution in response to a heartbeat message. Currently dfs allows at most 2 tasks per heartbeat by default.
            (c) The thread stops and frees itself when the cluster becomes balanced.
          Hairong Kuang added a comment -

          Some more thoughts for discussion...

          1. Put the cluster in safe mode while rebalancing. This allows us to schedule block moving tasks more aggressively, but it disrupts the cluster's current operation.
          2. Spawn a separate process on the client side to do all the scheduling work. The name node ships a snapshot of all data node descriptors & all blocks to the process at the beginning. At the end, the process sends all the scheduled tasks back to the name node. This approach does not interrupt name node work, but it requires shipping a large amount of data from the name node at the beginning & back to the name node at the end.

          Hairong Kuang added a comment -

          A destination datanode should also have an upper limit on the number of blocks it is scheduled to receive. Currently in dfs, when re-replicating blocks, blocks are pushed from source to destination. This enforces a maximum number of concurrent block transfers for the source, but there is no limit on the destination side, so a destination data node may end up receiving many blocks concurrently. Shall we also limit the number of concurrent writes on a data node?
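A destination-side cap like the one proposed here could be as simple as a counting semaphore that refuses, rather than queues, transfers beyond the limit. A hypothetical sketch (class and method names are illustrative, not Hadoop's API):

```python
import threading

class TransferLimiter:
    """Cap the number of concurrent block writes a datanode will accept."""
    def __init__(self, max_concurrent=2):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def try_accept(self):
        # Non-blocking: refuse the incoming transfer instead of queueing it,
        # so the sender can retry or pick another destination.
        return self._sem.acquire(blocking=False)

    def done(self):
        # Called when a block write finishes, freeing a slot.
        self._sem.release()
```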

          Enis Soztutar added a comment -

          A balanced cluster, in terms of disk space usage, should be one in which the percentage of used disk space is balanced. Please take a look at HADOOP-1530.

          Hairong Kuang added a comment -

          Yes, I agree with you. My definition of a balanced cluster was in terms of %used space.

          Hairong Kuang added a comment -

          When there is no over-capacity data node in the cluster, a source data node should be chosen from data nodes whose %used space is above the average %used space, among which we should favor data nodes that are on the same rack as the target data node. I am not sure whether we should favor fuller data nodes: doing so is more expensive than a random selection, but it makes the algorithm converge faster. This raises another question: how can we guarantee that the algorithm converges if rebalancing is not done in safe mode?

          Stu Hood added a comment -

          ...among which we should favor data nodes that are on the same rack as the target data node.

          This is a tradeoff though... one of the reasons for recording what racks blocks are on is to prevent putting "all of your eggs in one basket" so to speak (See the middle paragraph of http://lucene.apache.org/hadoop/hdfs_design.html#Data+Replication ).

          Hairong Kuang added a comment -

          The reason I'd like to favor data nodes that are on the same rack as the destination node is that dfs does not need to check whether "all of your eggs are in one basket" when moving a block from such a node to the destination. By "move", I mean replicating to the destination and removing from the source.

          Arun C Murthy added a comment -

          I'd say, semantically, that a stronger definition of a balanced cluster is one in which the percentage of free disk-space (across all available partitions, per-node) is balanced.

          I know, it's a bit of a word-play, but yet ... :)

          Hairong Kuang added a comment -

          Arun, yes, I agree. This can be done by instructing data nodes to balance their own partitions. I'd like to do it in a different jira.

          Hairong Kuang added a comment -

          This is a more detailed design document for the rebalancing tool. A major change is that rebalancing decisions are made in a separate process from the name node. Data nodes are throttled to prevent rebalancing from using too much network bandwidth. I also added the protocol design, a race-condition discussion, and a test plan.

          Hairong Kuang added a comment -

          Attached is a new version of the design document which includes the following changes:
          1. A block is pulled by a destination instead of pushed from a source.
          2. A block can be pulled from a proxy source which contains the block and is closer to the destination or less loaded than the source.
          3. A source does not delete a block by itself. Instead, the destination notifies the name node that a block has been copied, with a hint that the replica at the source should be removed if the block is over-replicated. When the name node decides to remove an excess replica, it favors the hint as long as doing so does not violate rebalancing requirement 1. The source removes the block upon a request from the name node.
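The pull-plus-deletion-hint protocol in change 3 can be modeled in a few lines. A hypothetical sketch (the `MockNameNode` class and its methods are illustrative, not Hadoop's NameNode interface):

```python
class MockNameNode:
    """Model of the deletion-hint handling: the source replica is dropped
    only once the name node confirms the block is over-replicated."""
    def __init__(self, replicas, replication=3):
        self.replicas = replicas          # block id -> set of datanode names
        self.replication = replication    # target replication factor

    def block_received(self, block, dest, delete_hint):
        # Destination reports the copied block along with a deletion hint.
        self.replicas[block].add(dest)
        # Honor the hint only if the block is now over-replicated; the
        # availability guarantee (requirement 1) is never weakened.
        if len(self.replicas[block]) > self.replication:
            self.replicas[block].discard(delete_hint)
```

Because the copy completes before any deletion is even considered, the replica count never dips below the target.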

          Hairong Kuang added a comment - - edited

          Upload the most updated design document.

          Hairong Kuang added a comment -

          A patch for review.

          Koji Noguchi added a comment -

          Should we perform rebalance if the cluster is not 'finalizeUpgrade'd?

          If yes, maybe one test case for this?

          Hairong Kuang added a comment -

          Currently the balancer does not check whether the cluster is 'finalizeUpgrade'd or not. I can run a test to see what happens if rebalancing is performed when a cluster is upgraded but not yet finalized.

          Hairong Kuang added a comment -

          The balancer administrator guide is attached.

          Hairong Kuang added a comment -

          Here is a new patch that incorporates Sanjay's first pass of review comments.

          Hairong Kuang added a comment -

          This patch incorporated the second round of review comments from Sanjay. In particular, it adds the following features:
          1. Disallow more than one balancer running in an HDFS;
          2. Each balancing iteration runs no more than 20 minutes;
          3. Disallow a block to move more than once during the whole process of balancing;
          4. Each block move has a timeout;
          5. Each line of output has a timestamp.
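Feature 3 (no block moved more than once per run) only needs a set of already-moved block ids consulted before scheduling. A minimal sketch (illustrative names, not the actual Balancer code):

```python
class MovedBlocks:
    """Remember blocks already moved in this balancing run, so no block
    is ever scheduled to move twice."""
    def __init__(self):
        self._moved = set()

    def mark(self, block_id):
        # Returns True if the block may be moved now; False if it has
        # already been moved once during this run.
        if block_id in self._moved:
            return False
        self._moved.add(block_id)
        return True
```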

          Hairong Kuang added a comment -

          Here is a new administrator guide that reflects the change to the balancer.

          Hairong Kuang added a comment -

          I added one more change to the balancer. It shuffles the datanode array before constructing overUtilizedDatanodeList, underUtilizedDatanodeList, etc. This adds some randomness to the otherwise static source/target datanode matching.
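The shuffle-then-partition step described here looks roughly like this (a sketch assuming per-node utilization percentages; function and variable names are illustrative):

```python
import random

def partition(utilization, avg, threshold=0.20):
    """Shuffle the datanodes first, so equally-eligible nodes are matched
    in a random order rather than a fixed one, then split them into
    over- and under-utilized lists.

    utilization: dict of name -> fraction of space used."""
    nodes = list(utilization.items())
    random.shuffle(nodes)   # the randomness added in this change
    over = [name for name, pct in nodes if pct > avg + threshold]
    under = [name for name, pct in nodes if pct < avg - threshold]
    return over, under
```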

          Hairong Kuang added a comment -

          This is a new patch built against the most recent trunk. It removes the part of the code that was committed in HADOOP-2256.

          Hairong Kuang added a comment -

          The patch incorporates last review comments from Sanjay.

          Hairong Kuang added a comment -

          This is an updated user guide.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12370962/balancer5.patch
          against trunk revision r601038.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 9 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/console

          This message is automatically generated.

          Sanjay Radia added a comment -

          Code looks good
          +1

          Hairong Kuang added a comment -

          The patch fixed the findbugs errors.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12370975/balancer6.patch
          against trunk revision r601111.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs -1. The patch appears to introduce 2 new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          findbugs fix.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12370987/balancer7.patch
          against trunk revision r601111.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          The patch has a minor change to make the JUnit test run faster.

          dhruba borthakur added a comment -

          I just committed this. Thanks Hairong!

          Hudson added a comment -

          Integrated in Hadoop-Nightly #324 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/324/)

            People

            • Assignee:
              Hairong Kuang
            • Reporter:
              Hairong Kuang
            • Votes: 1
            • Watchers: 3
