Issue Details (XML | Word | Printable)

Key: HADOOP-1652
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Hairong Kuang
Reporter: Hairong Kuang
Votes: 1
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Rebalance data blocks when new data nodes added or data nodes become full

Created: 25/Jul/07 06:47 PM   Updated: 13/Oct/09 09:31 PM
Return to search
Component/s: None
Affects Version/s: 0.13.0
Fix Version/s: 0.16.0

Time Tracking:
Not Specified

File Attachments:
  Size
HTML File Licensed for inclusion in ASF works Balancer.html 2009-10-13 09:31 PM Ravi Phulari 20 kB
Text File Licensed for inclusion in ASF works balancer.patch 2007-10-26 08:51 PM Hairong Kuang 51 kB
Text File Licensed for inclusion in ASF works balancer1.patch 2007-11-09 08:12 PM Hairong Kuang 65 kB
Text File Licensed for inclusion in ASF works balancer2.patch 2007-11-20 08:38 AM Hairong Kuang 71 kB
Text File Licensed for inclusion in ASF works balancer3.patch 2007-11-26 09:18 PM Hairong Kuang 75 kB
Text File Licensed for inclusion in ASF works balancer4.patch 2007-11-29 06:17 PM Hairong Kuang 71 kB
Text File Licensed for inclusion in ASF works balancer5.patch 2007-12-04 06:50 PM Hairong Kuang 71 kB
Text File Licensed for inclusion in ASF works balancer6.patch 2007-12-04 10:10 PM Hairong Kuang 71 kB
Text File Licensed for inclusion in ASF works balancer7.patch 2007-12-05 12:52 AM Hairong Kuang 72 kB
Text File Licensed for inclusion in ASF works balancer8.patch 2007-12-05 06:46 PM Hairong Kuang 72 kB
PDF File Licensed for inclusion in ASF works BalancerAdminGuide.pdf 2007-11-08 11:26 PM Hairong Kuang 13 kB
PDF File Licensed for inclusion in ASF works BalancerAdminGuide1.pdf 2007-11-20 08:43 AM Hairong Kuang 14 kB
PDF File Licensed for inclusion in ASF works BalancerUserGuide2.pdf 2007-12-04 07:48 PM Hairong Kuang 14 kB
PDF File Licensed for inclusion in ASF works RebalanceDesign4.pdf 2007-08-10 10:55 PM Hairong Kuang 47 kB
PDF File Licensed for inclusion in ASF works RebalanceDesign5.pdf 2007-08-22 09:08 PM Hairong Kuang 45 kB
PDF File Licensed for inclusion in ASF works RebalanceDesign6.pdf 2007-10-23 10:48 PM Hairong Kuang 50 kB
Issue Links:
Dependants
dependent
 

Resolution Date: 05/Dec/07 07:49 PM


 Description  « Hide
When a new data node joins hdfs cluster, it does not hold much data. So any map task assigned to the machine most likely does not read local data, thus increasing the use of network bandwidth. On the other hand, when some data nodes become full, new data blocks are placed on only non-full data nodes, thus reducing their read parallelism.

This jira aims to find an approach to redistribute data blocks when imbalance occurs in the cluster. An solution should meet the following requirements:
1. It maintains data availablility guranteens in the sense that rebalancing does not reduce the number of replicas that a block has or the number of racks that the block resides.
2. An adminstrator should be able to invoke and interrupt rebalancing from a command line.
3. Rebalancing should be throttled so that rebalancing does not cause a namenode to be too busy to serve any incoming request or saturate the network.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hairong Kuang added a comment - 25/Jul/07 09:17 PM - edited
Here are some of my initial thoughts. Please comment.

1. What's balance?
A cluster is balanced iff there is no under-capactiy or over-capacity data nodes in the cluster.
An under-capacity data node is a node that its %used space is less than avg_%used_space-threshhold.
An over-capacity data node is a node that its %used space is greater than avg_%used_space+threshhold.
A threshold is user configurable. A default value could be 20% of % used space.

2. When to rebalance?
Rebanlancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing automatically shuts off once the cluster is balanced and can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
-----Start/stop data block rebalancing or query its status.

3. How to balance?
(a) Upon receiving a data block rebalancing request, a name node creates a Balancing thread.
(b) The thread performs rebalancing iteratively.

  1. At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations;
  2. When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to the data node. The source data node is chosen randomly from over-capacity data nodes or non-under-capacity data nodes if no over-capacity data node exists. The source block is randomly chosen from the source data node as long as the block moving does not violate requirement (1).
  3. If the thread finds an over-capacity data node, it scheduls moving blocks from the data node to other data nodes. It chooses a target data node randomly from under-capacity data nodes or non-over-capcity data nodes when there is no under-capacity data node; It then randomly chooses a source block that does not violate requirement (1).
  4. The scheduled tasks are put to a queue in the source data node. The task queue has a limited length of 4 by default and is configurable.
  5. The scheduled tasks are sent to data nodes to execute in responding to a heartbeat message. Currently dfs limits at most 2 tasks per heartbeat by default.
    (c) The thread stops and frees itself when the cluster becomes balanced.

Hairong Kuang added a comment - 25/Jul/07 10:10 PM
Some more thoughts for discussion...

1. Put the cluster in safe mode while rebalacing. This allows us to more aggressively schedules block moving tasks but it interrupts the current running of the cluster.
2. Spawn a seprate process on the client side to do all the scheduling work. A name node ships a snapshot of all data node descriptors & all blocks to the process in the begining. In the end, the process sends all the scheduled tasks back to the namenode. This approach does not interrupt namenode work but it requires shipping large amount of data from namenode in the beginning & then to namenode in the end.


Hairong Kuang added a comment - 26/Jul/07 01:50 AM
A destination datanode should also have an upper limit on the number of blocks to be scheduled to receive. Currently in dfs when re-replicating blocks, blocks are pushed from source to destination. It enforces max number of concurrent block transfers for the source, but it does not have any limit on the destination side. So a destination data node may end up receiving many blocks concurrently. Shall we also limit the number of concurret writes on a data node?

Enis Soztutar added a comment - 26/Jul/07 06:04 AM
A balanced cluster, in terms of disk space usage, should be one in which the percentage of used disk space is balanced. Please take a look at HADOOP-1530.

Hairong Kuang added a comment - 26/Jul/07 05:40 PM
Yes, I agree with you. My definition of a balanced cluster was in term of %used space.

Hairong Kuang added a comment - 26/Jul/07 07:19 PM
When there is no over-capacity data node in the cluster, a source data node should be chosen from data nodes whose %used space is above average %used space, among which we should favor data nodes that are on the same rack as the target data node. I am not sure if we should favor fuller data nodes or not because it is more expensive than a random selection, but it makes the algorithm to converge faster. This raised another question that how we can guarantee that the algorithm converges if rebalancing is not done in the safe mode.

Stu Hood added a comment - 26/Jul/07 08:56 PM

...among which we should favor data nodes that are on the same rack as the target data node.

This is a tradeoff though... one of the reasons for recording what racks blocks are on is to prevent putting "all of your eggs in one basket" so to speak (See the middle paragraph of http://lucene.apache.org/hadoop/hdfs_design.html#Data+Replication ).


Hairong Kuang added a comment - 26/Jul/07 11:31 PM
The reason that I'd like to favor data nodes that are on the same rack as the destination node is that dfs does not need to check if "all of your eggs are in one basket" when moving a block from such a node to the destination. When I say "move", I meant replicating to the destination and removing from the source.

Arun C Murthy added a comment - 27/Jul/07 04:10 AM
I'd say, semantically, that a stronger definition of a balanced cluster is one in which the percentage of free disk-space (across all available partitions, per-node) is balanced.

I know, it's a bit of a word-play, but yet ... smile


Hairong Kuang added a comment - 28/Jul/07 12:05 AM
Arun, yes, I agree. This can be done by instructing data nodes to balance its partitions. I'd like to do it in a different jira.

Hairong Kuang added a comment - 10/Aug/07 10:55 PM
This is a more detailed design document for the rebalancing tool. A major change is that the rebalancing decisions are made in a seperate process from name node. Data nodes are throttled to prevent rebalancing using too much network bandwidth. I also add the protocol design, race condition discusssion, and a test plan.

Hairong Kuang made changes - 10/Aug/07 10:55 PM
Field Original Value New Value
Attachment RebalanceDesign4.pdf [ 12363636 ]
Hairong Kuang added a comment - 22/Aug/07 09:08 PM
Attached is a new version of the design document which includes the following changes:
1. A block is pulled by a destination instead of pushed from a source.
2. A block can be pulled from a proxy source which contains the block and is closer to the destination or less loaded than the source.
3. A source does not delete a block by itself. Instead a destination notifies the name node that a block is copied and a hint that a replica at the source should be removed if the block is over replicated. When the name node decides to remove an excessive replica, it favors the hint as long as doing so does not violate the rebalancing requirement 1. A source removes the block upon a request from the name node.

Hairong Kuang made changes - 22/Aug/07 09:08 PM
Attachment RebalanceDesign5.pdf [ 12364355 ]
Hairong Kuang made changes - 06/Sep/07 11:24 PM
Link This issue depends upon HADOOP-1266 [ HADOOP-1266 ]
Hairong Kuang made changes - 06/Sep/07 11:29 PM
Link This issue blocks HADOOP-1846 [ HADOOP-1846 ]
Hairong Kuang made changes - 06/Sep/07 11:30 PM
Link This issue blocks HADOOP-1846 [ HADOOP-1846 ]
Hairong Kuang made changes - 06/Sep/07 11:31 PM
Link This issue depends upon HADOOP-1846 [ HADOOP-1846 ]
Hairong Kuang made changes - 17/Sep/07 10:19 PM
Link This issue blocks HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 17/Sep/07 10:22 PM
Link This issue blocks HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 17/Sep/07 10:22 PM
Link This issue depends on HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 18/Sep/07 12:05 AM
Link This issue depends on HADOOP-1914 [ HADOOP-1914 ]
Hairong Kuang made changes - 20/Sep/07 10:21 PM
Link This issue incorporates HADOOP-1914 [ HADOOP-1914 ]
Hairong Kuang made changes - 20/Sep/07 10:21 PM
Link This issue depends on HADOOP-1914 [ HADOOP-1914 ]
Hairong Kuang made changes - 20/Sep/07 10:22 PM
Link This issue incorporates HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 20/Sep/07 10:23 PM
Link This issue depends on HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 20/Sep/07 10:34 PM
Link This issue depends on HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 20/Sep/07 10:35 PM
Link This issue depends on HADOOP-1914 [ HADOOP-1914 ]
Hairong Kuang made changes - 20/Sep/07 10:35 PM
Link This issue incorporates HADOOP-1912 [ HADOOP-1912 ]
Hairong Kuang made changes - 20/Sep/07 10:35 PM
Link This issue incorporates HADOOP-1914 [ HADOOP-1914 ]
Owen O'Malley made changes - 28/Sep/07 04:17 PM
Fix Version/s 0.15.0 [ 12312565 ]
Fix Version/s 0.16.0 [ 12312740 ]
Hairong Kuang added a comment - 23/Oct/07 10:48 PM - edited
Upload the most updated design document.

Hairong Kuang made changes - 23/Oct/07 10:48 PM
Attachment RebalanceDesign6.pdf [ 12368261 ]
Hairong Kuang added a comment - 26/Oct/07 08:51 PM
A patch for review.

Hairong Kuang made changes - 26/Oct/07 08:51 PM
Attachment balancer.patch [ 12368519 ]
Koji Noguchi added a comment - 29/Oct/07 11:11 PM
Should we perform rebalance if the cluster is not 'finalizeUpgrade' d?

If yes, maybe one test case for this?


Hairong Kuang added a comment - 30/Oct/07 02:44 AM
Currently balancer does not check if the cluster is "finalizedUpgrage" d or not. I can run a test to see what's going on if reblancing is performed when a cluster is upgraded but not finalized yet.

Hairong Kuang added a comment - 08/Nov/07 11:26 PM
The balancer administrator guide is attached.

Hairong Kuang made changes - 08/Nov/07 11:26 PM
Attachment BalancerAdminGuide.pdf [ 12369201 ]
Hairong Kuang added a comment - 09/Nov/07 07:10 PM
Here is a new patch that incorporates Sanjay's first pass of review comments.

Hairong Kuang made changes - 09/Nov/07 07:10 PM
Attachment balancer1.patch [ 12369254 ]
Hairong Kuang made changes - 09/Nov/07 08:11 PM
Attachment balancer1.patch [ 12369254 ]
Hairong Kuang made changes - 09/Nov/07 08:12 PM
Attachment balancer1.patch [ 12369265 ]
Hairong Kuang added a comment - 20/Nov/07 08:38 AM
This patch incorported the second round of review comments from Sanjay. Particularly it adds the following features:
1. Disallow more than one balancer running in an HDFS;
2. Each balancing iteration runs no more than 20 minutes;
3. Disallow a block to move more than once during the whole process of balancing;
4. Each block move has a timeout;
5. Each line of output has a timestamp.

Hairong Kuang made changes - 20/Nov/07 08:38 AM
Attachment balancer2.patch [ 12369856 ]
Hairong Kuang added a comment - 20/Nov/07 08:43 AM
Here is a new administrator guide that reflects the change to the balancer.

Hairong Kuang made changes - 20/Nov/07 08:43 AM
Attachment BalancerAdminGuide1.pdf [ 12369857 ]
Hairong Kuang added a comment - 26/Nov/07 09:18 PM
I added one more change to the balancer. It shuffles the datanode array before constructing overUtlilizedDatanodeList, underUtilizedDatanodeList etc. This adds some randomness to the static source/target datanode match.

Hairong Kuang made changes - 26/Nov/07 09:18 PM
Attachment balancer3.patch [ 12370239 ]
Hairong Kuang added a comment - 29/Nov/07 06:17 PM
This is a new patch that's built from the most recent trunk. It removed the part of the code that gets comitted in HADOOP-2256.

Hairong Kuang made changes - 29/Nov/07 06:17 PM
Attachment balancer4.patch [ 12370565 ]
Hairong Kuang added a comment - 04/Dec/07 06:50 PM
The patch incorporates last review comments from Sanjay.

Hairong Kuang made changes - 04/Dec/07 06:50 PM
Attachment balancer5.patch [ 12370962 ]
Hairong Kuang made changes - 04/Dec/07 06:50 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hairong Kuang added a comment - 04/Dec/07 07:48 PM
This is an updated user guide.

Hairong Kuang made changes - 04/Dec/07 07:48 PM
Attachment BalancerUserGuide2.pdf [ 12370966 ]
Hadoop QA added a comment - 04/Dec/07 08:22 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370962/balancer5.patch
against trunk revision r601038.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs -1. The patch appears to introduce 9 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/console

This message is automatically generated.


Sanjay Radia added a comment - 04/Dec/07 09:49 PM
Code looks good
+1

Hairong Kuang made changes - 04/Dec/07 10:09 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Hairong Kuang added a comment - 04/Dec/07 10:10 PM
The patch fixed the findbugs errors.

Hairong Kuang made changes - 04/Dec/07 10:10 PM
Attachment balancer6.patch [ 12370975 ]
Hairong Kuang made changes - 04/Dec/07 10:10 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 05/Dec/07 12:14 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370975/balancer6.patch
against trunk revision r601111.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs -1. The patch appears to introduce 2 new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests +1. The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/console

This message is automatically generated.


Hairong Kuang made changes - 05/Dec/07 12:46 AM
Status Patch Available [ 10002 ] Open [ 1 ]
Hairong Kuang added a comment - 05/Dec/07 12:52 AM
findbugs fix.

Hairong Kuang made changes - 05/Dec/07 12:52 AM
Attachment balancer7.patch [ 12370987 ]
Hairong Kuang made changes - 05/Dec/07 12:52 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 05/Dec/07 03:16 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370987/balancer7.patch
against trunk revision r601111.

@author +1. The patch does not contain any @author tags.

javadoc +1. The javadoc tool did not generate any warning messages.

javac +1. The applied patch does not generate any new compiler warnings.

findbugs +1. The patch does not introduce any new Findbugs warnings.

core tests +1. The patch passed core unit tests.

contrib tests -1. The patch failed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/console

This message is automatically generated.


Hairong Kuang added a comment - 05/Dec/07 06:46 PM
The patch has a minor change to make the junit test to run faster.

Hairong Kuang made changes - 05/Dec/07 06:46 PM
Attachment balancer8.patch [ 12371057 ]
Repository Revision Date User Message
ASF #601491 Wed Dec 05 19:49:16 UTC 2007 dhruba HADOOP-1652. A utility to balance data among datanodes in a HDFS cluster.
(Hairong Kuang via dhruba)
Files Changed
ADD /lucene/hadoop/trunk/bin/stop-balancer.sh
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/DFSClient.java
MODIFY /lucene/hadoop/trunk/src/test/org/apache/hadoop/dfs/MiniDFSCluster.java
ADD /lucene/hadoop/trunk/bin/start-balancer.sh
MODIFY /lucene/hadoop/trunk/bin/hadoop
MODIFY /lucene/hadoop/trunk/conf/hadoop-default.xml
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/DataNode.java
ADD /lucene/hadoop/trunk/src/test/org/apache/hadoop/dfs/TestBalancer.java
ADD /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/Balancer.java
MODIFY /lucene/hadoop/trunk/CHANGES.txt

dhruba borthakur added a comment - 05/Dec/07 07:49 PM
I just committed this. Thanks Hairong!

dhruba borthakur made changes - 05/Dec/07 07:49 PM
Resolution Fixed [ 1 ]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Hudson added a comment - 06/Dec/07 11:59 AM

Jim Kellerman made changes - 15/Jan/08 07:42 PM
Link This issue blocks HADOOP-2138 [ HADOOP-2138 ]
Nigel Daley made changes - 08/Feb/08 11:37 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]
Ravi Phulari made changes - 13/Oct/09 09:31 PM
Attachment Balancer.html [ 12422024 ]