Some more thoughts for discussion...
1. Put the cluster in safe mode while rebalancing. This allows us to schedule block-moving tasks more aggressively, but it interrupts the normal operation of the cluster. A destination datanode should also have an upper limit on the number of blocks it is scheduled to receive. Currently in dfs, when re-replicating blocks, blocks are pushed from source to destination. It enforces a maximum number of concurrent block transfers for the source, but it does not have any limit on the destination side. So a destination data node may end up receiving many blocks concurrently. Shall we also limit the number of concurrent writes on a data node?
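To make the destination-side limit concrete, here is a minimal sketch in Java, assuming each data node keeps a fixed pool of "receive slots". The class name `IncomingTransferLimiter` and its methods are hypothetical and are not taken from dfs; the sketch only shows how a cap on concurrent incoming block writes could be enforced.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical helper: caps the number of blocks a data node receives at once. */
class IncomingTransferLimiter {
    private final Semaphore permits;

    IncomingTransferLimiter(int maxConcurrentReceives) {
        this.permits = new Semaphore(maxConcurrentReceives);
    }

    /** Returns true if the transfer is accepted; the caller must call finishReceive() when done. */
    boolean tryStartReceive() {
        return permits.tryAcquire();
    }

    /** Frees the slot taken by a completed or failed transfer. */
    void finishReceive() {
        permits.release();
    }

    /** Number of additional transfers this node would accept right now. */
    int availableSlots() {
        return permits.availablePermits();
    }
}
```

A destination that returns false from `tryStartReceive()` would simply be skipped by the scheduler until one of its transfers finishes.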
A balanced cluster, in terms of disk space usage, should be one in which the percentage of used disk space is balanced. Please take a look at HADOOP-1530.
Yes, I agree with you. My definition of a balanced cluster was in terms of %used space.
When there is no over-capacity data node in the cluster, a source data node should be chosen from data nodes whose %used space is above the average %used space, among which we should favor data nodes that are on the same rack as the target data node. I am not sure whether we should favor fuller data nodes, because it is more expensive than a random selection, but it makes the algorithm converge faster. This raises another question: how can we guarantee that the algorithm converges if rebalancing is not done in safe mode?
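The source selection described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Node` class, the `candidateSources` method, and the use of a plain arithmetic mean of per-node %used are all hypothetical, and the sketch does not decide the open question of whether fuller nodes should be preferred.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of choosing source data nodes for a given target. */
class SourceSelection {
    static final class Node {
        final String name;
        final String rack;
        final double percentUsed;

        Node(String name, String rack, double percentUsed) {
            this.name = name;
            this.rack = rack;
            this.percentUsed = percentUsed;
        }
    }

    /**
     * Returns nodes whose %used is above the average, with nodes on the
     * target's rack listed first. Input order is otherwise preserved.
     */
    static List<Node> candidateSources(List<Node> nodes, Node target) {
        double sum = 0;
        for (Node n : nodes) {
            sum += n.percentUsed;
        }
        double avg = nodes.isEmpty() ? 0 : sum / nodes.size();

        List<Node> sameRack = new ArrayList<>();
        List<Node> otherRack = new ArrayList<>();
        for (Node n : nodes) {
            if (n == target || n.percentUsed <= avg) {
                continue; // only nodes above the average can be sources
            }
            if (n.rack.equals(target.rack)) {
                sameRack.add(n);
            } else {
                otherRack.add(n);
            }
        }
        sameRack.addAll(otherRack);
        return sameRack;
    }
}
```

Picking randomly within the same-rack group, or sorting it by %used, would be a small change on top of this.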
This is a tradeoff though... one of the reasons for recording what racks blocks are on is to prevent putting "all of your eggs in one basket" so to speak (see the middle paragraph of http://lucene.apache.org/hadoop/hdfs_design.html#Data+Replication). The reason that I'd like to favor data nodes that are on the same rack as the destination node is that dfs then does not need to check whether "all of your eggs are in one basket".
I'd say, semantically, that a stronger definition of a balanced cluster is one in which the percentage of free disk-space (across all available partitions, per-node) is balanced.
I know, it's a bit of a word-play, but yet ... :) Arun, yes, I agree. This can be done by instructing data nodes to balance their partitions. I'd like to do it in a different jira.
This is a more detailed design document for the rebalancing tool. A major change is that the rebalancing decisions are made in a separate process from the name node. Data nodes are throttled to prevent rebalancing from using too much network bandwidth. I also added the protocol design, a race condition discussion, and a test plan.
Attached is a new version of the design document which includes the following changes:
1. A block is pulled by a destination instead of pushed from a source.
2. A block can be pulled from a proxy source which contains the block and is closer to the destination or less loaded than the source.
3. A source does not delete a block by itself. Instead, a destination notifies the name node that a block has been copied, with a hint that a replica at the source should be removed if the block is over-replicated. When the name node decides to remove an excess replica, it favors the hint as long as doing so does not violate rebalancing requirement 1. A source removes the block upon a request from the name node.
Uploaded the most up-to-date design document.
Should we perform rebalancing if the cluster has not been 'finalizeUpgrade'd?
If yes, maybe one test case for this? Currently the balancer does not check whether the cluster has been "finalizeUpgrade"d or not. I can run a test to see what happens if rebalancing is performed when a cluster is upgraded but not finalized yet.
The balancer administrator guide is attached.
Here is a new patch that incorporates Sanjay's first pass of review comments.
This patch incorporates the second round of review comments from Sanjay. In particular, it adds the following features:
1. Disallow more than one balancer running in an HDFS;
2. Each balancing iteration runs no more than 20 minutes;
3. Disallow a block to move more than once during the whole process of balancing;
4. Each block move has a timeout;
5. Each line of output has a timestamp.
Here is a new administrator guide that reflects the change to the balancer.
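Two of the rules listed above, the 20-minute cap per iteration and the move-at-most-once rule, are simple enough to sketch. This is a hypothetical illustration in Java, not code from the patch: `MoveGuard`, `markMoved`, and `iterationExpired` are assumed names, and blocks are identified here by a plain `long` id.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bookkeeping for one balancer run. */
class MoveGuard {
    /** Upper bound on the length of one balancing iteration (20 minutes). */
    static final long MAX_ITERATION_MILLIS = 20L * 60 * 1000;

    /** Ids of blocks that have already been moved during this run. */
    private final Set<Long> movedBlockIds = new HashSet<>();

    /** Returns true the first time a block is seen, false if it was already moved. */
    boolean markMoved(long blockId) {
        return movedBlockIds.add(blockId);
    }

    /** Returns true once an iteration that began at startMillis has used up its time. */
    static boolean iterationExpired(long startMillis, long nowMillis) {
        return nowMillis - startMillis >= MAX_ITERATION_MILLIS;
    }
}
```

A scheduler would call `markMoved` before dispatching a block and skip the block when it returns false.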
I added one more change to the balancer. It shuffles the datanode array before constructing overUtilizedDatanodeList, underUtilizedDatanodeList etc. This adds some randomness to the otherwise static source/target datanode match.
This is a new patch that's built from the most recent trunk. It removed the part of the code that gets committed in
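For readers who want to see the idea of shuffling before partitioning, here is a minimal sketch. It is not the balancer's code: the class `ShuffleBeforeMatching`, its field names, and the map-based input are assumptions made for the example, and only `Collections.shuffle` is a real library call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: shuffle the nodes, then split them into over- and under-utilized lists. */
class ShuffleBeforeMatching {
    final List<String> overUtilized = new ArrayList<>();
    final List<String> underUtilized = new ArrayList<>();

    ShuffleBeforeMatching(Map<String, Double> percentUsedByNode,
                          double avg, double threshold, Random rnd) {
        List<String> nodes = new ArrayList<>(percentUsedByNode.keySet());
        // Randomize the order so that repeated runs do not always pair the same nodes.
        Collections.shuffle(nodes, rnd);
        for (String node : nodes) {
            double used = percentUsedByNode.get(node);
            if (used > avg + threshold) {
                overUtilized.add(node);
            } else if (used < avg - threshold) {
                underUtilized.add(node);
            }
        }
    }
}
```

The membership of each list does not depend on the shuffle; only the order within each list changes.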
The patch incorporates last review comments from Sanjay.
This is an updated user guide.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370962/balancer5.patch against trunk revision r601038.
@author +1. The patch does not contain any @author tags.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new compiler warnings.
findbugs -1. The patch appears to introduce 9 new Findbugs warnings.
core tests +1. The patch passed core unit tests.
contrib tests -1. The patch failed contrib unit tests.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1259/testReport/
This message is automatically generated.
The patch fixed the findbugs errors.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370975/balancer6.patch against trunk revision r601111.
@author +1. The patch does not contain any @author tags.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new compiler warnings.
findbugs -1. The patch appears to introduce 2 new Findbugs warnings.
core tests +1. The patch passed core unit tests.
contrib tests +1. The patch passed contrib unit tests.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1262/testReport/
This message is automatically generated.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12370987/balancer7.patch against trunk revision r601111.
@author +1. The patch does not contain any @author tags.
javadoc +1. The javadoc tool did not generate any warning messages.
javac +1. The applied patch does not generate any new compiler warnings.
findbugs +1. The patch does not introduce any new Findbugs warnings.
core tests +1. The patch passed core unit tests.
contrib tests -1. The patch failed contrib unit tests.
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1264/testReport/
This message is automatically generated.
The patch has a minor change to make the junit test run faster.
I just committed this. Thanks Hairong!
Integrated in Hadoop-Nightly #324 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/324/)
1. What is balance?
A cluster is balanced iff there are no under-capacity or over-capacity data nodes in the cluster.
An under-capacity data node is a node whose %used space is less than avg_%used_space - threshold.
An over-capacity data node is a node whose %used space is greater than avg_%used_space + threshold.
The threshold is user configurable. A default value could be 20% of %used space.
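The definitions above translate almost directly into code. The following is a minimal sketch in Java under two assumptions that the text leaves open: avg_%used_space is taken as the arithmetic mean of the per-node %used values, and the threshold is treated as an absolute number of percentage points. The class and method names are hypothetical.

```java
/** Hypothetical sketch of the balance definitions. */
class BalanceCheck {
    enum State { UNDER_CAPACITY, BALANCED, OVER_CAPACITY }

    /** Arithmetic mean of the per-node %used values (an assumption, see above). */
    static double averagePercentUsed(double[] percentUsed) {
        double sum = 0;
        for (double p : percentUsed) {
            sum += p;
        }
        return sum / percentUsed.length;
    }

    /** Classifies one node against avg - threshold and avg + threshold. */
    static State classify(double nodePercentUsed, double avg, double threshold) {
        if (nodePercentUsed < avg - threshold) {
            return State.UNDER_CAPACITY;
        }
        if (nodePercentUsed > avg + threshold) {
            return State.OVER_CAPACITY;
        }
        return State.BALANCED;
    }

    /** A cluster is balanced iff no node is under- or over-capacity. */
    static boolean isBalanced(double[] percentUsed, double threshold) {
        double avg = averagePercentUsed(percentUsed);
        for (double p : percentUsed) {
            if (classify(p, avg, threshold) != State.BALANCED) {
                return false;
            }
        }
        return true;
    }
}
```

For example, with nodes at 90%, 10%, and 50% used and a threshold of 20, the average is 50, so the first node is over-capacity and the second is under-capacity.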
2. When to rebalance?
Rebalancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing automatically shuts off once the cluster is balanced and can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
-----Start/stop data block rebalancing or query its status.
3. How to balance?
(a) Upon receiving a data block rebalancing request, a name node creates a Balancing thread.
(b) The thread performs rebalancing iteratively.
(c) The thread stops and frees itself when the cluster becomes balanced.
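Steps (a) to (c) amount to a loop that runs until the cluster is balanced or an administrator stops it. Here is a minimal sketch, assuming a hypothetical `Cluster` interface that hides how balance is measured and how blocks are moved; none of these names come from the design document.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of the balancing thread's main loop. */
class BalancingLoop implements Runnable {
    /** Assumed view of the cluster; not an existing dfs interface. */
    interface Cluster {
        boolean isBalanced();
        void runOneIteration();
    }

    private final Cluster cluster;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private volatile int iterations = 0;

    BalancingLoop(Cluster cluster) {
        this.cluster = cluster;
    }

    /** Called when an administrator issues the stop command. */
    void requestStop() {
        stopRequested.set(true);
    }

    /** Number of iterations completed so far. */
    int iterationsRun() {
        return iterations;
    }

    @Override
    public void run() {
        // Iterate until the cluster is balanced or a stop has been requested.
        while (!stopRequested.get() && !cluster.isBalanced()) {
            cluster.runOneIteration();
            iterations++;
        }
    }
}
```

Wrapping an instance in `new Thread(loop).start()` would give the on-demand thread described in (a); the thread ends on its own when `run()` returns.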