Here are some of my initial thoughts. Please comment.
1. What's balance?
A cluster is balanced iff there is no under-capactiy or over-capacity data nodes in the cluster.
An under-capacity data node is a node that its %used space is less than avg_%used_space-threshhold.
An over-capacity data node is a node that its %used space is greater than avg_%used_space+threshhold.
A threshold is user configurable. A default value could be 20% of % used space.
2. When to rebalance?
Rebanlancing is performed on demand. An administrator issues a command to trigger rebalancing. Rebalancing automatically shuts off once the cluster is balanced and can also be interrupted by an administrator. The following commands are to be supported:
Hadoop dfsadmin balance <start/stop/get>
-----Start/stop data block rebalancing or query its status.
3. How to balance?
(a) Upon receiving a data block rebalancing request, a name node creates a Balancing thread.
(b) The thread performs rebalancing iteratively.
- At each iteration, it scans the whole data node list and schedules block moving tasks. It sleeps for a heartbeat interval between iterations;
- When scanning the data node list, if it finds an under-capacity data node, it schedules moving blocks to the data node. The source data node is chosen randomly from over-capacity data nodes or non-under-capacity data nodes if no over-capacity data node exists. The source block is randomly chosen from the source data node as long as the block moving does not violate requirement (1).
- If the thread finds an over-capacity data node, it scheduls moving blocks from the data node to other data nodes. It chooses a target data node randomly from under-capacity data nodes or non-over-capcity data nodes when there is no under-capacity data node; It then randomly chooses a source block that does not violate requirement (1).
- The scheduled tasks are put to a queue in the source data node. The task queue has a limited length of 4 by default and is configurable.
- The scheduled tasks are sent to data nodes to execute in responding to a heartbeat message. Currently dfs limits at most 2 tasks per heartbeat by default.
(c) The thread stops and frees itself when the cluster becomes balanced.