[HDFS-16614] Improve balancer operation strategy and performance - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.9.2
Fix Version/s: None
Component/s: balancer & mover, namenode
Labels: None

Description

When the Balancer runs, it performs the following steps in order:
1. Obtain information about the available DataNodes from the NameNode.
2. Group the storage by StorageType and calculate the average utilization for each type. Combined with the configured threshold, this produces four sets: overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
3. Based on these sets, choose the sources and targets for data transfer. A source sends data and a target receives it.
4. Start the data transfers in parallel.
These steps are repeated iteratively. Throughout the process a single threshold is applied to all StorageTypes, which is rather coarse: with the heterogeneous storage that is currently supported, individual StorageTypes cannot be treated differently.
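
The classification in step 2 can be sketched as follows. This is a simplified illustration rather than the actual Balancer code: the class name UtilizationClassifier, the Group enum, and the exact boundary handling (a node exactly at the average counts as above average, a node exactly at the threshold stays inside it) are assumptions made for the example. Utilization, average, and threshold are all percentages for one StorageType.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop.
final class UtilizationClassifier {
    /** The four sets named in step 2. */
    enum Group { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    /** Classify one node by its distance from the average utilization. */
    static Group classify(double utilization, double average, double threshold) {
        double diff = utilization - average;
        boolean withinThreshold = Math.abs(diff) <= threshold;
        if (diff >= 0) {
            return withinThreshold ? Group.ABOVE_AVG_UTILIZED : Group.OVER_UTILIZED;
        }
        return withinThreshold ? Group.BELOW_AVG_UTILIZED : Group.UNDER_UTILIZED;
    }

    /** Group node names (mapped to utilization percent) into the four sets. */
    static Map<Group, List<String>> groupAll(Map<String, Double> nodes,
                                             double average, double threshold) {
        Map<Group, List<String>> result = new EnumMap<>(Group.class);
        for (Map.Entry<String, Double> e : nodes.entrySet()) {
            Group g = classify(e.getValue(), average, threshold);
            result.computeIfAbsent(g, k -> new ArrayList<>()).add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up node names and utilization values, for illustration only.
        Map<String, Double> nodes = new LinkedHashMap<>();
        nodes.put("dn1", 90.0);
        nodes.put("dn2", 85.0);
        nodes.put("dn3", 70.0);
        nodes.put("dn4", 60.0);
        System.out.println(groupAll(nodes, 78.0, 10.0));
    }
}
```

Because the same threshold argument is used for every StorageType, a value that suits DISK is also forced onto SSD, which is the coarseness described above.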

We have a production cluster with more than 2000 nodes whose storage utilization is imbalanced. For example (see the attached image-2022-06-02-13-18-33-213.png):

Here, the average utilization of the cluster is 78%, but the utilization of most nodes is between 85% and 90%. When the Balancer is running, we find that 85% of the nodes act as sources. We think this is unreasonable because it consumes a large share of the cluster's network resources; placing some effective restrictions on the source set would benefit the normal operation of the cluster.
So we propose the following changes:
1. When the Balancer runs, it should proactively report a suggested threshold for each StorageType. For example: [[DISK, 10%], [SSD, 8%]...]
2. Support setting the threshold per StorageType, and have the Balancer apply it.
3. Add an option that prevents nodes below the threshold from joining the source set. This lets nodes with high utilization transfer data as soon as possible, which helps balancing.
4. Add an option to leave a utilization range unchanged when a large share of the DataNodes in the cluster falls within it. For example, if 40% of the nodes in the cluster have a utilization of 75% to 80%, these nodes should not join the source set. This behavior must be specified by the user at runtime.
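
To make proposals 2 to 4 concrete, here is a minimal sketch of how a source-set filter could look. Everything in it is hypothetical: the Options fields, the isSourceCandidate method, and the local StorageType enum (a stand-in for Hadoop's own StorageType so that the example is self-contained) do not exist in the Balancer, and the default of 10.0 is only assumed to match the Balancer's default threshold. Proposal 3 is read here as "only nodes more than the threshold above the average may be sources".

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the proposed options; not existing Balancer code.
final class SourceSelectionSketch {
    /** Stand-in for Hadoop's StorageType, to keep the sketch self-contained. */
    enum StorageType { DISK, SSD, ARCHIVE }

    /** Proposed user-facing options (all names are assumptions). */
    static final class Options {
        /** Proposal 2: threshold in percent per StorageType. */
        final Map<StorageType, Double> thresholds = new EnumMap<>(StorageType.class);
        /** Used when no per-type threshold is set (assumed default). */
        double defaultThreshold = 10.0;
        /** Proposal 3: only nodes beyond the threshold may be sources. */
        boolean sourcesMustExceedThreshold = false;
        /** Proposal 4: nodes inside [protectedLow, protectedHigh] stay untouched. */
        boolean hasProtectedRange = false;
        double protectedLow;
        double protectedHigh;
    }

    /** Decide whether a node's storage of the given type may join the source set. */
    static boolean isSourceCandidate(StorageType type, double utilization,
                                     double average, Options opts) {
        double diff = utilization - average;
        if (diff < 0) {
            return false; // below average: this node receives data instead
        }
        if (opts.hasProtectedRange
                && utilization >= opts.protectedLow
                && utilization <= opts.protectedHigh) {
            return false; // proposal 4: user-specified range is left unchanged
        }
        double threshold = opts.thresholds.getOrDefault(type, opts.defaultThreshold);
        if (opts.sourcesMustExceedThreshold && diff <= threshold) {
            return false; // proposal 3: within the threshold, so not a source
        }
        return true;
    }

    public static void main(String[] args) {
        Options opts = new Options();
        opts.thresholds.put(StorageType.DISK, 10.0);
        opts.thresholds.put(StorageType.SSD, 8.0);
        opts.sourcesMustExceedThreshold = true;
        // With an average of 78%, a DISK node at 86% is within its threshold,
        // while an SSD node at 87% exceeds the tighter SSD threshold.
        System.out.println(isSourceCandidate(StorageType.DISK, 86.0, 78.0, opts));
        System.out.println(isSourceCandidate(StorageType.SSD, 87.0, 78.0, opts));
    }
}
```

With sourcesMustExceedThreshold enabled, nodes that are only slightly above the average no longer compete for bandwidth with the most heavily used nodes, which is the effect proposal 3 is aiming for.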

Attachments

image-2022-06-02-13-18-33-213.png
02/Jun/22 05:18
66 kB
JiangHua Zhu

People

Assignee: JiangHua Zhu

Reporter: JiangHua Zhu

Votes: 0

Watchers: 2

Dates

Created: 02/Jun/22 05:18

Updated: 02/Jun/22 05:24