[HDFS-13739] Add option to disable rack local write preference - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.3
Fix Version/s: 3.3.0
Component/s: balancer & mover, block placement, datanode, fs, hdfs, hdfs-client, namenode, nn, performance
Labels: None
Environment: Hortonworks HDP 2.6
Hadoop Flags: Reviewed

Description

Request an option to disable the rack-local write preference, i.e. to write all replicas to different racks.

Current HDFS write pattern of "local node, rack local node, other rack node" is good for most purposes but there are at least 2 scenarios where this is not ideal:

- Rack-by-rack maintenance leaves data at risk of losing its last remaining replica. While a rack is down for maintenance, a single datanode failure would likely cause a data outage, or even data loss if the rack is lost or an upgrade fails (or perhaps it is a rack rebuild). The only current workaround is to set replication to 4, which reduces write performance and wastes storage.
- Major storage imbalance across datanodes when datanodes are laid out unevenly across racks: some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 85% full and the other half were only 50% full.

Rack layouts like the following illustrate this. Nodes that share a rack send half of their non-local block replicas to each other, so they fill up first, while the other nodes receive far fewer replica blocks:

NumNodes - Rack
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4
1 - rack 5
1 - rack 6

In this case, if I reduce replication to 2, I get an almost perfect spread of blocks across all datanodes, because HDFS has no choice but to place the single remaining replica on a different rack. If I increase replication back to 3, it returns to 85% on half the nodes and 50% on the other half, because the extra replicas go only to rack-local nodes.
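The effect described above can be sketched with a small expected-value calculation. This is a simplified model of the write pattern as stated in this ticket, not the actual HDFS block placement code: it assumes every node writes the same number of blocks, replica 1 stays on the writer, one replica always goes off-rack, and any extra replica prefers a rack mate of the writer. The function name `expected_load` and its parameters are hypothetical.

```python
from fractions import Fraction


def expected_load(rack_sizes, replication):
    """Expected replicas held per node when every node writes one block.

    Simplified model (an assumption, not the HDFS implementation):
    replica 1 stays on the writer, one replica always goes off-rack,
    and any extra replica prefers a rack mate of the writer.
    """
    nodes = [(rack, i) for rack, size in enumerate(rack_sizes) for i in range(size)]
    load = {node: Fraction(0) for node in nodes}
    for writer in nodes:
        load[writer] += 1  # replica 1: local node
        mates = [n for n in nodes if n[0] == writer[0] and n != writer]
        off_rack = [n for n in nodes if n[0] != writer[0]]
        remaining = replication - 1
        # Extra replicas beyond the mandatory off-rack one go to rack mates.
        rack_local = max(0, min(remaining - 1, len(mates)))
        off = remaining - rack_local
        # Picking k distinct nodes uniformly out of n gives each node k/n.
        for n in mates:
            load[n] += Fraction(rack_local, len(mates))
        for n in off_rack:
            load[n] += Fraction(off, len(off_rack))
    return [load[n] for n in nodes]


if __name__ == "__main__":
    layout = [2, 2, 1, 1, 1, 1]  # nodes per rack, as in the table above
    for replication in (3, 2):
        loads = expected_load(layout, replication)
        print(replication, [round(float(x), 2) for x in loads])
```

Under these assumptions, with replication 3 each node in a two-node rack holds 73/21 (about 3.48) expected replicas per block written per node, against 53/21 (about 2.52) for a node alone in its rack. With replication 2 the figures are 40/21 and 44/21, which is close to even. The model reproduces the direction of the imbalance only, not the 85% and 50% figures observed on the real cluster.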

Why not just run the HDFS balancer to fix it, you might say? This is a heavily loaded HBase cluster. Aside from destroying HBase's data locality and performance by moving blocks out from underneath RegionServers, the balancer's effect does not last: as soon as an HBase major compaction occurs (at least weekly), HBase rewrites all blocks and the HDFS client again writes to local node, rack-local node, other-rack node, producing the same storage imbalance. Hence this cannot be solved by running the HDFS balancer on HBase clusters, or for any application on top of HDFS that has HDFS block churn.

Attachments

HDFS-13739-01.patch, 03/Jun/19 03:06, 10 kB, Ayush Saxena

Issue Links

is related to:

- HDFS-9090 Write hot data on few nodes may cause performance issue (Open)
- HDFS-13720 HDFS dataset Anti-Affinity Block Placement across all DataNodes for data local task optimization (improve Spark executor utilization & performance) (Open)
- HDFS-7541 Upgrade Domains in HDFS (Resolved)

People

Assignee: Ayush Saxena
Reporter: Hari Sekhon
Votes: 0
Watchers: 8

Dates

Created: 17/Jul/18 09:35
Updated: 19/Feb/20 03:27
Resolved: 19/Feb/20 03:02