Details
Description
Large production clusters are likely to have heterogeneous nodes in terms of storage capacity, memory, and CPU cores. It is not always possible to proportionally ingest data into DataNodes based on their remaining storage capacity. Therefore it's possible for a subset of DataNodes to be much closer to full capacity than the rest.
This heterogeneity is most likely rack-by-rack – i.e. m whole racks of low-storage nodes and n whole racks of high-storage nodes. So It'd be very useful if we can lower the chance for those near-full DataNodes to become destinations for the 2nd and 3rd replicas.