Suresh's comments in
However, for the write side, not picking the stale node could result in an issue, especially for small clusters. That is the reason why I think we should do the write-side changes in a related JIRA. We should consider making the stale timeout adaptive to the number of nodes marked stale in the cluster, as discussed in the previous comments. Additionally, we should consider having a separate configuration for writes skipping the stale nodes.
The more detailed proposal for handling writes is:
For writes, do not use stale datanodes (if possible). To avoid the scenario where a small T for judging the stale state generates new hotspots in the cluster, T is proposed to be calculated as:
T = t_c + (number of nodes already marked as stale) / (total number of nodes) * (T_d - t_c),
where t_c is a constant value initially set in the configuration, and T_d is the time for marking as dead (i.e., 10.5 min).
E.g., t_c can be set to 30s. Then, when no or only a few nodes are marked as stale, T stays small enough to satisfy the HBase requirement. When a large number of nodes are marked as stale, e.g., close to the total number of nodes, T will be almost T_d (i.e., ~10 min), and the workload can still be distributed across all the nodes that are alive.
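To make the formula concrete, here is a minimal Java sketch of the calculation. The class and method names (AdaptiveStaleInterval, staleIntervalMs) are hypothetical and not part of HDFS. The clamping of the stale count and the fallback for an empty cluster are assumptions added for safety, not part of the proposal.

```java
public final class AdaptiveStaleInterval {
    /** T_d from the proposal: time after which a node is marked dead (10.5 min). */
    static final long DEAD_INTERVAL_MS = 630_000L;

    /**
     * Computes T = t_c + (staleNodes / totalNodes) * (T_d - t_c), in milliseconds.
     * Hypothetical helper; the parameter names are illustrative only.
     */
    static long staleIntervalMs(long baseMs, long deadMs, int staleNodes, int totalNodes) {
        if (totalNodes <= 0) {
            return baseMs; // assumption: fall back to t_c when the cluster is empty
        }
        // Assumption: clamp the stale count into [0, totalNodes] so T stays in [t_c, T_d].
        int clamped = Math.max(0, Math.min(staleNodes, totalNodes));
        double ratio = (double) clamped / totalNodes;
        return baseMs + Math.round(ratio * (deadMs - baseMs));
    }

    public static void main(String[] args) {
        long tc = 30_000L; // t_c = 30s, as in the example above
        System.out.println(staleIntervalMs(tc, DEAD_INTERVAL_MS, 0, 100));   // no stale nodes
        System.out.println(staleIntervalMs(tc, DEAD_INTERVAL_MS, 50, 100));  // half stale
        System.out.println(staleIntervalMs(tc, DEAD_INTERVAL_MS, 100, 100)); // all stale
    }
}
```

With t_c = 30s and T_d = 10.5 min, T grows linearly from 30s (no stale nodes) to 10.5 min (all nodes stale), so a burst of stale markings automatically makes the stale check more conservative.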
When almost all nodes are marked as stale, the number of remaining normal (non-stale) alive nodes may be less than the number of replicas. In that case, include stale nodes as write target candidates.
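The fallback rule can be sketched as follows. This is only an illustration under assumed types: WriteTargetCandidates, Node, and candidates are hypothetical names, not the HDFS block placement API, and the sketch ignores rack awareness and the other constraints that real placement has to honor.

```java
import java.util.ArrayList;
import java.util.List;

public final class WriteTargetCandidates {
    /** Hypothetical stand-in for an alive datanode and its stale flag. */
    static final class Node {
        final String name;
        final boolean stale;

        Node(String name, boolean stale) {
            this.name = name;
            this.stale = stale;
        }
    }

    /**
     * Returns only the non-stale nodes when there are at least as many of them
     * as replicas; otherwise returns all alive nodes, stale ones included.
     */
    static List<Node> candidates(List<Node> aliveNodes, int replication) {
        List<Node> fresh = new ArrayList<>();
        for (Node n : aliveNodes) {
            if (!n.stale) {
                fresh.add(n);
            }
        }
        if (fresh.size() >= replication) {
            return fresh;
        }
        return new ArrayList<>(aliveNodes);
    }
}
```

In this sketch, stale nodes are skipped whenever enough normal nodes exist to hold every replica, and are readmitted only when skipping them would leave the write short of targets.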