Currently, decommissioning machines in an HA-enabled cluster requires running refreshNodes on both the active and standby NameNodes. Sometimes decommissioning never finishes from the standby NN's point of view. Here is a diagnosis of why that can happen.
The standby NN's blockManager manages block replication and block invalidation as if it were the active NN, even though DNs ignore block commands coming from the standby NN. When the standby NN makes block operation decisions, such as choosing the target of a block replication or the node to remove an excess replica from, it decides independently of the active NN. So the active NN and the standby NN can end up with different states. When we try to decommission nodes on the standby NN, this state inconsistency might prevent the standby NN from making progress. Here is an example.
1. For a given block, both active and standby have 5 replicas, on machines A, B, C, D, E. So both active and standby decide to pick excess nodes to invalidate.
Active picks D and E as excess DNs. After the next block reports from D and E, the active NN has 3 active replicas (A, B, C) and 0 excess replicas.
Standby picks C and E as excess DNs. Because DNs ignore commands from standby, C is never deleted, while D and E are deleted at active's request. After the next block reports from C, D, and E, standby has 2 active replicas (A, B) and 1 excess replica (C).
2. A decommission request for machine A was sent to standby. Standby then had only one live replica (B) and picked machines G and H as replication targets, but because DNs ignored standby's commands, the replications to G and H stayed in the pending replication queue until they timed out. At this point, standby had one decommissioning replica (A), one active replica (B), and one excess replica (C).
3. A decommission request for machine A was sent to the active NN. The active NN picked machine F as the target, and the replication finished properly. So the active NN had 3 active replicas (B, C, F) and one decommissioned replica (A).
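4. Standby NN learned of F as a new replica. Thus standby had one decommissioning replica (A), 2 active replicas (B, F), and one excess replica (C). Because it still counted fewer than 3 active replicas, standby NN kept trying to schedule replication work, but DNs ignored the commands, so the decommission of A did not complete on standby.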
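
The four steps above can be replayed with a small toy model. This is only an illustrative sketch in Python, not HDFS code: the class ReplicaView, its methods, and the rule "decommission finishes once the block has at least REPLICATION active replicas" are simplifying assumptions made for this example, and a replication factor of 3 is assumed from the replica counts in the example.

```python
from dataclasses import dataclass, field

REPLICATION = 3  # assumed target replication factor for the block


@dataclass
class ReplicaView:
    """One NameNode's view of a single block's replicas (hypothetical toy model)."""
    active: set = field(default_factory=set)
    excess: set = field(default_factory=set)
    decommissioning: set = field(default_factory=set)
    decommissioned: set = field(default_factory=set)

    def mark_excess(self, nodes):
        # The NN decides locally which replicas are excess.
        nodes = set(nodes)
        self.active -= nodes
        self.excess |= nodes

    def replicas_deleted(self, nodes):
        # Block reports show these replicas are really gone from the DNs.
        nodes = set(nodes)
        self.active -= nodes
        self.excess -= nodes

    def replica_added(self, node):
        self.active.add(node)

    def start_decommission(self, node):
        self.active.discard(node)
        self.decommissioning.add(node)

    def try_finish_decommission(self, node):
        # Simplified rule: finish only with enough active replicas elsewhere.
        if node in self.decommissioning and len(self.active) >= REPLICATION:
            self.decommissioning.discard(node)
            self.decommissioned.add(node)
            return True
        return False


def simulate():
    active_nn = ReplicaView(active=set("ABCDE"))
    standby_nn = ReplicaView(active=set("ABCDE"))

    # Step 1: each NN picks excess replicas independently.
    active_nn.mark_excess("DE")
    standby_nn.mark_excess("CE")
    deleted = "DE"  # DNs execute only the active NN's invalidation commands
    active_nn.replicas_deleted(deleted)
    standby_nn.replicas_deleted(deleted)

    # Step 2: decommission A on standby; its replication commands to G and H
    # are ignored by DNs, so no new replica ever appears because of them.
    standby_nn.start_decommission("A")
    standby_nn.try_finish_decommission("A")

    # Step 3: decommission A on active; replication to F succeeds.
    active_nn.start_decommission("A")
    active_nn.replica_added("F")
    active_nn.try_finish_decommission("A")

    # Step 4: standby learns about F but still has only 2 active replicas.
    standby_nn.replica_added("F")
    standby_nn.try_finish_decommission("A")
    return active_nn, standby_nn


if __name__ == "__main__":
    for name, view in zip(("active", "standby"), simulate()):
        print(name, {k: sorted(v) for k, v in vars(view).items()})
```

In this model the standby view ends with A still in the decommissioning set, because C stays marked as excess and is never counted toward the active replicas.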