In our largest prod cluster (running 2.8.2) we have >3k hosts. Every time we do a rolling restart of the NNs, we have to wait for the block reports, which takes >2.5 hours for each NN.
One way to make it faster is to manually trigger a full block report from all datanodes (HDFS-7278). However, the current triggerBlockReport command triggers a block report to all NNs, which floods the active NN as well.
A quick solution would be to add an option that specifies which NN the manually triggered block report goes to, something like:
hdfs dfsadmin [-triggerBlockReport [-incremental] <datanode_host:ipc_port> [-namenode <namenode_host:ipc_port>]]
That way, when restarting a standby NN or an observer NN, we can trigger an aggressive block report to that specific NN so it exits safe mode faster, without risking the active NN's performance.
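
As a rough sketch of how this could be used during a rolling restart, the helper below prints one dfsadmin command per datanode, all aimed at a single NN. It assumes the -namenode option proposed above exists (it is not available in 2.8.2). The function name, host names, and ports are hypothetical placeholders. The commands are only printed, so they can be reviewed before being run.

```shell
#!/usr/bin/env bash
# Sketch only: "-namenode" is the option proposed in this issue, not an existing flag.
# Usage: print_trigger_cmds <namenode_host:ipc_port> <datanode_host:ipc_port>...
print_trigger_cmds() {
  local namenode="$1"   # the restarted standby/observer NN (hypothetical address)
  shift
  local dn
  for dn in "$@"; do
    # One full block report request per datanode, sent only to the chosen NN.
    echo "hdfs dfsadmin -triggerBlockReport ${dn} -namenode ${namenode}"
  done
}

# Example with placeholder hosts; prints the commands instead of running them.
print_trigger_cmds standby-nn.example.com:8020 \
  dn1.example.com:50020 dn2.example.com:50020
```

Printing rather than executing keeps the sketch safe to try. In practice the datanode list would likely be processed in batches so the restarted NN is not overloaded either.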