The default value of dfs.image.transfer.bandwidthPerSec is 0, which means fsimage transfers during a checkpoint are not throttled and can use all available bandwidth. I think we should throttle this. Many users have experienced NameNode failover while transferring a large fsimage (e.g. >25 GB) together with fsimage replication to the dfs.namenode.name.dir directories.
Proposed settings:
dfs.image.transfer.bandwidthPerSec=52428800 (50 MB/s)
dfs.namenode.checkpoint.txns=2000000 (the default is 1M; a higher value helps avoid frequent checkpoints. However, a checkpoint still runs once every 6 hours by default.)
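
As a rough sanity check of the proposed throttle, the sketch below converts 50 MB/s into the bytes-per-second value used above and estimates how long a 25 GB fsimage transfer would take at that rate. It assumes binary units (1 MB = 1024 * 1024 bytes, which is what 52428800 implies) and ignores protocol and disk overhead. The function names are hypothetical helpers for illustration, not part of Hadoop.

```python
MIB = 1024 * 1024  # bytes per MB, assuming binary units


def bandwidth_bytes_per_sec(mb_per_sec: int) -> int:
    """Convert a throttle in MB/s to a bytes-per-second value."""
    return mb_per_sec * MIB


def transfer_seconds(image_gb: float, mb_per_sec: int) -> float:
    """Estimate the duration of a throttled transfer, ignoring overhead."""
    image_bytes = image_gb * 1024 * MIB
    return image_bytes / bandwidth_bytes_per_sec(mb_per_sec)


if __name__ == "__main__":
    print(bandwidth_bytes_per_sec(50))  # 52428800
    print(transfer_seconds(25, 50))     # 512.0 seconds, about 8.5 minutes
```

Under these assumptions a 25 GB image would take roughly 8.5 minutes to transfer at the throttled rate, so the checkpoint takes longer but no longer competes for the full link.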