Thinking more about this one: we could exit safemode faster if we computed misReplicatedBlocks before we have at least one replica of every block. Currently, startup goes through two steps:
Step 1: the namenode waits to ensure that there is at least one replica of all known blocks.
Step 2: it then invokes processMisReplicatedBlocks to update neededReplication.
When the cluster restarts, the namenode starts in Step 1 and begins processing a storm of block reports from all datanodes. A few datanodes are somewhat slow, though, and the block reports from these stragglers delay the transition from Step 1 to Step 2. CPU usage on the NN decreases exponentially as Step 1 progresses and is almost negligible by the time Step 1 is about to end.
This jira could change the code so that processMisReplicatedBlocks is invoked before Step 1 finishes completely, making use of the CPU that is otherwise idle at the tail of Step 1. This should let the NN exit safemode earlier.
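
To make the idea concrete, here is a rough sketch of how the trigger could work. It is not the actual NameNode code: the class SafeModeSketch, its methods, and the earlyScanPercent knob are all hypothetical names made up for illustration. It only tracks which blocks have at least one reported replica and signals the point at which the mis-replication scan could start, before Step 1 is complete.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch only; none of these names exist in HDFS.
 * Tracks Step 1 progress and decides when Step 2 (the
 * mis-replication scan) could be started early.
 */
final class SafeModeSketch {
    private final long totalBlocks;       // blocks known from the namespace
    private final long earlyScanMinimum;  // blocks needed before the scan may start
    private final Set<Long> reportedBlocks = new HashSet<>();
    private boolean scanStarted = false;

    /** earlyScanPercent is an assumed tuning knob, e.g. 80 means 80% of blocks. */
    SafeModeSketch(long totalBlocks, int earlyScanPercent) {
        this.totalBlocks = totalBlocks;
        // Integer ceiling of totalBlocks * earlyScanPercent / 100.
        this.earlyScanMinimum = (totalBlocks * earlyScanPercent + 99) / 100;
    }

    /**
     * Records one replica seen in a block report.
     * Returns true only on the call that crosses the early-scan threshold,
     * which is where processMisReplicatedBlocks would be kicked off.
     */
    boolean onReplicaReported(long blockId) {
        reportedBlocks.add(blockId);
        if (!scanStarted && reportedBlocks.size() >= earlyScanMinimum) {
            scanStarted = true;
            return true;
        }
        return false;
    }

    /** Step 1 is complete once every known block has at least one replica. */
    boolean isStepOneComplete() {
        return reportedBlocks.size() >= totalBlocks;
    }

    boolean isScanStarted() {
        return scanStarted;
    }
}
```

With 10 known blocks and earlyScanPercent set to 80, the scan would be triggered when the 8th distinct block is reported, while Step 1 still waits on the stragglers for the last two. A real change would also have to handle replicas that arrive after the scan has started, which this sketch ignores.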