Consider a large cluster that takes 40 minutes to start up. The datanodes compete to register and send their Initial Block Reports (IBRs) as fast as they can after startup (subject to a small sub-two-minute random delay, which isn't relevant to this discussion).
As each datanode succeeds in sending its IBR, it schedules the start of its regular cycle of reports, which repeat every hour (or at whatever interval is configured in dfs.blockreport.intervalMsec). To spread the reports evenly across the block report interval, each datanode picks a random fraction of that interval as the starting point of its regular report cycle. For example, if a particular datanode randomly selects 18 minutes after the hour, then it will send a Block Report at 18 minutes after the hour, every hour, for as long as it remains up. Other datanodes will start their cycles at other randomly selected times. This code is in DataNode.blockReport() and DataNode.scheduleBlockReport().
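The scheduling described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual DataNode code: the class and method names are hypothetical, and it assumes the random offset is measured from the moment the IBR succeeds.

```java
import java.util.Random;

/** Hypothetical sketch of random block report scheduling; not the real DataNode code. */
class BlockReportScheduleSketch {

    /**
     * Picks the delay between a successful IBR and the first regular report
     * as a uniformly random fraction of the report interval, in [0, intervalMs).
     */
    static long randomStartOffsetMs(long intervalMs, Random rng) {
        return (long) (rng.nextDouble() * intervalMs);
    }

    /**
     * Time of the k-th regular report (k = 0 is the second Block Report),
     * assuming the offset is counted from the IBR time.
     */
    static long reportTimeMs(long ibrTimeMs, long offsetMs, long intervalMs, int k) {
        return ibrTimeMs + offsetMs + (long) k * intervalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 60L * 60L * 1000L; // one hour, as in the text
        long offsetMs = randomStartOffsetMs(intervalMs, new Random());
        System.out.println("First regular report " + (offsetMs / 60000L)
                + " minutes after the IBR, then every "
                + (intervalMs / 60000L) + " minutes");
    }
}
```

Because each datanode draws its own offset independently, the regular reports of a large cluster end up spread roughly uniformly across the interval.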
The "second Block Report" (2BR) is the first of these hourly reports. The problem is that some of these 2BRs get scheduled sooner rather than later, and actually occur within the startup period. For example, suppose the cluster takes 40 minutes (2/3 of an hour) to start up. A datanode whose IBR succeeds t minutes into startup schedules its 2BR at a uniformly random point in the following 60 minutes, so the chance that the 2BR lands before minute 40 is (40 - t)/60. That is 2/3 at t = 0 and 1/2 at t = 10, so of the datanodes that succeed in sending their IBRs during the first 10 minutes, between 1/2 and 2/3 will send their 2BR before the 40-minute startup has completed!
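The arithmetic behind the "between 1/2 and 2/3" figure can be checked with a few lines of Java. This is a minimal sketch with hypothetical names; it assumes each datanode's 2BR falls uniformly at random within one report interval after its IBR.

```java
/** Hypothetical helper for the fraction of 2BRs that land inside the startup window. */
class EarlySecondReportFraction {

    /**
     * Probability that a datanode whose IBR succeeds at ibrMinute sends its
     * 2BR before startup ends, assuming a uniform random offset in
     * [0, intervalMinutes) after the IBR.
     */
    static double fractionBeforeStartupEnds(double ibrMinute,
                                            double startupMinutes,
                                            double intervalMinutes) {
        double remaining = startupMinutes - ibrMinute;
        if (remaining <= 0) {
            return 0.0; // IBR completed at or after the end of startup
        }
        return Math.min(1.0, remaining / intervalMinutes);
    }

    public static void main(String[] args) {
        // 40-minute startup and 60-minute report interval, as in the example.
        System.out.println(fractionBeforeStartupEnds(0, 40, 60));
        System.out.println(fractionBeforeStartupEnds(10, 40, 60));
    }
}
```

With the numbers from the example, the two calls give approximately 0.667 (IBR at minute 0) and exactly 0.5 (IBR at minute 10).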
2BRs sent within the startup time actually compete with the remaining IBRs, and thereby slow down the overall startup process. This can be seen in the following data, which shows the startup process for a 3700-node cluster that took about 17 minutes to finish startup:
This data was harvested from the startup logs of all the datanodes, and correlated into one-minute buckets. Each row of the table represents the progress during one elapsed minute of clock time. It seems that every cluster startup is different, but this one showed the effect fairly well.
The "starts" column shows that all the nodes started up within the first 2 minutes, and the "regs" column shows that all succeeded in registering by minute 6. The IBR column shows a sustained rate of Initial Block Report processing of 250-300/minute for the first 10 minutes.
The question is why, during minutes 11 through 16, the rate of IBR processing slowed down. Why didn't the startup just finish? In the "2nd_BR" column, we see the rate of 2BRs ramping up as more datanodes complete their IBRs. As the rate increases, they become more effective at competing with the IBRs, and slow down the IBR processing even more. After the IBRs finally finish in minute 16, the rate of 2BRs settles down to a steady ~60-70/minute.
To decrease competition for locks and other resources, and thereby speed up IBR processing during startup, we propose to delay 2BRs until later in the cycle.
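One possible shape for such a delay is sketched below. This is only an illustration under stated assumptions, not the proposed patch: the class, the method, and the minDelayMs parameter are all hypothetical. The idea shown is to push the 2BR back by a fixed minimum delay after the IBR while keeping the random spread across the interval.

```java
import java.util.Random;

/** Hypothetical sketch of a delayed second Block Report; not actual HDFS code. */
class DelayedSecondReportSketch {

    /**
     * Delay between a successful IBR and the 2BR: a fixed minimum delay
     * (hypothetical parameter) plus a uniformly random fraction of the
     * report interval. The result lies in [minDelayMs, minDelayMs + intervalMs).
     */
    static long secondReportDelayMs(long minDelayMs, long intervalMs, Random rng) {
        return minDelayMs + (long) (rng.nextDouble() * intervalMs);
    }

    public static void main(String[] args) {
        long intervalMs = 60L * 60L * 1000L; // one-hour report interval
        long minDelayMs = 40L * 60L * 1000L; // assumed value: the 40-minute startup example
        long delayMs = secondReportDelayMs(minDelayMs, intervalMs, new Random());
        System.out.println("2BR scheduled " + (delayMs / 60000L) + " minutes after the IBR");
    }
}
```

In this sketch, if minDelayMs is at least as long as the startup period, no 2BR can be sent while IBRs are still in progress, and the steady-state reports remain spread across the interval because the random term is unchanged.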