A DataNode typically sends a heartbeat to the NameNode every 3 seconds. We expect heartbeat handling to complete relatively quickly. However, if something unexpected causes heartbeat processing to get blocked, such as a long GC or heavy lock contention within the NameNode, then heartbeat processing would be delayed. After recovering from this delay, the DataNode then starts sending a storm of heartbeat messages in a tight loop. In a large cluster with many DataNodes, this storm of heartbeat messages could cause harmful load on the NameNode and make overall cluster recovery more difficult.
The bug appears to be caused by incorrect timekeeping inside BPServiceActor. The next heartbeat time is always calculated as a delta from the previous heartbeat time, without any compensation for possible long latency on an individual heartbeat RPC. The only mitigations would be restarting all DataNodes to force a reset of the heartbeat schedule, or simply waiting out the storm until the scheduling catches up and corrects itself.
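To illustrate the suspected timekeeping issue, here is a minimal, self-contained sketch. It is not the actual BPServiceActor code: the class and method names are hypothetical, the stalled RPC is modeled as a single delay, and follow-up heartbeats are assumed to take zero time. It only contrasts scheduling from the previous heartbeat time with scheduling from the current time.

```java
// Hypothetical simulation; none of these names come from Hadoop.
class HeartbeatScheduleSketch {
    // Typical heartbeat interval mentioned above: 3 seconds.
    static final long INTERVAL_MS = 3_000;

    // Behavior described in this report: next time is a delta from the previous scheduled time.
    static long nextFromPrevious(long previousScheduledMs) {
        return previousScheduledMs + INTERVAL_MS;
    }

    // One possible compensation (an assumption): next time is a delta from the current time.
    static long nextFromNow(long nowMs) {
        return nowMs + INTERVAL_MS;
    }

    // Counts heartbeats that are already overdue, and so sent back-to-back,
    // once a heartbeat RPC issued at t=0 finally returns after stallMs.
    static int backToBackHeartbeats(long stallMs, boolean compensate) {
        long scheduled = 0;   // the stalled heartbeat was due at t=0
        long now = stallMs;   // the clock reads stallMs when the RPC returns
        scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);
        int burst = 0;
        while (scheduled <= now) {
            burst++;          // overdue, so it is sent immediately in a tight loop
            scheduled = compensate ? nextFromNow(now) : nextFromPrevious(scheduled);
        }
        return burst;
    }

    public static void main(String[] args) {
        System.out.println(backToBackHeartbeats(60_000, false)); // prints 20
        System.out.println(backToBackHeartbeats(60_000, true));  // prints 0
    }
}
```

Under these assumptions, a single 60-second stall leaves 20 overdue heartbeats per DataNode when the schedule is advanced from the previous heartbeat time, and none when it is advanced from the current time.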
This problem would not manifest after a NameNode restart. In that case, the NameNode would respond to the first heartbeat by telling the DataNode to re-register, and BPServiceActor#reRegister would reset the heartbeat schedule to the current time. I believe the problem would only manifest if the NameNode process stayed alive but processed heartbeats unexpectedly slowly.
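The effect of resetting the schedule to the current time can be sketched as follows. This is a hypothetical model, not the real BPServiceActor#reRegister: the class, field, and method names are invented for illustration, and the schedule is reduced to a single timestamp.

```java
// Hypothetical model of a heartbeat schedule; names are not from Hadoop.
class ReRegisterResetSketch {
    static final long INTERVAL_MS = 3_000;

    // Time at which the next heartbeat is due.
    private long nextHeartbeatMs;

    public ReRegisterResetSketch(long firstHeartbeatMs) {
        this.nextHeartbeatMs = firstHeartbeatMs;
    }

    // Advance the schedule as a delta from the previous scheduled time.
    public void scheduleFromPrevious() {
        nextHeartbeatMs += INTERVAL_MS;
    }

    // Stand-in for the reset that re-registration is described as performing.
    public void resetTo(long nowMs) {
        nextHeartbeatMs = nowMs;
    }

    // Number of heartbeats that are due at or before nowMs.
    public long overdueHeartbeats(long nowMs) {
        if (nextHeartbeatMs > nowMs) {
            return 0;
        }
        return (nowMs - nextHeartbeatMs) / INTERVAL_MS + 1;
    }
}
```

For example, after a heartbeat scheduled at t=0 stalls for 60 seconds, `scheduleFromPrevious()` leaves `overdueHeartbeats(60_000)` at 20, while `resetTo(60_000)` leaves only the single heartbeat that is due immediately. That matches the observation that a re-registration clears the backlog.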