We need this because when many jobs have short tasks, the job tracker can be overwhelmed by heartbeats.
I think that the patch should have two pieces.
1: Within any one node, we should delay an out-of-band heartbeat that would otherwise occur too soon after the most recent heartbeat, in the hope of reporting multiple task attempt completions in a single heartbeat and thus reducing the total load on the job tracker. This is a tradeoff: the node won't get a new task immediately.
2: We should cap the total number of heartbeats over a time interval. Both the cap and the interval should be configurable. If the interval is INT and the cap is C, we should track the times of the last C heartbeats we sent. If the time T of the oldest one is less than INT ago and we otherwise meet the criteria for sending a heartbeat, we should unconditionally send one at time T + INT rather than immediately.
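To make the two pieces concrete, here is a rough sketch of how they might combine. Everything in it is an assumption for illustration: the class HeartbeatThrottle, its method names, and the minGapMs parameter are hypothetical and are not existing TaskTracker code. Times are plain milliseconds supplied by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of the TaskTracker; names are illustrative only.
final class HeartbeatThrottle {
    private final long minGapMs;    // piece 1: minimum spacing after the most recent heartbeat
    private final long intervalMs;  // piece 2: the interval INT
    private final int cap;          // piece 2: the cap C
    // Send times of at most the last C heartbeats, oldest first.
    private final Deque<Long> sentTimes = new ArrayDeque<>();

    HeartbeatThrottle(long minGapMs, long intervalMs, int cap) {
        if (cap < 1) {
            throw new IllegalArgumentException("cap must be at least 1");
        }
        this.minGapMs = minGapMs;
        this.intervalMs = intervalMs;
        this.cap = cap;
    }

    // Earliest time at which a heartbeat that we want to send at 'now' may go out.
    long earliestSendTime(long now) {
        long earliest = now;
        if (!sentTimes.isEmpty()) {
            // Piece 1: do not send too soon after the most recent heartbeat.
            earliest = Math.max(earliest, sentTimes.peekLast() + minGapMs);
        }
        if (sentTimes.size() == cap) {
            // Piece 2: if the oldest of the last C heartbeats (time T) is less
            // than INT ago, wait until T + INT.
            earliest = Math.max(earliest, sentTimes.peekFirst() + intervalMs);
        }
        return earliest;
    }

    // Record that a heartbeat was actually sent at 'sentAt'.
    void recordSent(long sentAt) {
        sentTimes.addLast(sentAt);
        if (sentTimes.size() > cap) {
            sentTimes.removeFirst(); // keep only the last C send times
        }
    }
}
```

The caller would ask earliestSendTime(now) whenever it otherwise meets the criteria for sending a heartbeat, sleep until the returned time, send, and then call recordSent. Whether the minimum gap should apply only to out-of-band heartbeats or to all of them is left open here.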
Since principle 2 may induce a longish delay, perhaps each heartbeat should say when the next heartbeat should occur? That would make this patch a bigger deal, because up to now all changes could be localized to the TaskTracker and this one can't be, but it might be worthwhile.