I changed the blurb rate to 60 seconds around August 2008 on Y! clusters.
The blurb period for metrics (the config blurb uses a different period) was actually still 5 seconds in metrics1 when we were deploying metrics2 (which uses the default blurb period of 10 seconds) in 2010 on Y! clusters. Rajiv can confirm this. Are you saying the Simon aggregator could not process fewer than 1k UDP packets per second? In any case, the throughput I saw on the Simon aggregator (a few months ago) is well above that. Rajiv said the limiting factor is not UDP packet processing at the aggregator level but the IOPS needed to store the data.
The Simon plugin only does addition and averaging of samples.
I'm sure you meant the Simon aggregator. It also does user-defined calculations (defined in the Simon config file); if you lose the sole UDP packet in a reporting period, the derived metrics will be incorrect, so you need at least a couple of samples per reporting period. While MetricVaryingRate in metrics1 and MutableRate in metrics2 do averaging and compute throughput, and are used mostly in RPC-related metrics, most metrics in mapred are counters and gauges, and almost all of the mapred throughput metrics (*PerSec) are actually derived metrics from the Simon config. This approach halves the packet size compared with using the *Rate metrics in the metrics sources. Simon sinks send one packet per update, unlike Ganglia, which sends one packet per metric per update.
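To illustrate why the derived metrics need at least two samples per reporting period, here is a minimal sketch of how an aggregator might compute a derived *PerSec value from raw counter samples. The function name and the sample format are hypothetical, not Simon's actual API or wire format:

```python
# Hypothetical sketch: deriving a *PerSec metric from raw counter samples,
# roughly the way an aggregator evaluates a user-defined calculation.
# Names and data shapes are illustrative, not Simon's actual interface.

def derive_per_sec(samples):
    """samples: list of (timestamp_sec, counter_value) received in a window.

    With two or more samples we can compute a rate from the deltas.
    With zero or one sample (e.g. the sole UDP packet was lost),
    there is no delta to divide, so the derived metric is undefined.
    """
    if len(samples) < 2:
        return None  # derived metric cannot be computed this period
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Two samples in the reporting period -> a usable rate:
assert derive_per_sec([(0, 1000), (10, 1600)]) == 60.0
# Only one sample arrived -> no derived value for this period:
assert derive_per_sec([(10, 1600)]) is None
```

This is why a reporting period that only ever contains one update packet is fragile: a single drop leaves nothing to derive from.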
Are you concerned that the metrics might overflow if the publish period is 60 seconds?
No. Even if some of them do, it's easy to see and explain on the graphs. Any metrics backend built on rrdtool should handle counter wraps automatically.
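The reason rrdtool-style backends can handle wraps automatically is that a COUNTER data source assumes a monotonically increasing value: when a new reading is smaller than the previous one, the backend assumes the counter wrapped at 2^32 or 2^64 and adjusts the delta accordingly. A rough sketch of that logic (not rrdtool's actual code):

```python
# Rough sketch of rrdtool-style COUNTER wrap handling (not rrdtool's code).
# A COUNTER data source is assumed monotonically increasing; a decrease is
# interpreted as a 32-bit or 64-bit wrap rather than a real drop in value.

def counter_delta(prev, curr):
    if curr >= prev:
        return curr - prev
    # Value went down: assume the counter wrapped at 2^32 if the previous
    # reading fit in 32 bits, otherwise at 2^64.
    wrap = 2**32 if prev < 2**32 else 2**64
    return curr + wrap - prev

# Normal increase:
assert counter_delta(100, 160) == 60
# 32-bit wrap: counter went from near 2^32 back to a small value:
assert counter_delta(2**32 - 10, 40) == 50
```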
As a side benefit, publishing less frequently means fewer cycles spent on metrics monitoring, which makes the system more efficient.
At least with metrics2, which is more efficient than metrics1, even a 1-second period had no noticeable impact on system performance the last time I checked; the few hundred additional objects per second created in the timer thread are mostly noise compared with the overall GC and context-switching activity on busy servers.
My point is that you should not change the current default, which could affect production monitoring, without actually testing it at scale.