Some comments on performance considerations...
I am attaching the code (FSLockPerf.java) that I used to do some rudimentary microbenchmarking. It's not perfect, but hopefully it gives an idea of what kind of overhead this may incur. If anyone is interested in seeing other numbers, let me know and I will do my best to generate them. Note that this feature is disabled by default, so no overhead is incurred by those not actively opting in to it.
I include two different tests, "overall" and "aggTime". In both I focus on the worst-case scenario in which all threads are reader threads, i.e. they are not hindered by the Namesystem lock and contend solely on the metrics. In both cases I use 200 threads to model what would occur in a highly contended system. Also, every aggregation involves 50 operations, emulating 50 distinct operation types having occurred on each thread since the last aggregation; this seems a conservatively high upper bound, since most operation types are uncommon.
overall tries to be more holistic, but involves a higher degree of variability since actual locks are being held. This test sets the aggregation interval to various values (including completely disabled, and an interval high enough that aggregation is never triggered) and measures the overall time it takes each of the 200 threads to complete 500,000 cycles of read lock/unlock (including all metrics-related operations). Over 1,000 iterations I got:
Agg Interval    Total Time MS (Avg)    Total Time MS (StdDev)
0               30518                  1777
9999999         30825                  1673
20000           30183                  1709
10000           30272                  1681
5000            30278                  1740
1000            30307                  1702
10              30350                  1692
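For reference, the overall test boils down to a loop of this shape: each reader thread takes and releases a read lock, bumps a cheap per-thread count, and periodically folds that count into shared state behind a synchronized method. This is only a sketch of the pattern; the class name, constants, and thread/cycle counts below are illustrative, not taken from FSLockPerf.java (which uses 200 threads and 500,000 cycles):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the "overall" benchmark loop: read lock/unlock cycles with a
// cheap per-thread count and a periodic contended aggregation.
public class ReadLockBench {
    static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();
    static final int AGG_INTERVAL = 1000;  // cycles between aggregations
    static long aggregated = 0;            // shared, guarded by aggregate()

    // The contended step: folds a batch of local counts into shared state.
    static synchronized void aggregate(long localCount) {
        aggregated += localCount;
    }

    public static void main(String[] args) throws InterruptedException {
        final int threads = 4;       // 200 in the real test
        final int cycles = 100_000;  // 500,000 in the real test
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                long local = 0;  // per-thread count, no shared state touched
                for (int i = 0; i < cycles; i++) {
                    LOCK.readLock().lock();
                    try {
                        local++;  // stands in for the cheap metric update
                    } finally {
                        LOCK.readLock().unlock();
                    }
                    if (local % AGG_INTERVAL == 0) {
                        aggregate(AGG_INTERVAL);  // contended synchronized call
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(aggregated);  // threads * cycles = 400000 here
    }
}
```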
Clearly the metrics processing is lost within the noise of the locking itself, especially given that the runs with the logic disabled actually averaged higher than those with it enabled. Still, these results were not very satisfying, so I tried to measure more precisely with aggTime.
aggTime is the narrower of the two. I assume the local tracking of metrics is very cheap, since it simply increments a counter within a ThreadLocal, so I focus on the time taken by the more expensive aggregation (which involves a synchronized method to update the MutableRate metric). First I run a test with only a single thread updating metrics, then run the full 200 threads under a few different conditions: with aggregation turned on and off (to get a baseline figure for performance with many threads running), and with and without a 1-millisecond sleep between operations (to emulate slightly less pessimistic conditions of lock contention). Each thread performs 10,000 aggregations and I measure the time per operation:
10,000 aggregations per thread over 100 trials:

Test               Average Time (ns)    Std Dev (ns)
Single Thread      3107                 606
No Agg, No Wait    551                  551
Agg, No Wait       235850               24059
No Agg, Wait       1065525              625
Agg, Wait          1158477              8743
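The pattern aggTime measures looks roughly like the following: the hot path increments a counter in a ThreadLocal map, and the aggregation step drains those local counts into shared totals behind a synchronized method, standing in for the MutableRate update. All names here (the class, record/aggregate, the "getFileInfo" key) are illustrative assumptions, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the thread-local tracking plus synchronized aggregation pattern.
public class MetricsAggSketch {
    // per-thread counts keyed by operation name: cheap, uncontended updates
    static final ThreadLocal<Map<String, Long>> LOCAL =
        ThreadLocal.withInitial(HashMap::new);
    // shared totals, guarded by the synchronized aggregate() method
    static final Map<String, Long> SHARED = new HashMap<>();

    // hot path: bump a thread-local counter, no shared state touched
    static void record(String op) {
        LOCAL.get().merge(op, 1L, Long::sum);
    }

    // slow path: fold all local counts into the shared map, then reset
    static synchronized void aggregate() {
        for (Map.Entry<String, Long> e : LOCAL.get().entrySet()) {
            SHARED.merge(e.getKey(), e.getValue(), Long::sum);
        }
        LOCAL.get().clear();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                record("getFileInfo");         // illustrative operation name
                if (i % 50 == 0) aggregate();  // ~50 ops per aggregation
            }
            aggregate();  // flush whatever remains locally
        };
        Thread a = new Thread(worker), b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(SHARED.get("getFileInfo"));  // 2 threads x 1000 = 2000
    }
}
```

The design point being benchmarked is that only aggregate() synchronizes; record() never does, which is why the per-operation cost stays in the microsecond range even with 200 contending threads.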
So it seems that even under highly contended conditions an aggregation adds roughly 100-200 microseconds to the execution path, and without contention only ~3-4 microseconds. Given that a typical aggregation period would match the metrics collection interval, say 10-60 seconds, this seems a reasonable cost for a disabled-by-default feature.