Thanks Arpit Agarwal for working on this and all for the discussion. I have the following comments on the production side changes. Still reviewing the unit test changes and will post my comments on that soon.
NIT: Line 848: "&& mirrorAddr != null" can be removed.
Line 849: can be simplified with "peerMetrics.addSendPacketDownstream"
Line 1146: NIT: heatbeatTime can be changed to slowPeerReportTime or remove the
parameter by hiding the montonicNow() call inside scheduleNextSlowPeerReport().
Line 52-53: NIT: avoid import *
Line 180. The comments seems incomplete.
Line 212. we should instantiate slowPeerTracker only if dataNodePeerStatsEnabled
Line 1653-1660: NIT: can we tweak the code to avoid calling slowPeers.getSlowPeers()
multiple times in the worst case and maybe avoid the if (LOG.isDebugEnabled()) with
Line 1659: can we use nodeinfo.getIpcAddr() sicne the datanode
Line 142-143: Correct me if I'm wrong, looks like the comments is for stats Map
in Line 137.
Line 398-405. This is a very good document. Can we add a field indicating the
DN aggregate mechanism? This way the NN can enforce consistent aggregation
across all the datanodes. This can be done in a separate ticket.
Line 677: document for dfs.datanode.slow.peers.report.interval? We can open
separate ticket for it.
Great catch on some missing synchronized on rollOverAvgs.
NIT: Line 264: missing @param for minSamples
Line 99-108: We can make this an interface to allow different aggregation methods
(median, 90th percentile) for outlier detection. This can be done in a separate ticket.
We can also use Median/Percentile class from apache common to implement
Line 127: we need to guard the tracing with if (LOG.isTraceEnabled()) to
avoid the implicit sorted.toString() overhead.
Line 44: NIT: typo consistenly -> consistently
Line 144: NIT: the document needs to update to match the code which returns
a map -> sortedset of string.
Line 190: Can we make MAX_NODES_TO_REPORT configurable? This can be fixed in
a separate ticket.