Today the ExternalBlockHandler component of the external shuffle service (ESS) exposes some useful metrics, but none for the rate of block transfers. blockTransferRateBytes reports the rate of bytes transferred, but no metric reports the rate of blocks. This matters especially when the ESS runs on HDDs, which are sensitive to random reads. Many small block transfers can hurt performance without showing up as a spike in blockTransferRateBytes, because each transfer is small.
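To illustrate why a byte-rate metric alone can hide this pattern, here is a minimal sketch in plain Java. It does not use ExternalBlockHandler or the metrics library; the class and method names (BlockTransferCounters, recordBlock, simulate) are hypothetical and exist only for this example. The actual change would presumably add a meter for blocks alongside blockTransferRateBytes.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for a pair of meters: one counting bytes, one counting blocks. */
final class BlockTransferCounters {
    private final LongAdder bytes = new LongAdder();
    private final LongAdder blocks = new LongAdder();

    /** Records a single block transfer of the given size. */
    void recordBlock(long sizeBytes) {
        bytes.add(sizeBytes);
        blocks.increment();
    }

    long totalBytes() { return bytes.sum(); }

    long totalBlocks() { return blocks.sum(); }

    /** Replays {@code count} transfers of {@code sizeBytes} each and returns the counters. */
    static BlockTransferCounters simulate(int count, long sizeBytes) {
        BlockTransferCounters counters = new BlockTransferCounters();
        for (int i = 0; i < count; i++) {
            counters.recordBlock(sizeBytes);
        }
        return counters;
    }
}
```

Calling simulate(10, 1024L * 1024L) and simulate(10_240, 1024L) yields the same byte total in both cases, but the second workload issues 10,240 reads instead of 10. Only a block-count metric tells the two apart.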
We can also enhance YarnShuffleServiceMetrics to expose histogram-style metrics from the Timer instances within ExternalBlockHandler. Today it exposes only each Timer's count and rate, not the timing information available from its Snapshot.
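As a rough illustration of what timing information adds beyond count and rate, the sketch below summarizes a set of latency samples with nearest-rank percentiles. It is plain Java with hypothetical names (LatencySummary, percentile) and does not reproduce how the Timer's Snapshot is computed; it only shows the kind of values a histogram-style view would surface.

```java
import java.util.Arrays;

/** Hypothetical summary of latency samples, for illustration only. */
final class LatencySummary {
    private final long[] sortedMillis;

    LatencySummary(long[] samplesMillis) {
        // Copy before sorting so the caller's array is left untouched.
        sortedMillis = samplesMillis.clone();
        Arrays.sort(sortedMillis);
    }

    int count() { return sortedMillis.length; }

    /** Nearest-rank percentile for an integer percentage in [1, 100]. */
    long percentile(int pct) {
        if (sortedMillis.length == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        if (pct < 1 || pct > 100) {
            throw new IllegalArgumentException("pct must be in [1, 100]");
        }
        // Integer ceiling of pct * n / 100, which avoids floating-point rounding.
        long rank = ((long) pct * sortedMillis.length + 99) / 100;
        return sortedMillis[(int) rank - 1];
    }

    long max() { return percentile(100); }
}
```

With nine samples of 2 ms and one of 500 ms, the count is 10 and the median is 2 ms, while the maximum is 500 ms. A count or rate alone would give no hint of the slow outlier.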
These two changes can make it easier to monitor the health of the ESS.