There are two key metrics which I think we lack and which would really help with visibility into how HMS is scaling.
Total API calls duration stats
We already compute and log the duration of API calls in the PerfLogger, but we don't have a gauge or timer for the average duration of an API call over a recent bucket of time. Such a metric would tell us whether load on the server is increasing the average API response time.
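As a rough illustration of the idea, the sketch below keeps a running total and count of call durations and reports the mean for each reporting window. The class `ApiCallTimer` and its methods are hypothetical names, not existing HMS or PerfLogger code; it assumes the duration that is already computed for logging can also be passed to `record`.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper: mean API call duration per reporting window. */
final class ApiCallTimer {
    private final LongAdder totalNanos = new LongAdder();
    private final LongAdder callCount = new LongAdder();

    /** Record one completed API call (assumed to be called where the duration is already known). */
    void record(long durationNanos) {
        totalNanos.add(durationNanos);
        callCount.increment();
    }

    /**
     * Returns the mean duration in milliseconds of calls recorded since the
     * previous snapshot, then resets both counters. Calls recorded while the
     * snapshot is being taken may be split across two windows, so the value
     * is approximate under concurrent load.
     */
    synchronized double snapshotAndResetMillis() {
        long count = callCount.sumThenReset();
        long nanos = totalNanos.sumThenReset();
        return count == 0 ? 0.0 : (nanos / 1_000_000.0) / count;
    }
}
```

A metrics reporter could call `snapshotAndResetMillis()` on a fixed schedule and publish the result as a gauge, so each value covers one bucket of time.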
Connection Pool stats
We can use different connection pooling libraries such as BoneCP or HikariCP. These pool managers expose statistics such as the average time spent waiting to get a connection, the number of active connections, etc. We should expose these as metrics so that we can track whether the configured connection pool size is too small and the pool is saturating.
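One possible shape for this, sketched under assumptions: a small adapter interface hides which pool library is in use, and a sampler turns its counters into named gauges. `PoolGauges`, `PoolStats`, and the gauge names are hypothetical and not part of HMS, BoneCP, or HikariCP; a real adapter would read the values from whatever statistics the configured pool library provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PoolGauges {
    /** Hypothetical adapter over whichever pool library is configured. */
    interface PoolStats {
        int activeConnections();
        int maxPoolSize();
        int threadsAwaitingConnection();
    }

    /** Samples the pool once and returns the values to publish as gauges. */
    static Map<String, Number> sample(PoolStats stats) {
        Map<String, Number> gauges = new LinkedHashMap<>();
        gauges.put("pool.active", stats.activeConnections());
        gauges.put("pool.max", stats.maxPoolSize());
        gauges.put("pool.waiting", stats.threadsAwaitingConnection());
        // Fraction of the configured pool currently in use.
        double utilization = stats.maxPoolSize() == 0
                ? 0.0
                : (double) stats.activeConnections() / stats.maxPoolSize();
        gauges.put("pool.utilization", utilization);
        return gauges;
    }

    /** Saturated here means: every connection is in use and callers are queued. */
    static boolean isSaturated(PoolStats stats) {
        return stats.activeConnections() >= stats.maxPoolSize()
                && stats.threadsAwaitingConnection() > 0;
    }
}
```

The saturation rule is only one reasonable choice; alerting on a sustained high `pool.utilization` or on a rising wait time would work as well.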
These metrics would help catch HMS resource contention problems before they cause jobs to fail.