I'm wondering what our options are here. We can't just disable the logging: it is possible that only the MetricQueryService is unreachable, and that case should be logged.
We could limit the number of log messages in a given time frame, but this would mean that an unreachable MQS might only be logged after a long time.
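To make the trade-off concrete, here is a minimal sketch of such a limiter. `LogRateLimiter` and `tryAcquire` are hypothetical names, not existing Flink classes, and timestamps are passed in explicitly only to keep the example deterministic.

```java
/** Hypothetical helper (not a Flink class): permits at most one log message per interval. */
final class LogRateLimiter {
    private final long intervalMillis;
    // Earliest timestamp at which the next message may be logged.
    private long nextAllowedMillis = Long.MIN_VALUE;

    LogRateLimiter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if the caller may log now; false if the message should be suppressed. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis < nextAllowedMillis) {
            return false;
        }
        nextAllowedMillis = nowMillis + intervalMillis;
        return true;
    }
}
```

With a single limiter shared by all TaskManagers, a failure for one TM right after a message for another TM is suppressed until the interval has elapsed, which is exactly the delay described above.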
Finally, we could track the unreachable status of the MQS for each TaskManager, e.g. in a set that contains the paths. If a request fails, the path is added to the set, and we only log something when it is newly added. Once a request succeeds, the path is removed again. The problem is that we would then need some time-based clean-up code, as the set could otherwise grow indefinitely in cases where many TMs are being replaced (and thus are never reachable again).
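A rough sketch of what this tracking could look like, under a few assumptions: `UnreachableServiceTracker` and its methods are hypothetical names (nothing like this exists in Flink as far as this comment is concerned), the MQS path is treated as a plain `String`, and a map from path to last-failure timestamp stands in for the set so that stale entries can be expired.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not a Flink class): tracks which MQS paths are currently unreachable. */
final class UnreachableServiceTracker {
    // MQS path -> timestamp (ms) of the most recent failed request.
    private final Map<String, Long> unreachable = new ConcurrentHashMap<>();
    private final long expiryMillis;

    UnreachableServiceTracker(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Records a failure; returns true only if the path was not yet tracked, i.e. the caller should log. */
    boolean recordFailure(String path, long nowMillis) {
        return unreachable.put(path, nowMillis) == null;
    }

    /** Records a success, so that a later failure for this path is logged again. */
    void recordSuccess(String path) {
        unreachable.remove(path);
    }

    /** Time-based clean-up: drops paths whose last failure is older than the expiry; returns the number removed. */
    int cleanUp(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = unreachable.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > expiryMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return unreachable.size();
    }
}
```

Because every failed request refreshes the timestamp, a TM that is still being queried keeps its entry and is not logged again, while a replaced TM that is no longer queried eventually ages out when `cleanUp` runs. The clean-up would still need to be triggered periodically by something.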
Sadly, there isn't something like a TaskmanagerStatusListener interface; it would be useful for tracking and cleaning up state per TaskManager.