Currently, MetricsConsumerBolt hands data points to MetricsConsumer synchronously.
When MetricsConsumer cannot keep up, backpressure is triggered once (queue size + overflow buffer size) reaches the high watermark, which slows down the whole topology.
Slowing down is not a problem in itself, because that's what backpressure is for. The actual problem is that backpressure only throttles spouts, not metrics. If MetricsConsumerBolt cannot keep up with incoming metrics tuples, backpressure never ends and the topology just hangs. If we turn off backpressure, the queue is unbounded and the worker could eventually throw an OOME.
Making MetricsConsumerBolt asynchronous can resolve this issue. One downside of making it async is that it becomes hard to tell whether MetricsConsumerBolt is keeping up, since its capacity will always be around 0.
I don't have a solution for that yet, but I think it's still better than the current behavior.
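To make the asynchronous hand-off concrete, here is a minimal sketch in plain Java. It is not Storm code: `AsyncMetricsHandoff`, `submit`, `pendingCount` and `droppedCount` are hypothetical names, and the bounded queue that drops data points when full is an assumption of this sketch, not a settled design. The idea is that the bolt's execute path only enqueues and returns, while a separate thread calls the slow consumer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Storm): passes data points to a slow consumer on its own thread. */
final class AsyncMetricsHandoff<T> implements AutoCloseable {
    private final BlockingQueue<T> pending;               // bounded, so memory use stays capped
    private final AtomicLong dropped = new AtomicLong();  // data points rejected because the queue was full
    private final Thread worker;
    private volatile boolean running = true;

    AsyncMetricsHandoff(int capacity, Consumer<T> consumer) {
        this.pending = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                while (running) {
                    T dataPoints = pending.take();        // blocks until something is queued
                    try {
                        consumer.accept(dataPoints);      // the slow call now runs off the caller's thread
                    } catch (RuntimeException e) {
                        // Assumption: one failing batch should not kill the worker thread.
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();       // interrupted by close(): exit the loop
            }
        }, "metrics-consumer-handoff");
        worker.setDaemon(true);
        worker.start();
    }

    /** Never blocks. Returns false and counts a drop when the queue is full. */
    boolean submit(T dataPoints) {
        boolean accepted = pending.offer(dataPoints);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    int pendingCount() { return pending.size(); }

    long droppedCount() { return dropped.get(); }

    @Override
    public void close() {
        running = false;
        worker.interrupt();
    }
}
```

With something like this, the caller would do `handoff.submit(dataPoints)` and return immediately, so the bolt itself never falls behind. The pending and dropped counters might be one way to see whether the consumer is keeping up, but that is only a guess on my side.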
Before we reach consensus on a larger overhaul of metrics, I'd love to improve the current metrics without breaking backward compatibility. This change could be applied to 1.x-branch, and even 0.10.x-branch.