Recently I was debugging a cluster that appeared to have network issues. Only after a lot of investigation did I realize that the reactor threads were not keeping up with network traffic because they were hitting
KUDU-1964 (this cluster was running 1.3.0). At first glance the reactors did not seem busy, since each was using only ~25% of a CPU. However, the other 75% of the time was spent blocked on OpenSSL locks rather than in epoll_wait, as one would normally expect.
This would be easier to diagnose if we had a metric showing the amount of time the reactors spend idle (i.e. in epoll_wait) vs. doing work (i.e. executing callbacks, I/O, etc.). If any reactor spends a high percentage of its time outside of epoll_wait, that suggests the reactors may be a bottleneck that is increasing latency or degrading throughput.
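
To make the proposed metric concrete, here is a minimal sketch of the accounting, written in Python for brevity rather than in the reactor's own code. The class and method names (`ReactorLoadTracker`, `before_wait`, `after_wait`, `busy_fraction`) are hypothetical and not part of Kudu. It assumes the event loop can call a hook immediately before and after its blocking poll call. Because it measures wall-clock time rather than CPU time, time spent blocked on a lock inside a callback counts as "busy", which is the case that CPU usage alone hid here.

```python
import time


class ReactorLoadTracker:
    """Accumulates wall-clock time a (hypothetical) reactor loop spends idle vs. busy."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the accounting can be exercised deterministically.
        self._clock = clock
        self._last = clock()
        self.idle_seconds = 0.0
        self.busy_seconds = 0.0

    def _elapsed(self):
        # Return the time since the previous mark and move the mark to now.
        now = self._clock()
        delta = now - self._last
        self._last = now
        return delta

    def before_wait(self):
        # Call just before blocking in the poll call (e.g. epoll_wait):
        # everything since the last mark was work, including lock waits.
        self.busy_seconds += self._elapsed()

    def after_wait(self):
        # Call right after the poll call returns:
        # everything since the last mark was idle time.
        self.idle_seconds += self._elapsed()

    def busy_fraction(self):
        # Fraction of observed wall-clock time spent outside the poll call.
        total = self.idle_seconds + self.busy_seconds
        return self.busy_seconds / total if total > 0 else 0.0
```

With a fake clock that advances 1s of work followed by 3s of waiting, twice over, `busy_fraction()` comes out to 0.25. A reactor stuck on locks would instead report a fraction close to 1.0 even at low CPU usage, which is the signal this metric is meant to expose.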