Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: metrics, ops-tooling
    • Labels: None
    • Target Version/s:

      Description

      Recently I was debugging a cluster that appeared to have network issues. Only after a lot of investigation did I realize that the reactor threads were not keeping up with network traffic because they were hitting KUDU-1964 (this cluster was running 1.3.0). At first glance the reactors did not seem busy, since each was using only ~25% of a CPU. However, the other 75% of the time was spent blocked on OpenSSL locks, not in epoll_wait as one would normally expect.

      This would be easier to diagnose if we had a metric showing how much time the reactors spend idle (i.e., in epoll_wait) versus doing work (i.e., executing callbacks, performing IO, etc.). If any reactor spends a high percentage of its time outside epoll_wait, that suggests the reactors may be a bottleneck that is increasing latency or degrading throughput.
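
      To make the proposed metric concrete, here is a minimal sketch of the accounting, written in Python for brevity. It is not Kudu's implementation: the class name ReactorLoadTracker and its methods are hypothetical, and the only assumption is that the reactor loop can call a hook immediately before and immediately after its blocking poll (epoll_wait in the case above). Time between "after poll" and the next "before poll" is counted as work, regardless of whether the thread was on CPU or blocked on a lock.

      ```python
      import time


      class ReactorLoadTracker:
          """Hypothetical helper: splits a reactor loop's wall-clock time
          into idle time (inside the poll call) and busy time (everything else)."""

          def __init__(self, clock=time.monotonic):
              # The clock is injectable so the accounting can be tested
              # deterministically; monotonic time avoids wall-clock jumps.
              self._clock = clock
              self._last = clock()
              self.idle_seconds = 0.0
              self.busy_seconds = 0.0

          def before_poll(self):
              # Everything since the previous poll returned counts as work,
              # including time spent blocked on locks inside callbacks.
              now = self._clock()
              self.busy_seconds += now - self._last
              self._last = now

          def after_poll(self):
              # Everything spent inside the poll call counts as idle.
              now = self._clock()
              self.idle_seconds += now - self._last
              self._last = now

          def busy_fraction(self):
              # Fraction of observed time spent outside the poll call.
              total = self.idle_seconds + self.busy_seconds
              return self.busy_seconds / total if total > 0 else 0.0
      ```

      Because the measurement is wall-clock time rather than CPU time, a reactor like the one described above would report a busy fraction near 1.0 even though it used only ~25% of a CPU, which is exactly the signal that was missing during the investigation.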

            People

            • Assignee:
              tlipcon Todd Lipcon
              Reporter:
              tlipcon Todd Lipcon
