Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: metrics, ops-tooling
    • Labels: None
    • Target Version/s:

      Description

      Recently I was debugging a cluster that appeared to have network issues. Only after a lengthy investigation did I realize that the reactor threads were not keeping up with network traffic because they were hitting KUDU-1964 (this cluster was running 1.3.0). At first glance the reactors did not seem busy, since each was using only ~25% of a CPU. However, the other 75% of the time was spent blocked on OpenSSL locks rather than in epoll_wait, as one would normally expect.

      This would be easier to diagnose if we had a metric showing how much time the reactors spend idle (i.e. in epoll_wait) versus doing work (i.e. executing callbacks, performing IO, etc.). If any reactor spends a high percentage of its time outside epoll_wait, that suggests the reactors may be a bottleneck that is increasing latency or degrading throughput.
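
      A minimal sketch of the accounting behind such a metric. `ReactorLoadTracker` and its methods are hypothetical names for illustration, not Kudu's actual API. The sketch assumes the reactor loop can time each poll call and each stretch of work between polls using wall-clock time, so that time blocked on a lock counts as "not idle" even though it uses no CPU.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of Kudu): accumulates how long one reactor
// thread spends idle (blocked in epoll_wait) versus doing anything else.
class ReactorLoadTracker {
 public:
  using Duration = std::chrono::nanoseconds;

  // Call with the wall-clock time spent blocked in the poll call.
  void AddIdle(Duration d) {
    assert(d.count() >= 0);
    idle_ += d;
  }

  // Call with the wall-clock time spent between polls (callbacks, IO,
  // and any time blocked on locks).
  void AddBusy(Duration d) {
    assert(d.count() >= 0);
    busy_ += d;
  }

  // Percentage of observed time not spent idle, in [0, 100].
  // Returns 0 when nothing has been recorded yet.
  double LoadPercent() const {
    const Duration total = idle_ + busy_;
    if (total.count() == 0) return 0.0;
    return 100.0 * static_cast<double>(busy_.count()) /
           static_cast<double>(total.count());
  }

 private:
  Duration idle_{0};
  Duration busy_{0};
};
```

      With this accounting, a reactor that uses ~25% of a CPU but spends the remaining time blocked on locks would report a high load, because only time inside the poll call is treated as idle.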

        Activity

        tlipcon Todd Lipcon added a comment -

        One slight wrinkle for this metric is that, if there are multiple reactors, the load may be skewed such that only one reactor is "overloaded". We should still expose that case somehow, rather than exposing only an average across the reactors, which would hide it.
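
        To illustrate why an average alone can hide skew, here is a small sketch that summarizes per-reactor load as both a mean and a maximum. `LoadSummary` and `SummarizeReactorLoad` are hypothetical names, and the input is assumed to be one load percentage per reactor.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical summary of per-reactor load percentages.
struct LoadSummary {
  double mean;
  double max;
};

// Computes the mean and the maximum of the per-reactor load values.
// The input must contain at least one reactor.
LoadSummary SummarizeReactorLoad(const std::vector<double>& per_reactor_percent) {
  assert(!per_reactor_percent.empty());
  const double sum = std::accumulate(per_reactor_percent.begin(),
                                     per_reactor_percent.end(), 0.0);
  const double max = *std::max_element(per_reactor_percent.begin(),
                                       per_reactor_percent.end());
  return LoadSummary{sum / static_cast<double>(per_reactor_percent.size()), max};
}
```

        For loads of 10, 12, 95 and 11 percent, the mean is 32 while the maximum is 95, so only the maximum (or a per-reactor breakdown) reveals the single overloaded reactor.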

        tlipcon Todd Lipcon added a comment -

        Another way to get at the same kind of information might be to measure the latency between submitting a task to the ReactorTask queue and that task actually being executed. If we exposed this as a histogram, we would probably be able to see when a reactor is responding slowly for whatever reason, which would lead us more quickly to pstack or profile the reactor.
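
        A rough sketch of the queue-latency idea, under stated assumptions: `QueueLatencyHistogram` and `TimedTaskQueue` are hypothetical and do not reflect the real ReactorTask code. The queue is single-threaded for brevity (a real one would need locking, since tasks are submitted from other threads), and the histogram uses simple power-of-two microsecond buckets.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical histogram: bucket 0 counts 0us latencies, and bucket i >= 1
// counts latencies in [2^(i-1), 2^i) us. The last bucket absorbs overflow.
class QueueLatencyHistogram {
 public:
  static constexpr std::size_t kNumBuckets = 32;

  void Record(std::chrono::microseconds latency) {
    assert(latency.count() >= 0);
    std::uint64_t v = static_cast<std::uint64_t>(latency.count());
    std::size_t bucket = 0;
    while (v != 0 && bucket < kNumBuckets - 1) {
      v >>= 1;
      ++bucket;
    }
    ++counts_[bucket];
  }

  std::uint64_t CountInBucket(std::size_t i) const {
    assert(i < kNumBuckets);
    return counts_[i];
  }

 private:
  std::array<std::uint64_t, kNumBuckets> counts_{};
};

// Hypothetical task queue that stamps each task at submission and records
// the submit-to-run latency just before the task executes.
class TimedTaskQueue {
 public:
  using Clock = std::chrono::steady_clock;
  using NowFn = std::function<Clock::time_point()>;

  // The clock is injectable so the behavior can be tested deterministically.
  explicit TimedTaskQueue(NowFn now = [] { return Clock::now(); })
      : now_(std::move(now)) {}

  void Submit(std::function<void()> fn) {
    tasks_.push_back(Entry{std::move(fn), now_()});
  }

  // Runs all pending tasks on the calling (reactor) thread.
  void RunPending(QueueLatencyHistogram* hist) {
    assert(hist != nullptr);
    while (!tasks_.empty()) {
      Entry e = std::move(tasks_.front());
      tasks_.pop_front();
      hist->Record(std::chrono::duration_cast<std::chrono::microseconds>(
          now_() - e.submitted));
      e.fn();
    }
  }

 private:
  struct Entry {
    std::function<void()> fn;
    Clock::time_point submitted;
  };

  NowFn now_;
  std::deque<Entry> tasks_;
};
```

        A reactor that is slow to drain its queue, for any reason, would show counts shifting toward the higher buckets, which is the signal to start pstacking or profiling.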


          People

          • Assignee: tlipcon Todd Lipcon
          • Reporter: tlipcon Todd Lipcon
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
