  1. Geode
  2. GEODE-9002

Add New Statistic Type For /proc/schedstat

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • statistics

    Description

      Linux performance icon Brendan Gregg advocates the USE method of performance analysis: Utilization, Saturation, and Errors.

      When it comes to CPU, Geode captures a number of utilization statistics. Some are direct, like the LinuxSystemStats cpuIdle and cpuActive statistics. Others are indirect, like:

      • DistributionStats
        • heartbeatsSent: you may see a gap in the every-five-seconds heartbeats
      • StatSampler
        • delayDuration: you may see a rise when CPU is scarce
        • sampleCount: you may see an interruption in the regular once-per-second sampling
      • (G1GC collector)
        • (various memory utilization statistics may indicate memory pressure which in turn can give rise to long GC pauses)
      • LinuxSystemStats
        • cpuSteal: indicating that the virtualization environment has not given the VM its share of CPU

       

      But utilization statistics alone can't tell you when a resource (like CPU) is saturated, i.e. when demand exceeds the capacity to service it. If you look only at utilization metrics, a saturated system can look a lot like a system just below saturation. To tell the difference, saturation metrics are needed.

      In the case of CPU, there is a conceptual queue in front of each processor. Tasks (operating system threads) that are ready to run enter the queue and, after some delay, are given a time slice on an actual physical CPU.

      You might think that Geode's LinuxSystemStats loadAverage1, loadAverage5, and loadAverage15 statistics fit the bill. Those statistics do provide some saturation information. The problem is that they conflate CPU with I/O and other things (see [Linux Load Averages: Solving the Mystery|http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html]).

      A better, more specific measure of CPU saturation is available through statistics exposed via the /proc/schedstat virtual file.

      When this ticket is complete, there will be a new statistic type called LinuxThreadScheduler, with four associated statistics gathered directly from /proc/schedstat or derived from data gathered from it:

      • runningTimeNanos: sum of all time spent running by tasks on this processor in nanoseconds
      • queuedTimeNanos: sum of all time spent waiting to run by tasks on this processor in nanoseconds
      • tasksScheduledCount: number of tasks (not necessarily unique) given to the processor
      • meanTaskQueuedTimeNanos: average time that a ready-to-run task waited for a CPU, since the last sample, in nanoseconds
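
      The first three statistics are cumulative counters, while meanTaskQueuedTimeNanos is derived from the change between two consecutive samples. A minimal sketch of that derivation follows. The class and method names are hypothetical and not part of Geode, and returning 0 when no tasks were scheduled in the interval is an assumption, not something this ticket specifies.

```java
/** Sketch: derive meanTaskQueuedTimeNanos from two consecutive samples of one CPU. */
final class SchedStatDelta {

  /** Cumulative counters for one CPU, as read at one sample time (hypothetical holder). */
  static final class Sample {
    final long runningTimeNanos;
    final long queuedTimeNanos;
    final long tasksScheduledCount;

    Sample(long runningTimeNanos, long queuedTimeNanos, long tasksScheduledCount) {
      this.runningTimeNanos = runningTimeNanos;
      this.queuedTimeNanos = queuedTimeNanos;
      this.tasksScheduledCount = tasksScheduledCount;
    }
  }

  /** Average time a ready-to-run task waited for this CPU between the two samples. */
  static long meanTaskQueuedTimeNanos(Sample previous, Sample current) {
    long tasks = current.tasksScheduledCount - previous.tasksScheduledCount;
    if (tasks <= 0) {
      // Assumption: report 0 when nothing was scheduled during the interval.
      return 0L;
    }
    long queuedNanos = current.queuedTimeNanos - previous.queuedTimeNanos;
    return queuedNanos / tasks;
  }

  public static void main(String[] args) {
    Sample previous = new Sample(1_000_000L, 200_000L, 100L);
    Sample current = new Sample(1_900_000L, 500_000L, 150L);
    // 300000 ns of queueing spread over 50 scheduled tasks
    System.out.println(meanTaskQueuedTimeNanos(previous, current)); // prints 6000
  }
}
```

      A rising value here, while cpuActive stays flat near its ceiling, is the kind of signal that separates a saturated CPU from one that is merely busy.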

      One statistics instance will be gathered for each CPU. A Geode process running on a two-CPU system will therefore capture two instances, named "cpu0" and "cpu1", each of this new type.
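
      As a rough illustration of the per-CPU gathering, the sketch below parses the text of /proc/schedstat into one entry per CPU. It assumes the layout where each "cpuN" line ends with three fields: time spent running, time spent waiting to run, and the number of time slices run. Check the layout against the "version" line on the target kernel. The class name, method name, and sample values are hypothetical, and the "domain0" line in the sample is abbreviated.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: extract per-CPU scheduler counters from /proc/schedstat content. */
final class SchedStatParser {

  /**
   * Maps each CPU name ("cpu0", "cpu1", ...) to
   * {runningTimeNanos, queuedTimeNanos, tasksScheduledCount}.
   */
  static Map<String, long[]> parse(List<String> lines) {
    Map<String, long[]> byCpu = new LinkedHashMap<>();
    for (String line : lines) {
      String[] tokens = line.trim().split("\\s+");
      // Keep only per-CPU lines; skip "version", "timestamp" and "domainN" lines.
      if (tokens.length < 4 || !tokens[0].matches("cpu\\d+")) {
        continue;
      }
      int n = tokens.length;
      // Assumption: the last three fields are running time, waiting time, time slices.
      byCpu.put(tokens[0], new long[] {
          Long.parseLong(tokens[n - 3]),
          Long.parseLong(tokens[n - 2]),
          Long.parseLong(tokens[n - 1])});
    }
    return byCpu;
  }

  public static void main(String[] args) {
    // Made-up sample content for a two-CPU system.
    List<String> sample = Arrays.asList(
        "version 15",
        "timestamp 4296001234",
        "cpu0 0 0 0 0 0 0 5000 1200 30",
        "domain0 3 1 2 3",
        "cpu1 0 0 0 0 0 0 7000 800 20");
    for (Map.Entry<String, long[]> entry : parse(sample).entrySet()) {
      System.out.println(entry.getKey() + " " + Arrays.toString(entry.getValue()));
    }
  }
}
```

      On a Linux host the same method could be fed with Files.readAllLines(Paths.get("/proc/schedstat")). The sample input is inlined here so the sketch also runs where that file does not exist.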

      By default Geode will not gather these new statistics. A TBD Java system property will be used to enable gathering the new LinuxThreadScheduler statistic.

            People

              burcham Bill Burcham