Ambari / AMBARI-24306

Ambari Metrics + Grafana - add LastGcInfo duration graphs for all server components for all GCs - G1GC Young + Old Gens, CMS and ParNew

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ambari-metrics, metrics
    • Labels: None

      Description

      Feature request: add Grafana graphs of the last value (not an average, please) of the LastGcInfo duration for all 3 major garbage collectors (G1GC exposes separate Young and Old Gen beans):

      • G1GC Young Gen
      • G1GC Old Gen
      • CMS
      • ParNew

      CMS and ParNew example taken from NameNode JMX metrics:

        }, {
          "name" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 11,
            "duration" : 5206,
      ...
        }, {
          "name" : "java.lang:type=GarbageCollector,name=ParNew",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 11,
            "duration" : 6,
       

      G1GC Young and Old Gen example taken from RegionServer JMX metrics:

        }, {
          "name" : "java.lang:type=GarbageCollector,name=G1 Young Generation",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 24,
            "duration" : 120,
      
        }, {
          "name" : "java.lang:type=GarbageCollector,name=G1 Old Generation",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 24,
            "duration" : 19641,
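
A minimal sketch of how these values could be pulled out of a JMX dump for graphing. It assumes the JSON has the usual top-level `beans` array; the function name `last_gc_durations` and the `SAMPLE` document are hypothetical, with the sample trimmed to the fields and values shown above.

```python
import json

GC_PREFIX = "java.lang:type=GarbageCollector,name="


def last_gc_durations(jmx_text):
    """Map each collector name to its LastGcInfo duration in milliseconds."""
    durations = {}
    for bean in json.loads(jmx_text).get("beans", []):
        name = bean.get("name", "")
        if not name.startswith(GC_PREFIX):
            continue
        info = bean.get("LastGcInfo")
        # LastGcInfo may be null if the collector has not run yet.
        if isinstance(info, dict) and "duration" in info:
            durations[name[len(GC_PREFIX):]] = info["duration"]
    return durations


# Hypothetical sample, reduced to the fields quoted in this ticket.
SAMPLE = """
{"beans": [
  {"name": "java.lang:type=GarbageCollector,name=G1 Young Generation",
   "LastGcInfo": {"GcThreadCount": 24, "duration": 120}},
  {"name": "java.lang:type=GarbageCollector,name=G1 Old Generation",
   "LastGcInfo": {"GcThreadCount": 24, "duration": 19641}},
  {"name": "java.lang:type=Memory"}
]}
"""

if __name__ == "__main__":
    for collector, ms in last_gc_durations(SAMPLE).items():
        print(f"{collector}: last GC took {ms} ms")
```

The same function would apply unchanged to the ConcurrentMarkSweep and ParNew beans, since it only keys on the bean name prefix.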
      

      Yes, this old gen GC pause is atrocious, which is why I'm here tuning it. It would help if this were monitored properly in the first place, so that the problem is visible before RegionServers start dying at random from long GC pauses.

      Right now Ambari's Grafana only has GCTimeMillis, which would make one think there is no problem: it shows an averaged-out 40 ms of GC time per second, which is not much help in spotting a long GC pause like this one.
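
A rough sketch of why the averaged figure hides the pause. It treats GCTimeMillis as a cumulative counter and assumes a hypothetical window of 491 seconds containing only the single 19641 ms old gen collection; the window length is chosen for illustration and is not taken from the cluster.

```python
def avg_gc_ms_per_sec(samples):
    """Average GC milliseconds per second between the first and last sample.

    samples: list of (timestamp_seconds, cumulative_gc_millis) pairs.
    """
    (t_start, gc_start), (t_end, gc_end) = samples[0], samples[-1]
    return (gc_end - gc_start) / (t_end - t_start)


# Hypothetical counter samples: one 19641 ms pause inside a 491 s window.
samples = [(0, 0), (491, 19641)]
last_gc_duration_ms = 19641  # LastGcInfo duration quoted in this ticket

if __name__ == "__main__":
    print(f"averaged: {avg_gc_ms_per_sec(samples):.1f} ms of GC per second")
    print(f"last value: {last_gc_duration_ms} ms in a single collection")
```

Under these assumptions the averaged series reads about 40 ms per second, while the last-value series shows the full pause of nearly 20 seconds.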

              People

              • Assignee: Unassigned
              • Reporter: Hari Sekhon (harisekhon)