  1. Ambari
  2. AMBARI-24306

Ambari Metrics + Grafana - add LastGcInfo duration graphs for all server components for all GCs - G1GC Young + Old Gens, CMS and ParallelNew


Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • ambari-metrics, metrics
    • None

    Description

      Feature request: add Grafana graphs of the last value (not an average, please) of the LastGcInfo duration for all of the major garbage collectors:

      • G1GC Young Gen
      • G1GC Old Gen
      • CMS
      • ParNew

      CMS and ParNew example taken from NameNode JMX metrics:

        }, {
          "name" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 11,
            "duration" : 5206,
      ...
        }, {
          "name" : "java.lang:type=GarbageCollector,name=ParNew",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 11,
            "duration" : 6,
       

      G1GC Young and Old Gen example taken from RegionServer JMX metrics:

        }, {
          "name" : "java.lang:type=GarbageCollector,name=G1 Young Generation",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 24,
            "duration" : 120,
      
        }, {
          "name" : "java.lang:type=GarbageCollector,name=G1 Old Generation",
          "modelerType" : "sun.management.GarbageCollectorImpl",
          "LastGcInfo" : {
            "GcThreadCount" : 24,
            "duration" : 19641,
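
A minimal sketch of how the per-collector last-GC duration could be pulled out of JMX output like the samples above. It assumes the payload has already been fetched and parsed, and that the beans sit under a top-level "beans" array, as the Hadoop /jmx servlet usually returns them. The function name last_gc_durations and the sample payload are illustrative only, not part of Ambari or Ambari Metrics.

```python
import json

GC_PREFIX = "java.lang:type=GarbageCollector,"


def last_gc_durations(jmx_payload):
    """Map each GarbageCollector bean to its LastGcInfo duration (milliseconds)."""
    durations = {}
    # Assumption: beans are listed under a top-level "beans" key.
    for bean in jmx_payload.get("beans", []):
        name = bean.get("name", "")
        if not name.startswith(GC_PREFIX):
            continue
        info = bean.get("LastGcInfo")
        if not info:
            # LastGcInfo can be missing or null if the collector has not run yet.
            continue
        # "…,name=G1 Old Generation" -> "G1 Old Generation"
        _, _, collector = name.partition("name=")
        durations[collector] = info["duration"]
    return durations


# Hypothetical payload built from the RegionServer values quoted above.
sample = json.loads("""
{"beans": [
  {"name": "java.lang:type=GarbageCollector,name=G1 Young Generation",
   "LastGcInfo": {"GcThreadCount": 24, "duration": 120}},
  {"name": "java.lang:type=GarbageCollector,name=G1 Old Generation",
   "LastGcInfo": {"GcThreadCount": 24, "duration": 19641}},
  {"name": "java.lang:type=Memory"}
]}
""")

print(last_gc_durations(sample))
```

Graphing one series per collector name from a mapping like this would keep the Young Gen and Old Gen pauses separate rather than blending them into one figure.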
      

      Yes, this old gen GC is atrocious (a 19,641 ms pause, roughly 19.6 seconds), which is why I'm here to tune it. But it helps if this is monitored properly in the first place, so that the problem is visible before RegionServers start dying at random from long GC pauses.

      Right now Ambari's Grafana only has GCTimeMillis, which would make one think there is no problem: it shows an averaged-out 40 ms of GC time per second, which isn't much help in spotting a long GC pause like this one.
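
To illustrate why an averaged rate hides a single long pause, here is a small arithmetic sketch. The pause list and the 600-second window are hypothetical, chosen only for illustration (the 120 ms and 19641 ms values come from the RegionServer sample above). It does not claim to reproduce how Ambari Metrics actually aggregates GCTimeMillis.

```python
def gc_summary(pauses_ms, window_seconds):
    """Compare averaged GC time per second with the longest single pause."""
    total_ms = sum(pauses_ms)
    return {
        # What an averaged "GC ms per second" style graph would show.
        "avg_ms_per_sec": total_ms / window_seconds,
        # What a last-value / max-duration graph would surface.
        "max_pause_ms": max(pauses_ms, default=0),
    }


# Hypothetical window: ten short young-gen pauses plus one long old-gen pause.
pauses = [120] * 10 + [19641]
summary = gc_summary(pauses, window_seconds=600)
print(summary)
```

Under these assumed numbers the average stays in the tens of milliseconds per second, while the longest pause is close to 20 seconds, which is the gap a LastGcInfo duration graph would expose.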

            People

              Assignee: Unassigned
              Reporter: Hari Sekhon (harisekhon)