  1. Ambari
  2. AMBARI-24244

Grafana HBase GC Time graph wrong / misleading - hiding large GC pauses ~ 2 dozen secs!



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.2
    • Fix Version/s: None
    • Component/s: ambari-metrics, metrics
    • Labels:


      Ambari's built-in Grafana graph for "JVM GC Times" in the HBase - RegionServers dashboard is very wrong and doesn't reflect the pause times I've found by grepping the HBase RegionServer logs for util.JvmPauseMonitor.

      I've inherited a very heavily loaded HBase + OpenTSDB cluster where RegionServers are being lost because GC pauses of around 30 seconds cause ZK + HMaster to declare them dead. The Grafana graph shows peaks of only around 70ms because it averages the GC time spent over all seconds, which smooths out the peaks so that no problem is visible. If you are going to use GCTimeMillis then I believe you need to divide by GCCount.
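
      To illustrate the difference, here is a minimal sketch of the two calculations, assuming two consecutive samples of the cumulative GC time and GC count counters. The function names, the dictionary keys and the sample numbers are hypothetical and are not taken from the dashboard definition.

```python
def time_per_second_ms(prev, curr, interval_s):
    """GC milliseconds spent per wall-clock second between two samples."""
    return (curr["time_ms"] - prev["time_ms"]) / interval_s


def mean_pause_ms(prev, curr):
    """Average GC milliseconds per collection between two samples."""
    d_time = curr["time_ms"] - prev["time_ms"]
    d_count = curr["count"] - prev["count"]
    if d_count <= 0:
        # No collection finished in this interval (or the counters were reset).
        return 0.0
    return d_time / d_count


# Hypothetical samples taken 60 seconds apart: one collection lasting 30 s.
prev = {"time_ms": 100_000, "count": 40}
curr = {"time_ms": 130_000, "count": 41}

per_second = time_per_second_ms(prev, curr, interval_s=60)
per_collection = mean_pause_ms(prev, curr)
```

      With these made-up numbers the per-second figure is 500 ms while the per-collection figure is the full 30,000 ms. Dividing by the count is still an average, so several short collections in the same interval would dilute one long pause.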

      Otherwise I believe this is actually the wrong metric to be watching, and the following metric from HBase JMX should be monitored instead, taking its last value. This does show the significant GC time spent:

      java.lang:type=GarbageCollector,name=G1 Old Generation -> LastGcInfo -> duration

      Obviously make it a regex that matches whichever garbage collector you are using, whether G1 or CMS etc:

      java.lang:type=GarbageCollector,name=.*Old Gen.*  -> LastGcInfo -> duration
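
      As a rough sketch of that lookup, the snippet below filters a list of JMX beans by a name pattern and reads LastGcInfo -> duration from each match. It assumes the beans have already been loaded into a Python dict shaped like the JSON served by a JMX-over-HTTP endpoint ({"beans": [...]}); that shape, the function name and the sample values are assumptions for illustration. The pattern is a parameter because bean names differ between collectors (CMS, for example, is usually exposed as ConcurrentMarkSweep rather than under an "Old Gen" name), so check the names on your own JVM.

```python
import re

# Pattern from the report; matches G1's old-generation collector bean.
OLD_GEN = r"java\.lang:type=GarbageCollector,name=.*Old Gen.*"


def last_gc_durations_ms(jmx, pattern=OLD_GEN):
    """Return {bean name: last GC duration in ms} for beans matching pattern."""
    matcher = re.compile(pattern)
    durations = {}
    for bean in jmx.get("beans", []):
        name = bean.get("name", "")
        if not matcher.fullmatch(name):
            continue
        info = bean.get("LastGcInfo")
        if info is None:
            # This collector has not run yet, so there is no last GC to report.
            continue
        durations[name] = info["duration"]
    return durations


# Hypothetical, hand-written sample data (not real cluster output).
sample = {
    "beans": [
        {"name": "java.lang:type=GarbageCollector,name=G1 Young Generation",
         "LastGcInfo": {"duration": 45}},
        {"name": "java.lang:type=GarbageCollector,name=G1 Old Generation",
         "LastGcInfo": {"duration": 24000}},
        {"name": "java.lang:type=Memory"},
    ]
}

result = last_gc_durations_ms(sample)
```

      Graphing this value with a "last" (or max) aggregation rather than an average keeps a single long pause visible.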

      Right now the GC Times graph is worse than useless: it is misleading, because it implies there are no GC issues when there are actually very large, very severe GC issues on this cluster.

      This is a vanilla Ambari deployed Grafana with Ambari Metrics.





              • Assignee:
                harisekhon Hari Sekhon


                • Created: