Uploaded image for project: 'Ambari'
  1. Ambari
  2. AMBARI-10111

Tune GC settings for HBase daemon

    XMLWordPrintableJSON

Details

    • Task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.0
    • 2.0.0
    • ambari-metrics
    • None

    Description

      After looking at HBase logs and GC logs, discovered that:
      1. RS stopped due to unreachable ZK, the cause based on logs is long gc pause:

      2015-03-17 15:34:33,810 WARN  [M:0;perf-400-10mar-b-ams-1:48488.oldLogCleaner] util.Sleeper: We slept 78463ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
      2015-03-17 15:34:33,810 WARN  [RS:0;perf-400-10mar-b-ams-1:38900.compactionChecker] util.Sleeper: We slept 26609ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
      2015-03-17 15:34:33,810 WARN  [RS:0;perf-400-10mar-b-ams-1:38900.periodicFlusher] util.Sleeper: We slept 26608ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
      

      2. Looking at gc logs, it seems everything is nice and clean and a sudden unexpected event as seen here:

      2015-03-17T15:34:10.908+0000: 47483.945: [GC2015-03-17T15:34:10.908+0000: 47483.945: [ParNew: 613440K->68096K(613440K), 0.0645970 secs] 10961428K->10444812K(12514816K), 0.0648180 secs] [Times: user=0.34 sys=0.01, real=0.06 secs]
      2015-03-17T15:34:11.525+0000: 47484.562: [GC2015-03-17T15:34:11.525+0000: 47484.562: [ParNew: 613424K->68096K(613440K), 0.0585420 secs] 10956776K->10438129K(12514816K), 0.0590000 secs] [Times: user=0.31 sys=0.02, real=0.06 secs]
      2015-03-17T15:34:12.133+0000: 47485.170: [GC2015-03-17T15:34:12.134+0000: 47485.170: [ParNew (promotion failed): 613346K->613346K(613440K), 0.1492450 secs]2015-03-17T15:34:12.283+0000: 47485.320: [CMS2015-03-17T15:34:28.879+0000: 47501.916: [CMS-concurrent-sweep: 33.271/35.389 secs] [Times: user=144.69 sys=3.76, real=35.39 secs]
       (concurrent mode failure): 10365250K->4753155K(11901376K), 21.5240950 secs] 10955450K->4753155K(12514816K), [CMS Perm : 41519K->41519K(69388K)], 21.6736450 secs] [Times: user=21.93 sys=0.02, real=21.67 secs]
      2015-03-17T15:34:34.584+0000: 47507.621: [GC2015-03-17T15:34:34.584+0000: 47507.621: [ParNew: 545344K->59138K(613440K), 0.0552630 secs] 5298499K->4812293K(12514816K), 0.0554990 secs] [Times: user=0.32 sys=0.02, real=0.06 secs]
      

      The possible reason is GCE VM paused for more that a minute leading to the unexpected event.

      Recommendations:

      • Tune GC settings for the HBase daemon.
      • Create a Canary script to check health of the system and restart the underlying daemons if needed. Creating a Jira for 2.1.

      Attachments

        Activity

          People

            swagle Siddharth Wagle
            swagle Siddharth Wagle
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: