HBase / HBASE-8416

Region Server Spinning on JMX requests

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 0.94.4
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels:
      None

      Description

      This morning, one of our region servers (we have 44) stopped responding
      to the '/jmx' request. (It's working for regular activity.) Additionally,
      the region server is now using all the CPU on the host, running all 8
      cores at 100%.

      A full jstack is at:
      http://pastebin.com/dGTmTEN7

      Right now, there are 37 threads stuck here:
      "38565532@qtp-228776471-196" prio=10 tid=0x00002aaacc4f2800 nid=0x7f57 runnable [0x0000000054a48000]
      java.lang.Thread.State: RUNNABLE
      at java.util.HashMap.get(HashMap.java:303)
      at org.apache.hadoop.metrics.util.MetricsDynamicMBeanBase.getAttribute(MetricsDynamicMBeanBase.java:137)
      at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
      at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
      at org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:315)
      at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:293)
      at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:193)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
      at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
      at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1056)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
      at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
      at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
      at org.mortbay.jetty.Server.handle(Server.java:326)
      at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
      at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
      at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
      at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
      at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
      at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
      at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

        Issue Links

          Activity

          Liang Xie added a comment -

          Per my understanding, it's a hadoop-common issue, not an HBase issue.
          Ron Buckley, could you provide your HDFS version, so we can correlate the right source code with the above stack trace?
          PS: it should be caused by concurrent writes to a HashMap without synchronization. I guess it's "metricsRateAttributeMod". There are at least two choices:
          1) like the MetricsRegistry style, add "synchronized"
          2) use ConcurrentHashMap
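A minimal sketch of option (2), assuming a field shaped like hadoop-common's metricsRateAttributeMod (the surrounding class and method names here are hypothetical, for illustration only):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the metrics class: the only change from the
// buggy version is the map implementation, so concurrent put()/get() calls
// can no longer corrupt a bucket chain.
public class MetricsRateSketch {
    // Before: new HashMap<String, Object>() -- unsafe under concurrent writes.
    private final Map<String, Object> metricsRateAttributeMod =
        new ConcurrentHashMap<>();

    public void recordModification(String attributeName, Object mod) {
        metricsRateAttributeMod.put(attributeName, mod);
    }

    public Object getAttributeMod(String attributeName) {
        // Safe to call concurrently with put(); with a plain HashMap this
        // is the call the stuck threads above were spinning in.
        return metricsRateAttributeMod.get(attributeName);
    }
}
```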

          Ron Buckley added a comment -

          We're running CDH 4.1.1, which lists 2.0.0-mr1-cdh4.1.1

          Anoop Sam John added a comment - - edited

          Yes Liang Xie, what you say can be the root cause. 100% CPU usage indicates HashMap corruption because of non-thread-safe access. We fixed something like this in the trunk metrics code recently (that is in HBase code), so we can check the hadoop-common metrics part.
          See the access of the HashMap referred to in org.apache.hadoop.metrics.util.MetricsDynamicMBeanBase.getAttribute() in all places and make sure it is synchronized properly.

          Liang Xie added a comment -

          Correct, MetricsDynamicMBeanBase.java:137 under the CDH 4.1.1 source shows:

          Object o = metricsRateAttributeMod.get(attributeName);
          
          Anoop Sam John added a comment -

          Non-synchronized usage of a HashMap can make the linked list within a bucket become a loop, causing an endless loop during get() (so 100% CPU). We have seen such cases in the past.
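The failure mode described here comes from unsynchronized concurrent writes; option (1) above, serializing access with a synchronized wrapper, avoids it. A small demo under that assumption (class and key names are hypothetical):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Several threads put into the map concurrently. With a synchronized wrapper
// the map stays consistent and no entries are lost; the same workload against
// a bare pre-Java-8 HashMap can corrupt a bucket chain during resize and
// leave later get() calls spinning forever, as seen in the jstack above.
public class SyncMapDemo {
    public static int concurrentPuts(int threads, int perThread)
            throws InterruptedException {
        Map<String, Integer> map =
            Collections.synchronizedMap(new HashMap<>());
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Keys are unique per thread, so every put adds an entry.
                    map.put("k-" + id + "-" + i, i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return map.size(); // threads * perThread when nothing is lost
    }
}
```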

          Andrew Purtell added a comment - - edited

          I agree with Anoop. In that pastebin it looks like a bunch of threads are spinning in HashMap#get (called from org.apache.hadoop.metrics.util.MetricsDynamicMBeanBase.getAttribute), but one is also spinning in HashMap#put (called from org.apache.hadoop.metrics.util.MetricsDynamicMBeanBase.createMBeanInfo). I would theorize there was a race between concurrent puts, and afterwards any thread that touches the bucket is captured into an infinite loop.

          Lars Hofhansl added a comment -

          This is bad... Making critical.

          Anoop Sam John added a comment -

          Per my understanding, it's a hadoop-common issue, not a HBASE issue.

          So how do we plan to solve this issue?

          ramkrishna.s.vasudevan added a comment -

          Should HBase take care of the synchronization here? Otherwise it would be a fix that is needed in hadoop-common.

          Liang Xie added a comment -

          I filed HADOOP-9504 to resolve it at the hadoop-common layer.
          But it would be nice/safer if we (HBase) did some extra work; I am trying to search for related code under the HBase src.

          Liang Xie added a comment -

          Oh, I found this in HttpServer.java under the hadoop-common source:

            protected void addDefaultServlets() {
              // set up default servlets
              addServlet("stacks", "/stacks", StackServlet.class);
              addServlet("logLevel", "/logLevel", LogLevel.Servlet.class);
              addServlet("metrics", "/metrics", MetricsServlet.class);
              addServlet("jmx", "/jmx", JMXJsonServlet.class);
              addServlet("conf", "/conf", ConfServlet.class);
            }
          

          Hmm, it's really a hadoop-common issue; it seems we (HBase) cannot control it any further.
          I think we can close this JIRA now...
          Lars Hofhansl, Anoop Sam John, Andrew Purtell, ramkrishna.s.vasudevan, Ron Buckley, any thoughts? Thanks.

          Anoop Sam John added a comment -

          +1, we need the fix in hadoop-common.

          Ron Buckley added a comment -

          We're good with the fix in hadoop common.

          ramkrishna.s.vasudevan added a comment -

          +1 on fixing hadoop-common issue.

          Andrew Purtell added a comment -

          Resolved as Won't Fix; see HADOOP-9504.

          Lars Hofhansl added a comment -

          Meh... Agreed, nothing we can do on the HBase side.

          Elliott Clark added a comment -

          I think our modification of extended attributes could also have this issue. So I think we should synchronize MetricsMBeanBase#init; then our hash map won't have this issue. The hash map in hadoop-common will still be an issue, but we need to fix both.

          Liang Xie added a comment -

          Elliott Clark, you are right, it's probably a potential bomb there.

          Do we need to file another JIRA or just attach a patch here? I prefer a new JIRA, because the error stack will be different from the "/jmx request" one above. The change should be like:

            protected Map<String,MetricsBase> extendedAttributes =
                new ConcurrentHashMap<String,MetricsBase>();
          

            People

            • Assignee:
              Unassigned
              Reporter:
              Ron Buckley
            • Votes:
              0
              Watchers:
              10

              Dates

              • Created:
                Updated:
                Resolved:

                Development