  1. Hadoop YARN
  2. YARN-8414

Nodemanager crashes soon if ATSv2 HBase is either down or absent

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: yarn
    • Labels: None

      Description

      The test cluster had 1000 apps running when a user triggered capacity scheduler queue changes. This crashed all node managers. It looks like the node managers ran into "Too many open files" while aggregating logs for containers:

      2018-06-07 21:17:59,307 WARN  server.AbstractConnector (AbstractConnector.java:handleAcceptFailure(544)) -
      java.io.IOException: Too many open files
              at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
              at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
              at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
              at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
              at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
              at java.lang.Thread.run(Thread.java:745)
      2018-06-07 21:17:59,758 WARN  util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
      2018-06-07 21:17:59,758 WARN  util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
      2018-06-07 21:18:00,842 WARN  client.ConnectionUtils (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, please check your network
      java.net.UnknownHostException: host1.example.com: System error
              at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
              at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
              at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
              at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
              at java.net.InetAddress.getAllByName(InetAddress.java:1192)
              at java.net.InetAddress.getAllByName(InetAddress.java:1126)
              at java.net.InetAddress.getByName(InetAddress.java:1076)
              at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
              at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
              at org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
              at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
              at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
              at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
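
The "Too many open files" error above suggests the process reached its file descriptor limit. A minimal sketch for comparing a process's open descriptor count against its soft limit follows, assuming a Linux host where `/proc/<pid>/fd` and `/proc/<pid>/limits` are readable. The helper names `parse_nofile_soft_limit` and `fd_usage` are hypothetical and not part of Hadoop.

```python
import os


def parse_nofile_soft_limit(limits_text):
    """Extract the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None when the row is missing or the limit is 'unlimited'.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
            soft = fields[0]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid, proc_root="/proc"):
    """Return (open_fd_count, soft_limit) for a process.

    Assumption: the caller may read the target's /proc entries, which
    usually means running as the same user or as root.
    """
    base = os.path.join(proc_root, str(pid))
    # Each entry under fd/ is one open descriptor of the process.
    open_fds = len(os.listdir(os.path.join(base, "fd")))
    with open(os.path.join(base, "limits")) as f:
        soft = parse_nofile_soft_limit(f.read())
    return open_fds, soft
```

Calling `fd_usage(pid)` with the NodeManager's process ID while the cluster is under load would show how close the process is to its limit. Whether the limit itself was too low in this incident is not established by the report.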
      

      Timeline service has thousands of exceptions:

      2018-06-07 21:18:34,182 ERROR client.AsyncProcess (AsyncProcess.java:submit(291)) - Failed to get region location
      java.io.InterruptedIOException
              at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
              at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
              at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
              at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
              at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
              at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
              at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
              at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
              at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
              at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
              at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
              at org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
              at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
              at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
              at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
              at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
              at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
              at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
              at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
              at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
              at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
              at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
              at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
              at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
              at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
              at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
              at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
              at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
              at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
              at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
              at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
              at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
              at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
              at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
              at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
              at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
              at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
              at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
              at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
              at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
              at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
              at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
              at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
              at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
              at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
              at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
              at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
              at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
              at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
              at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
              at org.eclipse.jetty.server.Server.handle(Server.java:534)
              at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
              at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
              at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
              at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
              at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
              at java.lang.Thread.run(Thread.java:745)
      2018-06-07 21:18:36,266 INFO  retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "host1.example.com":8020; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getServerDefaults over host1.example.com:8020 after 10 failover attempts. Trying to failover after sleeping for 9634ms.
      2018-06-07 21:18:36,612 WARN  storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1528316765723_0030 userId=csingh clusterId=yarn-cluster . Not proceeding with writing to hbase
      2018-06-07 21:18:38,396 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6, retries=6, started=4213 ms ago, cancelled=false, msg=Call to host1.example.com/142.26.32.112:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: host12.example.com/142.26.32.112:17020, details=row 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host12.example.com,17020,1528302866813, seqNum=-1
      2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
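
With thousands of exceptions in a log, a tally by logger and by exception class can show which failure dominates. The sketch below assumes log lines shaped like the excerpts above (timestamp, level, logger, then the message, with the exception class name at the start of the following line). The function name `tally_log` is hypothetical.

```python
import re
from collections import Counter

# Header lines such as:
# "2018-06-07 21:18:34,182 ERROR client.AsyncProcess (AsyncProcess.java:submit(291)) - ..."
HEADER_RE = re.compile(
    r"^\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s+([\w.]+) "
)

# Lines that begin with a fully qualified exception class, such as:
# "java.io.IOException: Too many open files"
EXCEPTION_RE = re.compile(
    r"^\s*((?:[A-Za-z_][\w$]*\.)+[A-Z][\w$]*(?:Exception|Error))\b"
)


def tally_log(lines):
    """Return (Counter keyed by (level, logger), Counter keyed by exception class)."""
    by_logger = Counter()
    by_exception = Counter()
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            by_logger[(header.group(1), header.group(2))] += 1
            continue
        exc = EXCEPTION_RE.match(line)
        if exc:
            by_exception[exc.group(1)] += 1
    return by_logger, by_exception
```

Stack frame lines (those starting with `at`) match neither pattern and are skipped. Exceptions that appear only inside a message on the header line, as in the `RetryInvocationHandler` entry, are counted under their logger but not under an exception class.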
      

      The nodes were temporarily unable to resolve hostnames to IP addresses.
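
A quick way to confirm a resolution problem of this kind from an affected node is to try resolving the relevant hostnames a few times and list the ones that never succeed. The sketch below uses Python's standard `socket.getaddrinfo`. The function name `unresolved_hosts` and its retry parameters are hypothetical, and the `resolver` and `sleep` arguments exist only so the check can be exercised without real DNS.

```python
import socket
import time


def unresolved_hosts(hosts, attempts=3, delay_s=1.0,
                     resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return the hosts that failed to resolve on every attempt."""
    failed = []
    for host in hosts:
        for attempt in range(attempts):
            try:
                resolver(host, None)
                break  # resolved, move on to the next host
            except socket.gaierror:
                # Wait before retrying, but not after the final attempt.
                if attempt + 1 < attempts:
                    sleep(delay_s)
        else:
            # The loop ended without a break: every attempt failed.
            failed.append(host)
    return failed
```

For example, `unresolved_hosts(["host1.example.com", "host12.example.com"])` would return the subset of those names that the node cannot resolve. An empty list means resolution currently works, which is consistent with a temporary outage having passed.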

            People

            • Assignee: Unassigned
            • Reporter: Eric Yang (eyang)
            • Votes: 0
            • Watchers: 7
