  1. Hadoop YARN
  2. YARN-10221

Nodemanager lockups on printEventQueueDetails


Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.2.1
    • None
    • None
    • None

    Description

      We are seeing a rare but critical bug on our production clusters running Hadoop 3.2.1. The central issue is that the NodeManager locks up while trying to print details about the event queue. This feature was added in YARN-8995.
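As a rough illustration of the kind of call the "Public Localizer" trace in this report points to (a `collect` over a stream of a `LinkedBlockingQueue`), here is a minimal sketch that counts queued events by type. The class name, the `EventType` enum, and `countByType` are hypothetical and are not taken from the Hadoop source; the real `printEventQueueDetails` may well differ.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

class EventQueueSummary {
    // Hypothetical event types, for illustration only.
    enum EventType { INIT_CONTAINER, KILL_CONTAINER, RESOURCE_LOCALIZED }

    // Counts queued events per type by streaming over the live queue.
    // The queue is not drained; the stream only traverses it.
    static Map<EventType, Long> countByType(BlockingQueue<EventType> queue) {
        return queue.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
    }

    public static void main(String[] args) {
        BlockingQueue<EventType> queue = new LinkedBlockingQueue<>();
        queue.add(EventType.KILL_CONTAINER);
        queue.add(EventType.INIT_CONTAINER);
        queue.add(EventType.KILL_CONTAINER);
        // Print one line per event type with its count.
        countByType(queue).forEach((type, count) ->
                System.out.println("Event type: " + type + ", count: " + count));
    }
}
```

In the thread dump, the thread running this kind of traversal owns both `ReentrantLock`s that the `put()` and `take()` callers are parked on, which is consistent with producers and the dispatcher stalling for as long as the traversal does not return. Why the traversal never finishes is not established in this report.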

      The main symptoms are:

      • Containers stuck in an Initing phase (ContainersIniting in jmx)
      • NM stops accepting RPC calls

      Failed job submissions manifest as socket timeouts to the RPC port:

      INFO - diagnostics: Application application_1585693823779_0028 failed 1 times (global limit =2; local limit is =1) due to Error launching appattempt_1585693823779_0028_000001. Got exception: java.net.SocketTimeoutException: Call From hadoopresourcesec--0c94ac2238c29f40e.production/10.68.12.37 to hadoopdatanodei--06bad095f795f0725.production:8039 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.68.12.37:59892 remote=hadoopdatanodei--06bad095f795f0725.production/10.68.58.224:8039]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
      

      Relevant output from jstack -l on an affected NodeManager follows. All IPC threads are blocked waiting on the eventQueue lock.

      Thread printing event queue details - this runs indefinitely

      "Public Localizer" #62 prio=5 os_prio=0 tid=0x00007f488d948000 nid=0x1cee9 runnable [0x00007f4890571000]
         java.lang.Thread.State: RUNNABLE
      	at java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
      	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
      	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
      	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
      	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
      	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:252)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource$FetchSuccessTransition.transition(LocalizedResource.java:243)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
      	- locked <0x00007f4906f49230> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.handle(LocalizedResource.java:200)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:188)
      	- locked <0x00007f48f47a9658> (a org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.handle(LocalResourcesTrackerImpl.java:59)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:982)

         Locked ownable synchronizers:
      	- <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	- <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	- <0x00007f4909f25278> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      

      Sample IPC handler thread (8039 is our NM RPC port). All threads waiting on 0x00007f48f5a7a9a8

      "IPC Server handler 19 on default port 8039" #230 daemon prio=5 os_prio=0 tid=0x00007f488d8e2800 nid=0x1cede waiting on condition [0x00007f489107b000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00007f48f5a7a9a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
      	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
      	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:304)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.sendKillEvent(ContainerImpl.java:1030)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainerInternal(ContainerManagerImpl.java:1439)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainers(ContainerManagerImpl.java:1411)
      	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainers(ContainerManagementProtocolPBServiceImpl.java:115)
      	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:225)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915)

         Locked ownable synchronizers:
      	- None
      

       

      Single thread waiting on 0x00007f48f5a7a950

      "NM ContainerManager dispatcher" #243 prio=5 os_prio=0 tid=0x00007f488d145000 nid=0x1ceec waiting on condition [0x00007f489016f000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00007f48f5a7a950> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
      	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
      	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:125)
      	at java.lang.Thread.run(Thread.java:748)
      
         Locked ownable synchronizers:
      	- None
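
The lock relationships in the dumps above (which thread owns a synchronizer, and which threads are parked on it) can be tabulated mechanically. The following is a minimal sketch, assuming standard multi-line `jstack -l` output with one frame per line; the class and method names are hypothetical, and the parser only handles "parking to wait for" lines and the "Locked ownable synchronizers" section.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JstackLockReport {
    // A thread header line starts with the quoted thread name.
    private static final Pattern THREAD = Pattern.compile("^\\s*\"([^\"]+)\"");
    private static final Pattern WAITING =
            Pattern.compile("parking to wait for\\s+<(0x[0-9a-fA-F]+)>");
    // Entries under "Locked ownable synchronizers:" look like "- <0x...> (a ...)".
    private static final Pattern OWNED = Pattern.compile("^\\s*- <(0x[0-9a-fA-F]+)>");

    // Maps each lock address to the names of the threads parked on it.
    static Map<String, List<String>> waiters(List<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        String current = null;
        for (String line : lines) {
            Matcher t = THREAD.matcher(line);
            if (t.find()) {
                current = t.group(1);
                continue;
            }
            Matcher w = WAITING.matcher(line);
            if (current != null && w.find()) {
                result.computeIfAbsent(w.group(1), k -> new ArrayList<>()).add(current);
            }
        }
        return result;
    }

    // Maps each owned synchronizer address to the name of the owning thread.
    static Map<String, String> owners(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        String current = null;
        boolean inOwned = false;
        for (String line : lines) {
            Matcher t = THREAD.matcher(line);
            if (t.find()) {
                current = t.group(1);
                inOwned = false;
                continue;
            }
            if (line.contains("Locked ownable synchronizers:")) {
                inOwned = true;
                continue;
            }
            Matcher o = OWNED.matcher(line);
            if (inOwned && current != null && o.find()) {
                result.put(o.group(1), current);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JstackLockReport <jstack-output-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Map<String, String> owners = owners(lines);
        // Print each contended lock, its owner (if listed), and its waiters.
        waiters(lines).forEach((lock, threads) ->
                System.out.println(lock + " owned by " + owners.getOrDefault(lock, "unknown")
                        + ", waiters: " + threads));
    }
}
```

Run against the dumps in this report, a tool like this would be expected to list "Public Localizer" as the owner of both 0x00007f48f5a7a950 and 0x00007f48f5a7a9a8, with the dispatcher and the IPC handlers as waiters.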
      

            People

              zhuqi Qi Zhu
              jonbender-stripe Jon Bender