Details
Description
In our cluster, the ResourceManager got stuck twice within twenty days, and the YARN client could not submit applications. I captured jstack output the second time and found the root cause.
Analyzing all the jstack output, I found many threads stuck because they could not acquire LinkedBlockingQueue.putLock. (Note: due to limited space, the full analysis is omitted.)
The root cause is that one thread holds the putLock indefinitely: printEventQueueDetails calls forEachRemaining, which acquires both the putLock and the takeLock, so the AsyncDispatcher gets stuck.
{noformat}
Thread 6526 (IPC Server handler 454 on default port 8030):
  State: RUNNABLE
  Blocked count: 29988
  Waited count: 2035029
  Stack:
    java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
    java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
    org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
    org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
    org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
    org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
    org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
    org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
    org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
    org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
    org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
    java.security.AccessController.doPrivileged(Native Method)
{noformat}
I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in LinkedBlockingQueue.LBQSpliterator can get stuck when forEachRemaining and take() are called from different threads.
YARN-8995 introduced the printEventQueueDetails method; its "eventQueue.stream().collect(...)" call ends up invoking forEachRemaining.
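For reference, here is a minimal, self-contained sketch of that shape (plain strings stand in for YARN event types; the class and names are illustrative, not the AsyncDispatcher source): a terminal collect() on the queue's stream is what drives LBQSpliterator.forEachRemaining().
{code:java}
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

public class StreamOverQueue {
    public static void main(String[] args) {
        LinkedBlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
        eventQueue.add("APP_ADDED");
        eventQueue.add("APP_ADDED");
        eventQueue.add("NODE_UPDATE");
        // Same shape as printEventQueueDetails: collect() on the queue's
        // stream traverses the LIVE queue via LBQSpliterator.forEachRemaining,
        // while other threads may concurrently call take().
        Map<String, Long> counts = eventQueue.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.counting()));
        System.out.println(counts); // e.g. {APP_ADDED=2, NODE_UPDATE=1} (map order unspecified)
    }
}
{code}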
Let's see why. "put.png" shows how put("a") works, and "take.png" shows how take() works. Note the special Node handling: the removed Node is made to point to itself to help GC!
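The self-linking happens in LinkedBlockingQueue.dequeue(), which take() calls while holding the takeLock; the snippet below is paraphrased from the JDK 8 source:
{code:java}
// Paraphrased from JDK 8 java.util.concurrent.LinkedBlockingQueue.
private E dequeue() {
    Node<E> h = head;        // old dummy head node
    Node<E> first = h.next;  // node whose item will be returned
    h.next = h;              // help GC: the removed node now points to ITSELF
    head = first;            // 'first' becomes the new dummy head
    E x = first.item;
    first.item = null;       // the new dummy head carries no item
    return x;
}
{code}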
The key code is in forEachRemaining: LBQSpliterator visits every Node, but after reading an item value from a Node it releases the lock before invoking the consumer. If take() runs in that window, the variable 'p' in forEachRemaining may end up pointing at a Node that points to itself, and forEachRemaining then spins in a dead loop. You can see this in "deadloop.png".
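For reference, the JDK 8 implementation of LBQSpliterator.forEachRemaining, paraphrased with the hazard annotated in comments:
{code:java}
// Paraphrased from JDK 8 LinkedBlockingQueue.LBQSpliterator (comments added).
public void forEachRemaining(Consumer<? super E> action) {
    if (action == null) throw new NullPointerException();
    final LinkedBlockingQueue<E> q = this.queue;
    if (!exhausted) {
        exhausted = true;
        Node<E> p = current;
        do {
            E e = null;
            q.fullyLock();           // acquires BOTH putLock and takeLock
            try {
                if (p == null)
                    p = q.head.next;
                while (p != null) {  // if p is a removed, self-linked node:
                    e = p.item;      //   p.item is null, and
                    p = p.next;      //   p.next == p, so this loop spins
                    if (e != null)   //   forever while holding both locks
                        break;
                }
            } finally {
                q.fullyUnlock();     // lock dropped before accept(); take() can
            }                        // dequeue (and self-link) p in this window
            if (e != null)
                action.accept(e);
        } while (p != null);
    }
}
{code}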
Here is a simple unit test: make forEachRemaining run more slowly than take(), and the problem reproduces. The unit test is MockForDeadLoop.java.
Debugging MockForDeadLoop.java, I can see a Node pointing to itself; see "debugfornode.png".
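The attachment itself is not inlined here; a minimal sketch of the same idea (class name, timings, and sizes are illustrative) looks like this. A slow consumer widens the unlock window so take() can self-link the node the spliterator still references:
{code:java}
import java.util.Spliterator;
import java.util.concurrent.LinkedBlockingQueue;

public class MockForDeadLoop {
    public static void main(String[] args) throws Exception {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1000; i++) {
            queue.put(i);
        }
        // Traverse with a slow consumer so forEachRemaining runs more slowly
        // than take(): every unlock/accept/relock cycle is a window in which
        // the taker can dequeue (and self-link) the node 'p' still references.
        Thread traverser = new Thread(() -> {
            Spliterator<Integer> s = queue.spliterator();
            s.forEachRemaining(e -> {
                try {
                    Thread.sleep(1);
                } catch (InterruptedException ignored) {
                }
            });
            System.out.println("traversal finished"); // unreachable once the bug hits
        });
        // Drain the queue concurrently with take().
        Thread taker = new Thread(() -> {
            try {
                while (true) {
                    queue.take();
                }
            } catch (InterruptedException ignored) {
            }
        });
        traverser.setDaemon(true);
        taker.setDaemon(true);
        traverser.start();
        taker.start();
        traverser.join(30_000);
        // On an affected JDK the traverser is typically still RUNNABLE here,
        // spinning inside forEachRemaining while holding both queue locks.
        System.out.println("traverser alive after 30s: " + traverser.isAlive());
    }
}
{code}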
Environment:
OS: CentOS Linux release 7.5.1804 (Core)
JDK: jdk1.8.0_281
Attachments
Issue Links
- is duplicated by: YARN-10643 Fix the race condition introduced by YARN-8995. (Resolved)
- relates to: YARN-10643 Fix the race condition introduced by YARN-8995. (Resolved)