Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
When we used the SCM benchmark tool (https://issues.apache.org/jira/browse/HDDS-5702) to test SCM throughput for AllocateContainer, we found a dramatic degradation in throughput when one SCM peer (a follower) is down.
Here are some statistics.
cmd:
./bin/ozone freon stb \
  --benchmark AllocateContainers \
  --scmHost 9.44.15.145 \
  --num-containers 500000 \
  --num-datanodes 50 \
  --num-threads 50
Normal case:
***************************************
Total allocated containers: 500000
Total failed containers: 0
Execution Time: 02:36:50,151
Throughput: 53.000000 (ops)
***************************************
One SCM follower down:
***************************************
Total allocated containers: 50000
Total failed containers: 0
Execution Time: 02:22:00,245
Throughput: 5.000000 (ops)
***************************************
The overall throughput drops to about 1/10 of the original.
We dug into this problem.
Two flame graphs, captured with the open-source tool arthas before and after the SCM follower went down, are attached.
They show that the GrpcLogAppender for the down follower consumes the most CPU, busy reading raft log segment files from disk.
At the same time, the metrics `ratis_log_worker_cacheMissCount` and `ratis_grpc_log_appender_num_retries` keep increasing.
So a down SCM follower leads to segmented raft log cache misses: the dead follower's next index keeps falling behind the entries retained in memory, so the GrpcLogAppender has to load the missing entries from the segment files on disk.
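The mechanism can be pictured with a small toy model (a hypothetical class, not the actual Ratis SegmentedRaftLogCache): the leader keeps only the newest entries in memory, so once a follower's next index falls behind that window, every fetch for it misses the cache and has to go back to the segment files.

import java.util.TreeMap;

// Toy model of why a lagging follower causes cache misses.
// RetainedLogCache is a hypothetical stand-in, not Ratis code.
public class RetainedLogCache {

  private final TreeMap<Long, String> recentEntries = new TreeMap<>();
  private final int retained;
  private long cacheMissCount;

  RetainedLogCache(int retained) {
    this.retained = retained;
  }

  // The leader keeps appending; only the newest `retained` entries stay in memory.
  void append(long index, String entry) {
    recentEntries.put(index, entry);
    while (recentEntries.size() > retained) {
      recentEntries.pollFirstEntry();   // older entries now exist only on disk
    }
  }

  // What an appender would do for a follower's next index.
  String get(long followerNextIndex) {
    String entry = recentEntries.get(followerNextIndex);
    if (entry == null) {
      cacheMissCount++;                 // analogous to ratis_log_worker_cacheMissCount
      entry = "loaded from segment file, index " + followerNextIndex;  // slow path
    }
    return entry;
  }

  public static void main(String[] args) {
    RetainedLogCache cache = new RetainedLogCache(1000);
    for (long i = 0; i < 10_000; i++) {
      cache.append(i, "entry-" + i);
    }
    // A healthy follower asks for a recent index: served from memory.
    System.out.println(cache.get(9_999));
    // A follower that has been down for a while is far behind: forced to disk.
    System.out.println(cache.get(5_000));
    System.out.println("cache misses: " + cache.cacheMissCount);
  }
}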
The problem, though, is that this disk read happens while an intrinsic lock is held:
// GrpcLogAppender.java
private void appendLog(boolean excludeLogEntries) throws IOException {
  final AppendEntriesRequestProto pending;
  final AppendEntriesRequest request;
  final StreamObserver<AppendEntriesRequestProto> s;
  synchronized (this) {                                               // <-- will block others
    // prepare and enqueue the append request. note changes on follower's
    // nextIndex and ops on pendingRequests should always be associated
    // together and protected by the lock
    pending = newAppendEntriesRequest(callId++, excludeLogEntries);   // <-- segment file read inside
    if (pending == null) {
      return;
    }
    request = new AppendEntriesRequest(pending, getFollowerId(), grpcServerMetrics);
    pendingRequests.put(request);
    increaseNextIndex(pending);
    if (appendLogRequestObserver == null) {
      appendLogRequestObserver = getClient().appendEntries(new AppendLogResponseHandler());
    }
    s = appendLogRequestObserver;
  }

  if (isRunning()) {
    sendRequest(request, pending, s);
  }
}
Unfortunately, when the RaftServer submits a client request, it notifies the GrpcLogAppender under the same lock:
// LogAppender.java
default void notifyLogAppender() {
  synchronized (this) {   // <-- blocked by others
    notify();
  }
}
In the end, the slow disk read blocks client request submission.
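The locking pattern can be reproduced in isolation with a minimal sketch (hypothetical class names, not the actual Ratis code): one thread performs a slow read inside synchronized (this), while another thread merely wants to notify() on the same monitor and is stuck for the whole duration of the read.

import java.util.concurrent.TimeUnit;

// Minimal illustration of the contention pattern described above.
// ToyAppender is a hypothetical stand-in for GrpcLogAppender, not Ratis code.
public class ToyAppender {

  // Mimics appendLog(): holds the intrinsic lock while doing slow disk I/O.
  void appendLog() {
    synchronized (this) {
      slowSegmentRead();          // stands in for loading a raft log segment from disk
    }
  }

  // Mimics notifyLogAppender(): a "cheap" notify that still needs the same
  // monitor, so it blocks until the disk read finishes.
  void notifyAppender() {
    synchronized (this) {
      notify();
    }
  }

  private void slowSegmentRead() {
    try {
      TimeUnit.SECONDS.sleep(5);  // simulated slow segment file read
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) throws Exception {
    final ToyAppender appender = new ToyAppender();
    new Thread(appender::appendLog, "appender").start();
    Thread.sleep(100);            // let the appender grab the lock first

    long start = System.nanoTime();
    appender.notifyAppender();    // the "client handler" side
    System.out.printf("notify blocked for %d ms%n",
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
  }
}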
The blocking can be observed on a live SCM with arthas as follows:
[arthas@85790]$ thread -b
"IPC Server handler 112 on default port 9860" Id=168 BLOCKED on org.apache.ratis.grpc.server.GrpcLogAppender@68e40b62 owned by "fc19acc0-d5cb-4e64-a1b3-a107fe38d6a7@group-F43758AD903B->d1fa1468-5d66-4c17-be19-493b372ee241-GrpcLogAppender-LogAppenderDaemon" Id=691
    at org.apache.ratis.server.leader.LogAppender.notifyLogAppender(LogAppender.java:133)
    -  blocked on org.apache.ratis.grpc.server.GrpcLogAppender@68e40b62
    at org.apache.ratis.server.impl.LeaderStateImpl$$Lambda$421/1164237000.accept(Unknown Source)
    at java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:891)
    at org.apache.ratis.server.impl.LeaderStateImpl$SenderList.forEach(LeaderStateImpl.java:203)
    at org.apache.ratis.server.impl.LeaderStateImpl.notifySenders(LeaderStateImpl.java:321)
    at org.apache.ratis.server.impl.RaftServerImpl.appendTransaction(RaftServerImpl.java:723)
    -  locked org.apache.ratis.server.impl.RaftServerImpl@4a0e7199
    at org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:800)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$9(RaftServerProxy.java:417)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$525/1313234000.apply(Unknown Source)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$null$7(RaftServerProxy.java:412)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$527/34265965.get(Unknown Source)
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:115)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitRequest$8(RaftServerProxy.java:412)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$526/1858964023.apply(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
    at org.apache.ratis.server.impl.RaftServerProxy.submitRequest(RaftServerProxy.java:411)
    at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:417)
    at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:212)
    at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
    at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
    at com.sun.proxy.$Proxy15.addContainer(Unknown Source)
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:229)
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:198)
    at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.allocateContainer(SCMClientProtocolServer.java:201)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:414)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:182)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB$$Lambda$582/601306741.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:169)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java:55800)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1024)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2002)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2993)

    Number of locked synchronizers = 1
    - java.util.concurrent.locks.ReentrantLock$NonfairSync@18fbf528    <---- but blocks 49 other threads!
In general, there are two problems here:
1. The LogAppender should not do heavy work (reading segment files from disk) while holding the lock, especially for a largely pointless job: we already know the follower is down and could avoid loading entries for it in the first place.
2. The client handler thread should not be blocked by the LogAppender; there should be a more lightweight way to notify it of new work, e.g. a classic producer-consumer pattern with a queue (see the sketch below).
According to the trace, this is a problem on the Ratis side.
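As a rough sketch of the second point only (hypothetical names, not a patch against Ratis): if the submitter hands work over through a semaphore/queue-style signal, it never needs the appender's monitor and is never exposed to the appender's disk I/O.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a lighter-weight notification scheme: the submitter
// only releases a permit and never touches the appender's monitor, so a slow
// appender cannot block it.
public class EventDrivenAppender implements Runnable {

  private final Semaphore newWork = new Semaphore(0);
  private volatile boolean running = true;

  // Called by the client-request path; non-blocking regardless of what the
  // appender thread is currently doing.
  public void notifyNewEntries() {
    newWork.release();
  }

  public void stop() {
    running = false;
    newWork.release();            // wake the appender so it can exit
  }

  @Override
  public void run() {
    while (running) {
      try {
        // Wait for a signal or a heartbeat-style timeout, whichever comes first.
        newWork.tryAcquire(1, TimeUnit.SECONDS);
        newWork.drainPermits();   // coalesce multiple signals into one pass
        appendPendingEntries();   // heavy work happens only on this thread
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  private void appendPendingEntries() {
    // placeholder for building and sending AppendEntries requests
  }
}

A dedicated appender thread would run this loop, and the submit path would only call notifyNewEntries(). Judging by its title, the linked RATIS-1399 resolves the issue along these lines by notifying the log senders asynchronously.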
Attachments
Issue Links
is related to:
- RATIS-1399: should notify all the log senders Asynchronously (Resolved)