Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
When we used the SCM benchmark tool (https://issues.apache.org/jira/browse/HDDS-5702) to test SCM throughput for AllocateContainer, we found a dramatic degradation in throughput when one SCM peer (a follower) is down.
Here are some statistics.
cmd:
./bin/ozone freon stb \
  --benchmark AllocateContainers \
  --scmHost 9.44.15.145 \
  --num-containers 500000 \
  --num-datanodes 50 \
  --num-threads 50
Normal case:
***************************************
Total allocated containers: 500000
Total failed containers: 0
Execution Time: 02:36:50,151
Throughput: 53.000000 (ops)
***************************************
One SCM follower down:
***************************************
Total allocated containers: 50000
Total failed containers: 0
Execution Time: 02:22:00,245
Throughput: 5.000000 (ops)
***************************************
The overall throughput drops to about 1/10 of the original.
We dug into this problem.
Two flame graphs, captured with the open-source tool arthas before and after the SCM follower went down, are attached.
They show that the GrpcLogAppender for the down follower consumes the most CPU, busy reading raft log segment files from disk.
At the same time, the metrics `ratis_log_worker_cacheMissCount` and `ratis_grpc_log_appender_num_retries` keep increasing.
So a down SCM follower leads to segmented raft log cache misses: the dead follower's next index keeps falling behind the entries retained in memory, so the GrpcLogAppender has to load the missing entries from the segment files on disk.
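The mechanism can be pictured with a small toy model (a hypothetical class, not the actual Ratis SegmentedRaftLogCache): the leader keeps only the newest entries in memory, so once a follower's next index falls behind that window, every fetch for it misses the cache and has to go back to the segment files.

import java.util.TreeMap;

// Toy model of why a lagging follower causes cache misses.
// RetainedLogCache is a hypothetical stand-in, not Ratis code.
public class RetainedLogCache {

  private final TreeMap<Long, String> recentEntries = new TreeMap<>();
  private final int retained;
  private long cacheMissCount;

  RetainedLogCache(int retained) {
    this.retained = retained;
  }

  // The leader keeps appending; only the newest `retained` entries stay in memory.
  void append(long index, String entry) {
    recentEntries.put(index, entry);
    while (recentEntries.size() > retained) {
      recentEntries.pollFirstEntry();   // older entries now exist only on disk
    }
  }

  // What an appender would do for a follower's next index.
  String get(long followerNextIndex) {
    String entry = recentEntries.get(followerNextIndex);
    if (entry == null) {
      cacheMissCount++;                 // analogous to ratis_log_worker_cacheMissCount
      entry = "loaded from segment file, index " + followerNextIndex;  // slow path
    }
    return entry;
  }

  public static void main(String[] args) {
    RetainedLogCache cache = new RetainedLogCache(1000);
    for (long i = 0; i < 10_000; i++) {
      cache.append(i, "entry-" + i);
    }
    // A healthy follower asks for a recent index: served from memory.
    System.out.println(cache.get(9_999));
    // A follower that has been down for a while is far behind: forced to disk.
    System.out.println(cache.get(5_000));
    System.out.println("cache misses: " + cache.cacheMissCount);
  }
}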
The problem, though, is that this disk read happens while an intrinsic lock is held:
// GrpcLogAppender.java
private void appendLog(boolean excludeLogEntries) throws IOException {
  final AppendEntriesRequestProto pending;
  final AppendEntriesRequest request;
  final StreamObserver<AppendEntriesRequestProto> s;
  synchronized (this) {                                               // <-- will block others
    // prepare and enqueue the append request. note changes on follower's
    // nextIndex and ops on pendingRequests should always be associated
    // together and protected by the lock
    pending = newAppendEntriesRequest(callId++, excludeLogEntries);   // <-- segment file read inside
    if (pending == null) {
      return;
    }
    request = new AppendEntriesRequest(pending, getFollowerId(), grpcServerMetrics);
    pendingRequests.put(request);
    increaseNextIndex(pending);
    if (appendLogRequestObserver == null) {
      appendLogRequestObserver = getClient().appendEntries(new AppendLogResponseHandler());
    }
    s = appendLogRequestObserver;
  }

  if (isRunning()) {
    sendRequest(request, pending, s);
  }
}
Unfortunately, when the RaftServer submits a client request, it notifies the GrpcLogAppender under the same lock:
// LogAppender.java
default void notifyLogAppender() {
  synchronized (this) {   // <-- blocked by others
    notify();
  }
}
In the end, the slow disk read blocks client request submission.
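The locking pattern can be reproduced in isolation with a minimal sketch (hypothetical class names, not the actual Ratis code): one thread performs a slow read inside synchronized (this), while another thread merely wants to notify() on the same monitor and is stuck for the whole duration of the read.

import java.util.concurrent.TimeUnit;

// Minimal illustration of the contention pattern described above.
// ToyAppender is a hypothetical stand-in for GrpcLogAppender, not Ratis code.
public class ToyAppender {

  // Mimics appendLog(): holds the intrinsic lock while doing slow disk I/O.
  void appendLog() {
    synchronized (this) {
      slowSegmentRead();          // stands in for loading a raft log segment from disk
    }
  }

  // Mimics notifyLogAppender(): a "cheap" notify that still needs the same
  // monitor, so it blocks until the disk read finishes.
  void notifyAppender() {
    synchronized (this) {
      notify();
    }
  }

  private void slowSegmentRead() {
    try {
      TimeUnit.SECONDS.sleep(5);  // simulated slow segment file read
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) throws Exception {
    final ToyAppender appender = new ToyAppender();
    new Thread(appender::appendLog, "appender").start();
    Thread.sleep(100);            // let the appender grab the lock first

    long start = System.nanoTime();
    appender.notifyAppender();    // the "client handler" side
    System.out.printf("notify blocked for %d ms%n",
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
  }
}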
The blocking can be observed on a live SCM with arthas as follows:
[arthas@85790]$ thread -b
"IPC Server handler 112 on default port 9860" Id=168 BLOCKED on org.apache.ratis.grpc.server.GrpcLogAppender@68e40b62 owned by "fc19acc0-d5cb-4e64-a1b3-a107fe38d6a7@group-F43758AD903B->d1fa1468-5d66-4c17-be19-493b372ee241-GrpcLogAppender-LogAppenderDaemon" Id=691
    at org.apache.ratis.server.leader.LogAppender.notifyLogAppender(LogAppender.java:133)
    -  blocked on org.apache.ratis.grpc.server.GrpcLogAppender@68e40b62
    at org.apache.ratis.server.impl.LeaderStateImpl$$Lambda$421/1164237000.accept(Unknown Source)
    at java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:891)
    at org.apache.ratis.server.impl.LeaderStateImpl$SenderList.forEach(LeaderStateImpl.java:203)
    at org.apache.ratis.server.impl.LeaderStateImpl.notifySenders(LeaderStateImpl.java:321)
    at org.apache.ratis.server.impl.RaftServerImpl.appendTransaction(RaftServerImpl.java:723)
    -  locked org.apache.ratis.server.impl.RaftServerImpl@4a0e7199
    at org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:800)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$9(RaftServerProxy.java:417)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$525/1313234000.apply(Unknown Source)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$null$7(RaftServerProxy.java:412)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$527/34265965.get(Unknown Source)
    at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:115)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitRequest$8(RaftServerProxy.java:412)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$526/1858964023.apply(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:981)
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2124)
    at org.apache.ratis.server.impl.RaftServerProxy.submitRequest(RaftServerProxy.java:411)
    at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:417)
    at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:212)
    at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
    at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
    at com.sun.proxy.$Proxy15.addContainer(Unknown Source)
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:229)
    at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:198)
    at org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer.allocateContainer(SCMClientProtocolServer.java:201)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:414)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.processRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:182)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB$$Lambda$582/601306741.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:169)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java:55800)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1024)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2002)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2993)

    Number of locked synchronizers = 1
    - java.util.concurrent.locks.ReentrantLock$NonfairSync@18fbf528    <---- but blocks 49 other threads!
In general, there are two problems here:
1. The LogAppender should not do heavy work (reading segment files from disk) while holding the lock, especially for a largely pointless job: we already know the follower is down and could avoid loading entries for it in the first place.
2. The client handler thread should not be blocked by the LogAppender; there should be a more lightweight way to notify it of new work, e.g. a classic producer-consumer pattern with a queue (see the sketch below).
According to the trace, this is a problem on the Ratis side.
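As a rough sketch of the second point only (hypothetical names, not a patch against Ratis): if the submitter hands work over through a semaphore/queue-style signal, it never needs the appender's monitor and is never exposed to the appender's disk I/O.

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a lighter-weight notification scheme: the submitter
// only releases a permit and never touches the appender's monitor, so a slow
// appender cannot block it.
public class EventDrivenAppender implements Runnable {

  private final Semaphore newWork = new Semaphore(0);
  private volatile boolean running = true;

  // Called by the client-request path; non-blocking regardless of what the
  // appender thread is currently doing.
  public void notifyNewEntries() {
    newWork.release();
  }

  public void stop() {
    running = false;
    newWork.release();            // wake the appender so it can exit
  }

  @Override
  public void run() {
    while (running) {
      try {
        // Wait for a signal or a heartbeat-style timeout, whichever comes first.
        newWork.tryAcquire(1, TimeUnit.SECONDS);
        newWork.drainPermits();   // coalesce multiple signals into one pass
        appendPendingEntries();   // heavy work happens only on this thread
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  private void appendPendingEntries() {
    // placeholder for building and sending AppendEntries requests
  }
}

A dedicated appender thread would run this loop, and the submit path would only call notifyNewEntries(). Judging by its title, the linked RATIS-1399 resolves the issue along these lines by notifying the log senders asynchronously.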
Attachments
Issue Links
is related to:
- RATIS-1399: should notify all the log senders Asynchronously (Resolved)