Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Description
The following configuration was set under the safety valve for CA certificate rotation:
<property><name>hdds.x509.max.duration</name><value>P5D</value></property>
<property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
<property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
<property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
<property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
Based on the above configuration, the certificates should be renewed at the 10th minute after issuance, when the 50-minute grace period of the 1-hour certificate lifetime comes into play.
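As a sanity check of that arithmetic, the renewal point implied by the two ISO-8601 durations can be computed with java.time (a minimal sketch; the class and method names here are ours for illustration, not Ozone's):

```java
import java.time.Duration;

public class RenewalMath {
    // Renewal is due once the remaining lifetime drops below the grace
    // duration, i.e. at (certificate lifetime - grace duration) after issuance.
    public static Duration renewAfter(String lifetime, String grace) {
        return Duration.parse(lifetime).minus(Duration.parse(grace));
    }

    public static void main(String[] args) {
        // hdds.x509.default.duration = PT1H, hdds.x509.renew.grace.duration = PT50M
        System.out.println(renewAfter("PT1H", "PT50M").toMinutes() + " minutes"); // 10 minutes
    }
}
```

With the CA maximum duration of P5D the same arithmetic applies, just on the day scale.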
Observation:
Certificates were renewed only for some roles or instances, seemingly at random.
Primordial SCM was stopped and started at the below times :
2023-07-21 15:36:03,703 : STOPPED
2023-07-21 16:00:25,736 : STARTED
[root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
66603508550073199 Fri Jul 21 15:02:32 UTC 2023 Fri Jul 21 16:02:32 UTC 2023 CN=dn@quasar-kkgvze-6.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509087676398 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023 CN=dn@quasar-kkgvze-5.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509333666132 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023 CN=dn@quasar-kkgvze-7.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509791941467 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023 CN=dn@quasar-kkgvze-2.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510052147817 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023 CN=dn@quasar-kkgvze-3.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510180218825 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023 CN=dn@quasar-kkgvze-9.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510614053924 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023 CN=dn@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603512588421508 Fri Jul 21 15:02:36 UTC 2023 Fri Jul 21 16:02:36 UTC 2023 CN=dn@quasar-kkgvze-8.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604108035744075 Fri Jul 21 15:12:32 UTC 2023 Fri Jul 21 16:12:32 UTC 2023 CN=dn@quasar-kkgvze-6.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109383514945 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023 CN=dn@quasar-kkgvze-5.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109450607224 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023 CN=dn@quasar-kkgvze-7.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109820303351 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023 CN=dn@quasar-kkgvze-2.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604110216800118 Fri Jul 21 15:12:34 UTC 2023 Fri Jul 21 16:12:34 UTC 2023 CN=dn@quasar-kkgvze-9.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
UPDATE with Root cause:
In the thread dump of the leader SCM, the following threads stood out:
"main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable [0x00007f516b4e9000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPoll.wait(java.base@11.0.3/Native Method) at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.3/EPollSelectorImpl.java:120) at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.3/SelectorImpl.java:124) - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2) - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(java.base@11.0.3/SelectorImpl.java:136) at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730) - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843) - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection) at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681) at org.apache.hadoop.ipc.Client.call(Client.java:1506) at org.apache.hadoop.ipc.Client.call(Client.java:1459) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy14.submitRequest(Unknown Source) at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.3/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@11.0.3/Method.java:566) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431) at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) - locked <0x000000072d460c08> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) at com.sun.proxy.$Proxy14.submitRequest(Unknown Source) at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102) at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374) at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959) - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient) at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951) - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient) at org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557) at org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515) at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172) at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145) at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74) at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48) at picocli.CommandLine.executeUserObject(CommandLine.java:1953) at picocli.CommandLine.access$1300(CommandLine.java:145) at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352) at picocli.CommandLine$RunLast.handle(CommandLine.java:2346) at picocli.CommandLine$RunLast.handle(CommandLine.java:2311) at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) at picocli.CommandLine.execute(CommandLine.java:2078) at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100) at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91) at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
"IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms elapsed=206369.56s allocated=11121K defined_classes=216 tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry [0x00007f511fba8000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370) - waiting to lock <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient) at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491) - locked <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source) at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93) at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994) at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922) at java.security.AccessController.doPrivileged(java.base@11.0.3/Native Method) at javax.security.auth.Subject.doAs(java.base@11.0.3/Subject.java:423) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
"IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 nid=0x2082b6 waiting for monitor entry [0x00007f511faa7000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479) - waiting to lock <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source) at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87) at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93) at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922) at java.security.AccessController.doPrivileged(java.base@11.0.3/Native Method) at javax.security.auth.Subject.doAs(java.base@11.0.3/Subject.java:423) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
At first glance these threads do not form a trivial deadlock, but further analysis of the environment shows that:
- The main thread holds the lock on the SCMCertificateClient instance as part of the listCA call (lock object ID: 0x000000072a05b650).
- IPC server handler #0 on port 9961 waits to lock the SCMCertificateClient in getRootCACertificate, while it already holds the lock on the SCMSecurityProtocolServer acquired in setRootCAIfNeeded (lock object ID: 0x000000072cf80778).
- IPC server handler #1 on port 9961 waits to lock the SCMSecurityProtocolServer in the same setRootCAIfNeeded path (lock object ID: 0x000000072cf80778).
With that, the handler threads of the SCMSecurityProtocolServer are depleted (there are 2 by default), so the main thread's request can never be served. Because the retry policy retries indefinitely with a wait between attempts, the main thread sits in the EPoll wait shown above in RUNNABLE state, while the other two threads are BLOCKED.
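The starvation pattern can be sketched in isolation (a hypothetical, simplified model: ReentrantLock instances and a 2-thread pool stand in for the real monitors and the IPC handler pool; all names are illustrative, not Ozone code):

```java
import java.util.concurrent.*;
import java.util.concurrent.locks.ReentrantLock;

public class HandlerStarvationSketch {
    // Stand-ins for the two monitors seen in the thread dump.
    static final ReentrantLock certClientLock = new ReentrantLock();      // SCMCertificateClient
    static final ReentrantLock securityServerLock = new ReentrantLock();  // SCMSecurityProtocolServer

    public static boolean demonstrateStarvation() throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(2); // default handler count
        CountDownLatch handler0HoldsServerLock = new CountDownLatch(1);
        certClientLock.lock(); // "main" holds the cert client monitor during listCA
        try {
            // Handler 0: takes the server monitor, then blocks on the cert client monitor.
            handlers.submit(() -> {
                securityServerLock.lock();
                handler0HoldsServerLock.countDown();
                try {
                    certClientLock.lock();   // blocks: main holds it
                    certClientLock.unlock();
                } finally {
                    securityServerLock.unlock();
                }
            });
            handler0HoldsServerLock.await();
            // Handler 1: blocks on the server monitor held by handler 0.
            handlers.submit(() -> {
                securityServerLock.lock();
                securityServerLock.unlock();
            });
            // The listCA RPC issued by "main" now needs a free handler, but both are blocked.
            Future<?> listCaRpc = handlers.submit(() -> { /* would serve the request */ });
            try {
                listCaRpc.get(500, TimeUnit.MILLISECONDS);
                return false; // the request was served: no starvation
            } catch (TimeoutException e) {
                return true;  // the request starved, as in the reported hang
            }
        } finally {
            certClientLock.unlock(); // releasing the lock lets everything drain
            handlers.shutdown();
            handlers.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("request starved = " + demonstrateStarvation());
    }
}
```

In the real cluster the main thread never releases its lock voluntarily, because releasing it depends on the very RPC that cannot be served, which is what turns this starvation into a permanent hang.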
In our test environment this situation came up during a cluster restart, after one SCM node had been decommissioned and a new SCM node added. The restart took long enough for the service certificates to enter the grace period that marks their end of life, so the restarting services started to renew their certificates while the leader SCM was still starting up, and within the leader SCM we got into the state detailed above. This effectively blocked the whole startup, certificates could not be renewed, and the cluster became unusable.