Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-9061

Possible dead-lock in SCM initialization due to certificate operations.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 1.4.0
    • SCM

    Description

      Current config set under safety valve for ca cert rotation :

      <property><name>hdds.x509.max.duration</name><value>P5D</value></property>
      <property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
      <property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
      <property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
      <property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
      <property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
      <property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
      

      based on above config the certificates should renew at 10th minutes from start by the grace period coming into play.

      Observation :
      Certificates were renewed for a few roles or instances at random.

      Primordial SCM was stopped and started at the below times :
      2023-07-21 15:36:03,703 : STOPPED
      2023-07-21 16:00:25,736 : STARTED

      [root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
      66603508550073199 Fri Jul 21 15:02:32 UTC 2023   Fri Jul 21 16:02:32 UTC 2023   CN=dn@quasar-kkgvze-6.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603509087676398 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   CN=dn@quasar-kkgvze-5.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603509333666132 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   CN=dn@quasar-kkgvze-7.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603509791941467 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   CN=dn@quasar-kkgvze-2.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603510052147817 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   CN=dn@quasar-kkgvze-3.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603510180218825 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   CN=dn@quasar-kkgvze-9.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603510614053924 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   CN=dn@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66603512588421508 Fri Jul 21 15:02:36 UTC 2023   Fri Jul 21 16:02:36 UTC 2023   CN=dn@quasar-kkgvze-8.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66604108035744075 Fri Jul 21 15:12:32 UTC 2023   Fri Jul 21 16:12:32 UTC 2023   CN=dn@quasar-kkgvze-6.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66604109383514945 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   CN=dn@quasar-kkgvze-5.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66604109450607224 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   CN=dn@quasar-kkgvze-7.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66604109820303351 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   CN=dn@quasar-kkgvze-2.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      66604110216800118 Fri Jul 21 15:12:34 UTC 2023   Fri Jul 21 16:12:34 UTC 2023   CN=dn@quasar-kkgvze-9.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509 CN=scm-sub-66603487342057283@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
      

      UPDATE with Root cause:
      After checking first, in the thread dump of the leader SCM, we saw these threads that are interesting:

      "main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable  [0x00007f516b4e9000]
         java.lang.Thread.State: RUNNABLE
              at sun.nio.ch.EPoll.wait(java.base@11.0.3/Native Method)
              at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.3/EPollSelectorImpl.java:120)
              at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.3/SelectorImpl.java:124)
              - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
              - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
              at sun.nio.ch.SelectorImpl.select(java.base@11.0.3/SelectorImpl.java:136)
              at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
              at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
              at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
              at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
              - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection)
              at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
              - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection)
              at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
              at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
              at org.apache.hadoop.ipc.Client.call(Client.java:1506)
              at org.apache.hadoop.ipc.Client.call(Client.java:1459)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
              at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
              at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
              at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.3/DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(java.base@11.0.3/Method.java:566)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
              at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
              - locked <0x000000072d460c08> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
              at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
              at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
              at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
              at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
              - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
              at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
              - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
              at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
              at picocli.CommandLine.access$1300(CommandLine.java:145)
              at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
              at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
              at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
              at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
              at picocli.CommandLine.execute(CommandLine.java:2078)
              at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
              at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
              at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
      
      "IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms elapsed=206369.56s allocated=11121K defined_classes=216 tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry  [0x00007f511fba8000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
              - waiting to lock <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
              at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
              - locked <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
              at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
              at java.security.AccessController.doPrivileged(java.base@11.0.3/Native Method)
              at javax.security.auth.Subject.doAs(java.base@11.0.3/Subject.java:423)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
      
      "IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 nid=0x2082b6 waiting for monitor entry  [0x00007f511faa7000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
              - waiting to lock <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
              at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
              at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
              at java.security.AccessController.doPrivileged(java.base@11.0.3/Native Method)
              at javax.security.auth.Subject.doAs(java.base@11.0.3/Subject.java:423)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
      

      Looking at it first these are not trivially causing a dead-lock, but further analysis of the environment shows that:

      • main thread has locked on SCMCertificateClient instance as part of the listCA call. Lock object id: 0x000000072a05b650
      • The server handler #0 for port 9961 waits on locking on SCMCertificateClient as part of the getRootCACertificate while it has already locked SCMSecurityProtocolServer in setRootCAIfNeeded, lock object ID is 0x000000072cf80778
      • The server handler #1 for port 9961 waits on locking on SCMSecurityProtocolServer in the same setRootCAIfNeeded, lock object ID is 0x000000072cf80778

      with that we have depleted the handler threads for the SCMSecurityProtocolServer (by default we have 2), so main thread can not submit the request, and as the retry policy is infinite retry, it waits for some time in between retries, and that is the EPoll wait on which the main thread is waiting in RUNNABLE state, while the other two threads are BLOCKED.

      This situation in our test environment came up during a cluster restart after some SCM node decommission then a new SCM node added, and this restart was taking time, during which service certificates entered the grace period that marks their end of life, and restarting services started to renew their certificates while the leader SCM started up, so with that within the leader SCM, we got into the detailed state. This pretty much blocked the whole startup, certificates could not be renewed, and with that the cluster became unusable.

      Attachments

        1. quasar-kkgvze-6-ozone-scm.log
          101 kB
          Soumitra Sulav
        2. quasar-kkgvze-1-datanode.log
          6.60 MB
          Soumitra Sulav
        3. quasar-kkgvze-2-datanode.log
          6.27 MB
          Soumitra Sulav
        4. quasar-kkgvze-2-ozone-scm.log
          11.72 MB
          Soumitra Sulav
        5. quasar-kkgvze-1-ozone-scm.log
          9.02 MB
          Soumitra Sulav
        6. quasar-kkgvze-2-scmjstack.log
          337 kB
          Sammi Chen

        Issue Links

          Activity

            People

              pifta István Fajth
              ssulav Soumitra Sulav
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: