  1. Apache Ozone
  2. HDDS-2823 SCM HA Support
  3. HDDS-5078

[SCM HA Security] NPE during secure SCM initialization with HA code updated to an already existing cluster



    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: SCM HA, Security
    • Labels:


      On a Cloudera Manager managed cluster, SCM is always started with the --init option specified, and this behaviour revealed the following null pointer dereference:
      StorageContainerManager#initializeCertificateClient initializes the scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized() evaluates to true, which happens only if the VERSION file contains primaryScmNodeId with a value.

      If you upgrade an existing cluster with a single SCM to this code, the VERSION file does not contain a primaryScmNodeId, so the scmCertificateClient remains null.
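
      To illustrate the check described above, here is a minimal sketch. It assumes the VERSION file is a Java properties file; the class name, the method name, and every key other than primaryScmNodeId are hypothetical and not taken from the Ozone code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class VersionFileCheck {
  // Hypothetical stand-in for scmStorageConfig#checkPrimarySCMIdInitialized():
  // true only if primaryScmNodeId is present with a non-empty value.
  static boolean primaryScmIdInitialized(String versionFileContents)
      throws IOException {
    Properties props = new Properties();
    props.load(new StringReader(versionFileContents));
    String id = props.getProperty("primaryScmNodeId");
    return id != null && !id.isEmpty();
  }

  public static void main(String[] args) throws IOException {
    // Illustrative contents only; clusterID and scmUuid are assumed keys.
    String preHa = "clusterID=CID-example\nscmUuid=scm-example\n";
    String withHa = preHa + "primaryScmNodeId=scm-example\n";

    System.out.println(primaryScmIdInitialized(preHa));   // prints "false"
    System.out.println(primaryScmIdInitialized(withHa));  // prints "true"
  }
}
```

      With a VERSION file written before the HA code was introduced, the check returns false, which is the case that leaves scmCertificateClient unset.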

      Later the initialization code calls the StorageContainerManager#initializeCAnSecurityProtocol method, which at the end creates the securityProtocolServer. For the constructor call, the rootCACert is obtained by calling scmCertificateClient#getCACertificate, but this is a null dereference because scmCertificateClient is null.

      The scmCertificateClient being null can cause problems later as well, as it is used multiple times unconditionally.
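
      The failure pattern described above (conditional initialization followed by unconditional use) can be sketched in isolation as follows. This is not the Ozone code and not the fix that was applied. CertClient, the field, and the method names are hypothetical stand-ins, and the Optional-based guard is only one possible way to make the missing client explicit.

```java
import java.util.Optional;

class CertClientInitSketch {
  /** Hypothetical stand-in for the SCM certificate client. */
  static final class CertClient {
    String getCACertificate() {
      return "root-ca-cert";
    }
  }

  // Stays null unless the initialization condition holds.
  private CertClient certClient;

  // Mirrors the conditional initialization described in the report.
  void initializeCertificateClient(boolean primaryScmIdInitialized) {
    if (primaryScmIdInitialized) {
      certClient = new CertClient();
    }
  }

  // Unconditional use: throws NullPointerException when certClient is null.
  String rootCaCertUnchecked() {
    return certClient.getCACertificate();
  }

  // Guarded use: an absent client yields an empty Optional instead of an NPE.
  Optional<String> rootCaCertChecked() {
    return Optional.ofNullable(certClient).map(CertClient::getCACertificate);
  }

  public static void main(String[] args) {
    CertClientInitSketch scm = new CertClientInitSketch();
    scm.initializeCertificateClient(false); // upgraded cluster, no primaryScmNodeId

    System.out.println(scm.rootCaCertChecked().isPresent()); // prints "false"
    try {
      scm.rootCaCertUnchecked();
    } catch (NullPointerException e) {
      System.out.println("NPE on unconditional use");
    }
  }
}
```

      The guard only turns the crash into an explicit "no certificate" result. Callers still need a properly initialized client, which is why the uninitialized state itself has to be addressed.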

      Later on, after working around this particular problem (by simply letting the code create the scmCertificateClient unconditionally), it turned out that in the StorageContainerManager#initializeCAnSecurityProtocol call the scmCertificateServer and the rootCertificateServer instances also remain uninitialized, which causes problems when an SCM client tries to get the root CA certificate from the SCM.
      For me this suggests that initialization of SCM fails after an upgrade on an old cluster. This was working fine before: --init did not reinitialize anything, but worked fine.

      If I change Cloudera Manager's behaviour so that it does not init the SCM on start, I still get the same NPE from the SCM as with --init.
      The exception I get in the SCM log is as follows; the command I issued was a recommission of a DN that had been decommissioned before the upgrade.

      	at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMGetCertResponseProto$Builder.setX509RootCACertificate(SCMSecurityProtocolProtos.java:9026)
      	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getCACertificate(SCMSecurityProtocolServerSideTranslatorPB.java:257)
      	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:104)
      	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
      	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:89)
      	at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:10537)
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:986)
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:914)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2887)



              • Assignee:
                bharat Bharat Viswanadham
                pifta István Fajth


                • Created: