Apache Ozone / HDDS-8558

[SCM HA] NotLeaderExceptions after transferring SCM leadership to a new node.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Workaround
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SCM HA
    • Labels: None

    Description

      With SCM HA, if a new SCM node is added to the quorum and leadership is manually transferred to the new node, RPC calls to the SCM fail with ServerNotLeaderException. Client failover never resolves: the client keeps cycling through the original nodes and never fails over to the newly added node.

      Steps to reproduce:

      i.) Start an SCM HA cluster.

      ii.) Add a new SCM to the quorum.

      iii.) Manually transfer leadership to the newly added node.

      iv.) Make an RPC call to the SCM from a client.

      Transfer leadership successfully to 00bd9308-3467-4229-8587-3b4576834c72.
      bash-4.2$ ozone admin scm roles
      com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:f319b7a5-c4b5-48ec-bfef-ed61e6c2e082 is not the leader. Suggested leader is Server:scm4.org:9860.
          at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
          at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
          at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
          at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
          at java.base/java.security.AccessController.doPrivileged(Native Method)
          at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
      , while invoking $Proxy20.submitRequest over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860 after 3 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 3.
      com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:c8019701-b4ea-42f9-bff5-86087900efe3 is not the leader. Suggested leader is Server:scm4.org:9860.
          at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
          at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
          at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
          at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
          at java.base/java.security.AccessController.doPrivileged(Native Method)
          at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
      , while invoking $Proxy20.submitRequest over nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860 after 4 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 4.
      com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:10496103-c9cf-4275-8b08-c44e08fbc0a6 is not the leader. Suggested leader is Server:scm4.org:9860.
          at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
          at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
          at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
          at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
          at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
          at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
          at java.base/java.security.AccessController.doPrivileged(Native Method)
          at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
      , while invoking $Proxy20.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860 after 5 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 5.
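      The log above shows the client cycling through scm1, scm2 and scm3 while each of them points at scm4.org as the leader. A minimal sketch of that behaviour follows, assuming the client only tries the node IDs it was configured with. This is an assumption; the issue does not show the client code. The function find_leader and the list names are hypothetical and are not Ozone APIs.

```python
from typing import List, Optional


def find_leader(proxy_nodes: List[str], leader: str, max_attempts: int) -> Optional[str]:
    """Try the configured nodes in round-robin order.

    Returns the leader's node ID if it is reached within max_attempts,
    otherwise None. Hypothetical model, not the real Ozone failover provider.
    """
    for attempt in range(max_attempts):
        node = proxy_nodes[attempt % len(proxy_nodes)]
        if node == leader:
            return node
        # A follower replies "not the leader" and suggests the real leader,
        # but this simplified client can only move on to the next configured node.
        print(f"{node} is not the leader. Suggested leader is {leader}. "
              f"Failover attempt {attempt + 1}.")
    return None


# Configuration before the fix: the newly added node (scm4) is not listed.
stale_nodes = ["scm1", "scm2", "scm3"]
# Configuration after the fix: the new node ID has been appended.
updated_nodes = ["scm1", "scm2", "scm3", "scm4"]

print(find_leader(stale_nodes, "scm4", max_attempts=6))    # None: leader is never reached
print(find_leader(updated_nodes, "scm4", max_attempts=6))  # scm4
```

      With the stale list, no number of retries helps, because the leader is simply not among the candidates. This matches the retry count in the log growing without the call ever succeeding.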

      To resolve this and properly configure the SCM HA cluster, update ozone-site.xml on each SCM node with the following properties:

       

      <property>
        <name>ozone.scm.nodes.scmservice</name>
        <value>scm1,scm2,scm3,new_scm_node_id</value>
      </property>
      

       

      And:

       

      <property>
        <name>ozone.scm.address.scmservice.new_scm_node_id</name>
        <value>new_scm_host</value>
      </property>
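
      A quick way to check an ozone-site.xml for this kind of omission is sketched below. It assumes the address key has the form ozone.scm.address.<serviceId>.<nodeId>, that is, suffixed with the node ID listed in ozone.scm.nodes.<serviceId>. The helper names load_properties and missing_scm_addresses, the sample host names, and the node ID scm4 are illustrative and not part of Ozone.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def load_properties(xml_text: str) -> Dict[str, str]:
    """Parse Hadoop-style <property><name>/<value> pairs into a dict."""
    root = ET.fromstring(xml_text.strip())
    props: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def missing_scm_addresses(props: Dict[str, str], service_id: str) -> List[str]:
    """Return node IDs listed for the service that have no address property."""
    raw = props.get(f"ozone.scm.nodes.{service_id}", "")
    # Tolerate stray spaces around the commas, e.g. "scm3, scm4".
    node_ids = [n.strip() for n in raw.split(",") if n.strip()]
    return [n for n in node_ids
            if f"ozone.scm.address.{service_id}.{n}" not in props]


# Sample configuration: scm4 was added to the node list, but its address is missing.
SAMPLE = """
<configuration>
  <property>
    <name>ozone.scm.nodes.scmservice</name>
    <value>scm1,scm2,scm3, scm4</value>
  </property>
  <property>
    <name>ozone.scm.address.scmservice.scm1</name>
    <value>scm1.org</value>
  </property>
  <property>
    <name>ozone.scm.address.scmservice.scm2</name>
    <value>scm2.org</value>
  </property>
  <property>
    <name>ozone.scm.address.scmservice.scm3</name>
    <value>scm3.org</value>
  </property>
</configuration>
"""

print(missing_scm_addresses(load_properties(SAMPLE), "scmservice"))  # ['scm4']
```

      Running the same check against the ozone-site.xml of every SCM node would flag any node where only one of the two properties was updated.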

          People

            Assignee: Neil Joshi
            Reporter: Neil Joshi