[HDDS-8558] [SCM HA] NotLeaderExceptions after SCM transfer leader to new node. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Workaround
Affects Version/s: None
Fix Version/s: None
Component/s: SCM HA
Labels:
None

Description

With SCMHA, if a new SCM node is added to the quorum and leadership is manually transferred to the new node, we get NotALeaderExceptions with RPC calls to the SCM. Failover never resolves and never failover to newly added node.

Reproducible,

i.) start SCMHA cluster

ii.) add new SCM to quorum

iii.) manually transfer leader to newly added node

iv.) perform RPC call to SCM from client

Transfer leadership successfully to 00bd9308-3467-4229-8587-3b4576834c72.
bash-4.2$ ozone admin scm roles
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:f319b7a5-c4b5-48ec-bfef-ed61e6c2e082 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860 after 3 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 3.
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:c8019701-b4ea-42f9-bff5-86087900efe3 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860 after 4 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 4.
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:10496103-c9cf-4275-8b08-c44e08fbc0a6 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860 after 5 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 5.

To resolve this and properly configure the SCM HA cluster need to update each SCM node ozone-site.xml with the following properties:

<property>
<name>ozone.scm.nodes.scmservice</name>
<value>scm1,scm2,scm3, new_scm_node_id</value>
</property>

And,

<property>
<name>ozone.scm.address.scmservice.new_scm_host</name>
<value>new_scm_host</value>
</property>

[SCM HA] NotLeaderExceptions after SCM transfer leader to new node.

Details

Description

Attachments

Activity

People

Dates