Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Workaround
Description
With SCM HA, if a new SCM node is added to the quorum and leadership is manually transferred to the new node, RPC calls to the SCM fail with ServerNotLeaderException. Client failover never resolves, because the client never fails over to the newly added node.
Steps to reproduce:
i.) Start an SCM HA cluster.
ii.) Add a new SCM to the quorum.
iii.) Manually transfer leadership to the newly added node.
iv.) Perform an RPC call to the SCM from a client.
Transfer leadership successfully to 00bd9308-3467-4229-8587-3b4576834c72.
bash-4.2$ ozone admin scm roles
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:f319b7a5-c4b5-48ec-bfef-ed61e6c2e082 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9860 after 3 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 3.
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:c8019701-b4ea-42f9-bff5-86087900efe3 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm2,nodeAddress=scm2.org/172.25.0.117:9860 after 4 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 4.
com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdds.ratis.ServerNotLeaderException): Server:10496103-c9cf-4275-8b08-c44e08fbc0a6 is not the leader. Suggested leader is Server:scm4.org:9860.
    at org.apache.hadoop.hdds.ratis.ServerNotLeaderException.convertToNotLeaderException(ServerNotLeaderException.java:106)
    at org.apache.hadoop.hdds.scm.ha.RatisUtil.checkRatisException(RatisUtil.java:246)
    at org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocolServerSideTranslatorPB.submitRequest(StorageContainerLocationProtocolServerSideTranslatorPB.java:199)
    at org.apache.hadoop.hdds.protocol.proto.StorageContainerLocationProtocolProtos$StorageContainerLocationProtocolService$2.callBlockingMethod(StorageContainerLocationProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
, while invoking $Proxy20.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9860 after 5 failover attempts. Trying to failover after sleeping for 2000ms. Current retry count: 5.
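The log shows the client cycling through scm1, scm2 and scm3 while every one of them points at scm4.org:9860. A minimal sketch of that behaviour follows, assuming the client only rotates through the node IDs in its own configuration and ignores a suggested leader it has no entry for. The function name `call_with_failover` and the round-robin policy are illustrative assumptions, not the actual Ozone failover proxy provider.

```python
def call_with_failover(configured_nodes, leader, max_attempts):
    """Simulate a client that retries round-robin over its configured SCM nodes.

    Returns (node_that_answered, attempts_used), or (None, max_attempts)
    when no configured node is the leader.
    """
    for attempt in range(1, max_attempts + 1):
        node = configured_nodes[(attempt - 1) % len(configured_nodes)]
        if node == leader:
            # The leader accepts the request.
            return node, attempt
        # A follower rejects the call with "not the leader"; the suggested
        # leader is unusable if it is missing from the client's configuration.
    return None, max_attempts


if __name__ == "__main__":
    # Stale configuration: scm4 leads the quorum but the client does not know it.
    print(call_with_failover(["scm1", "scm2", "scm3"], "scm4", 15))
    # Updated configuration: the fourth attempt reaches the new leader.
    print(call_with_failover(["scm1", "scm2", "scm3", "scm4"], "scm4", 15))
```

Under these assumptions the stale configuration exhausts every attempt, which matches the steadily increasing retry count in the log.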
To resolve this and properly configure the SCM HA cluster, update ozone-site.xml on each SCM node with the following properties:
<property>
  <name>ozone.scm.nodes.scmservice</name>
  <value>scm1,scm2,scm3,new_scm_node_id</value>
</property>
And,
<property>
  <name>ozone.scm.address.scmservice.new_scm_node_id</name>
  <value>new_scm_host</value>
</property>
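Because the same change has to be made on every SCM node, a quick consistency check can help catch a node that was missed. The sketch below assumes address keys follow the pattern `ozone.scm.address.<serviceId>.<nodeId>`; the helper name `missing_scm_addresses` is hypothetical and not part of Ozone.

```python
import xml.etree.ElementTree as ET


def missing_scm_addresses(site_xml, service_id):
    """Return node IDs listed in ozone.scm.nodes.<service_id> that have no
    matching ozone.scm.address.<service_id>.<node_id> property."""
    root = ET.fromstring(site_xml)
    # Collect every <property> as a name -> value mapping.
    props = {
        (p.findtext("name") or "").strip(): (p.findtext("value") or "").strip()
        for p in root.iter("property")
    }
    raw_nodes = props.get(f"ozone.scm.nodes.{service_id}", "")
    # Tolerate stray whitespace around the comma-separated node IDs.
    nodes = [n.strip() for n in raw_nodes.split(",") if n.strip()]
    return [n for n in nodes if not props.get(f"ozone.scm.address.{service_id}.{n}")]


if __name__ == "__main__":
    import sys

    # Usage: python check_scm_config.py /path/to/ozone-site.xml scmservice
    with open(sys.argv[1], encoding="utf-8") as f:
        print(missing_scm_addresses(f.read(), sys.argv[2]))
```

An empty list means every listed node ID has an address entry; it does not verify that the host names are reachable.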