Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
pratyush.bhatt found that HBase goes down during a rolling restart of Ozone. It turns out that OM does not appear to retry block allocation while SCM is in safe mode: the allocateBlock call fails with an SCMException, as the SCM audit log below shows.
2024-01-30 16:57:39,846 [om1-OMStateMachineApplyTransactionThread - 0] INFO bucket.OMBucketCreateRequest (OMBucketCreateRequest.java:validateAndUpdateCache(296)) - created bucket: weichiu of layout FILE_SYSTEM_OPTIMIZED in volume: user
16:57:39.869 [IPC Server handler 0 on default port 15036] ERROR SCMAudit - user=weichiu | ip=10.96.129.4 | op=ALLOCATE_BLOCK {replication=RATIS/THREE, owner=omServiceIdDefault, size=4194304, num=1, client=} | ret=FAILURE
org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed for allocateBlock
    at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:157) ~[classes/:?]
    at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:204) ~[classes/:?]
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192) ~[classes/:?]
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142) ~[classes/:?]
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89) [classes/:?]
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113) [classes/:?]
    at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14430) [classes/:?]
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017) [hadoop-common-3.3.6.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_392]
    at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_392]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) [hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048) [hadoop-common-3.3.6.jar:?]
We should retry. I'll attach a reproduction test case for reference.
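To illustrate the retry behavior proposed here, a minimal sketch follows. It is not the actual patch and does not use Ozone's real classes: `SafeModeRetry`, `SafeModeException`, and `retryWhileInSafeMode` are hypothetical names, and the attempt count and sleep interval are arbitrary assumptions. The idea is only that a safe-mode failure is treated as transient and the allocation is retried a bounded number of times before the error is surfaced to the client.

```java
import java.util.concurrent.Callable;

public class SafeModeRetry {

    /** Hypothetical stand-in for the SCMException thrown while SCM is in safe mode. */
    static class SafeModeException extends Exception {
        SafeModeException(String message) {
            super(message);
        }
    }

    /**
     * Runs the call, retrying only on SafeModeException, up to maxAttempts times.
     * Any other exception propagates immediately without a retry.
     */
    static <T> T retryWhileInSafeMode(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SafeModeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (SafeModeException e) {
                last = e;
                // Wait before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        // All attempts failed with a safe-mode error; surface the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated allocateBlock: fails twice as if SCM were in safe mode, then succeeds.
        String block = retryWhileInSafeMode(() -> {
            if (++calls[0] < 3) {
                throw new SafeModeException("SafeModePrecheck failed for allocateBlock");
            }
            return "block-1";
        }, 5, 10L);
        System.out.println(block + " allocated after " + calls[0] + " attempts");
    }
}
```

A bounded retry keeps a client such as HBase from failing on the first safe-mode rejection during a rolling restart, while still surfacing the error if SCM stays in safe mode for too long.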
Attachments
Issue Links
- Discovered while testing: HDDS-7593 Supporting HSync and lease recovery (Resolved)