Apache Ozone / HDDS-10256

Block allocation should retry if SCM is in safe mode

Details

    Description

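      pratyush.bhatt found that HBase goes down while Ozone is undergoing a rolling restart. It turns out that OM does not appear to retry block allocation when SCM is in safe mode; the SCM audit log below shows the ALLOCATE_BLOCK call failing the safe mode precheck.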

      2024-01-30 16:57:39,846 [om1-OMStateMachineApplyTransactionThread - 0] INFO  bucket.OMBucketCreateRequest (OMBucketCreateRequest.java:validateAndUpdateCache(296)) - created bucket: weichiu of layout FILE_SYSTEM_OPTIMIZED in volume: user
      16:57:39.869 [IPC Server handler 0 on default port 15036] ERROR SCMAudit - user=weichiu | ip=10.96.129.4 | op=ALLOCATE_BLOCK {replication=RATIS/THREE, owner=omServiceIdDefault, size=4194304, num=1, client=} | ret=FAILURE
      org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed for allocateBlock
      	at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:157) ~[classes/:?]
      	at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:204) ~[classes/:?]
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:192) ~[classes/:?]
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:142) ~[classes/:?]
      	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89) [classes/:?]
      	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:113) [classes/:?]
      	at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14430) [classes/:?]
      	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017) [hadoop-common-3.3.6.jar:?]
      	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_392]
      	at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_392]
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) [hadoop-common-3.3.6.jar:?]
      	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048) [hadoop-common-3.3.6.jar:?]
      

      OM should retry the allocation while SCM is in safe mode. I'll attach a reproduction test case for reference.
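
To make the proposal concrete, here is a minimal sketch of caller-side retry on a safe mode failure. It is not the actual fix and does not use Ozone's classes: SafeModeRetrySketch, SafeModeException and allocateWithRetry are hypothetical names introduced for illustration, and the fixed delay and attempt limit are assumptions rather than values taken from the code base.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch; none of these names come from the Ozone code base. */
public class SafeModeRetrySketch {

  /** Stand-in for the SCMException raised when the safe mode precheck fails. */
  static class SafeModeException extends Exception {
    SafeModeException(String message) {
      super(message);
    }
  }

  /**
   * Invokes the allocation call, retrying only when it fails with
   * SafeModeException. Any other exception propagates immediately.
   * After maxAttempts failed attempts the last SafeModeException is rethrown.
   */
  static <T> T allocateWithRetry(Callable<T> allocate, int maxAttempts,
      long delayMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    SafeModeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return allocate.call();
      } catch (SafeModeException e) {
        last = e;
        // Wait before the next attempt, but not after the final one.
        if (attempt < maxAttempts) {
          Thread.sleep(delayMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated SCM: rejects the first two calls as if still in safe mode.
    String block = allocateWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new SafeModeException("SafeModePrecheck failed for allocateBlock");
      }
      return "block-1";
    }, 5, 10L);
    System.out.println(block + " after " + calls[0] + " attempts");
  }
}
```

In this sketch only the safe mode failure is treated as retriable, so unrelated errors still surface on the first attempt. A real change would more likely hook into the existing RPC retry or failover handling than wrap individual calls like this.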

            People

              ashishk Ashish Kumar
              pratyush.bhatt Pratyush Bhatt