Apache Ozone / HDDS-3816 Erasure Coding / HDDS-6445

EC: Fix allocateBlock failure due to inaccurate excludedNodes check.


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: EC-Branch
    • Component/s: None

    Description

      I hit a problem while running the following test:

      17 DNs; run ockg -p test -n 10 -s $((4*1024*1024*1024)) -t 10; then shut down 3 DNs one by one.
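      Spelled out, the repro is the freon key generator (a sketch, assuming the target bucket was already created with EC rs-10-4-1024k replication; ockg is the freon ozone client key generator):

      # write 10 keys of 4 GiB each, from 10 threads, under prefix 'test'
      ozone freon ockg -p test -n 10 -s $((4*1024*1024*1024)) -t 10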

      client trace:

      java.io.IOException: Allocated 0 blocks. Requested 1 blocks
              at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:175)
              at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
              at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
              at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:145)
              at com.codahale.metrics.Timer.time(Timer.java:101)
              at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:142)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
              at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)

      SCM trace:

      2022-03-11 09:16:33,562 [IPC Server handler 74 on default port 9863] ERROR org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider: Unable to allocate a container for EC/ECReplicationConfig{data=10, parity=4, ecChunkSize=1048576, codec=rs} after trying all existing containers
      org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2
              at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodes(SCMContainerPlacementRackScatter.java:105)
              at org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:74)
              at org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:40)
              at org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:90)
              at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:180)
              at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.allocateContainer(WritableECContainerProvider.java:168)
              at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:151)
              at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:51)
              at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:59)
              at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:176)
              at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
              at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
              at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
              at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
              at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
              at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14202)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
              at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
              at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)

      The problem is this line:

      org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2

      Actually, I only shut down 3 out of 17 DNs, so there should be 14 DNs left, which is enough for EC 10+4.

      Here SCM sees 15 DNs (the trace was taken right after a kill, so SCM had not yet received a stale event for that node) and 2 excluded DNs, so intuitively 15 - 2 = 13 leaves too few DNs. But remember that we only killed 3 of the 17 DNs, so 14 DNs should still be available.

      So the problem is that one of the two excluded DNs (the second one killed, not the last one) is already marked stale/dead and therefore not counted in the 15 DNs shown above. The check cannot simply compute 15 - 2 = 13 < 14 and throw; only excluded DNs that are still in the healthy list should be subtracted, which gives 15 - 1 = 14, exactly enough for EC 10+4.
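      For illustration, here is a minimal sketch of the two checks, in the spirit of the description above. The class and method names below are hypothetical stand-ins, not the actual SCMContainerPlacementRackScatter code:

      import java.io.IOException;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      // Hypothetical sketch; DatanodeDetails is a stand-in for the HDDS class.
      public class PlacementCheckSketch {

        static class DatanodeDetails { }

        // Inaccurate check: subtracts every excluded node from the healthy
        // total, even though an excluded node already marked stale/dead was
        // never counted in healthyNodes, so it is effectively subtracted twice.
        static void inaccurateCheck(List<DatanodeDetails> healthyNodes,
            List<DatanodeDetails> excludedNodes, int required)
            throws IOException {
          if (healthyNodes.size() - excludedNodes.size() < required) {
            throw new IOException("No enough datanodes to choose. TotalNode = "
                + healthyNodes.size() + " RequiredNode = " + required
                + " ExcludedNode = " + excludedNodes.size());
          }
        }

        // Accurate check: only subtract excluded nodes that are actually
        // present in the healthy list, so stale/dead nodes are not
        // double-counted.
        static void accurateCheck(List<DatanodeDetails> healthyNodes,
            List<DatanodeDetails> excludedNodes, int required)
            throws IOException {
          Set<DatanodeDetails> healthy = new HashSet<>(healthyNodes);
          long overlap =
              excludedNodes.stream().filter(healthy::contains).count();
          if (healthyNodes.size() - overlap < required) {
            throw new IOException("No enough datanodes to choose.");
          }
        }
      }

      In the scenario above, the first version computes 15 - 2 = 13 and throws, while the second computes 15 - 1 = 14 and lets allocateBlock proceed.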


            People

              Assignee: Mark Gui
              Reporter: Mark Gui
