Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
I hit a problem while running the following test:
17 DNs; run ockg -p test -n 10 -s $((4*1024*1024*1024)) -t 10 (the freon key generator, writing 4 GiB keys); shut down 3 DNs one by one during the run.
Client trace:
java.io.IOException: Allocated 0 blocks. Requested 1 blocks
    at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:175)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
    at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:145)
    at com.codahale.metrics.Timer.time(Timer.java:101)
    at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:142)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
    at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
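(For context, a single freon task is roughly equivalent to the following client-side write. This is a hand-written sketch, not freon's actual code; the volume/bucket names are made up and the exact createKey overload may differ across Ozone versions.)

import java.util.HashMap;

import org.apache.hadoop.hdds.client.ECReplicationConfig;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

// Rough equivalent of one freon ockg task against an EC 10+4 bucket.
public class EcWriteSketch {
  public static void main(String[] args) throws Exception {
    OzoneConfiguration conf = new OzoneConfiguration();
    try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
      // "vol1"/"bucket1" are hypothetical and assumed to exist already.
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");
      byte[] chunk = new byte[1024 * 1024];   // 1 MiB write buffer
      long keySize = 4L * 1024 * 1024 * 1024; // 4 GiB, matching -s
      // rs-10-4-1024k, matching the replication config in the SCM log.
      try (OzoneOutputStream out = bucket.createKey("test-key", keySize,
          new ECReplicationConfig(10, 4), new HashMap<>())) {
        for (long written = 0; written < keySize; written += chunk.length) {
          // Fails with "Allocated 0 blocks. Requested 1 blocks" once SCM
          // starts rejecting block allocation.
          out.write(chunk);
        }
      }
    }
  }
}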
SCM trace:
2022-03-11 09:16:33,562 [IPC Server handler 74 on default port 9863] ERROR org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider: Unable to allocate a container for EC/ECReplicationConfig{data=10, parity=4, ecChunkSize=1048576, codec=rs} after trying all existing containers
org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2
    at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackScatter.chooseDatanodes(SCMContainerPlacementRackScatter.java:105)
    at org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:74)
    at org.apache.hadoop.hdds.scm.pipeline.ECPipelineProvider.create(ECPipelineProvider.java:40)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:90)
    at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:180)
    at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.allocateContainer(WritableECContainerProvider.java:168)
    at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:151)
    at org.apache.hadoop.hdds.scm.pipeline.WritableECContainerProvider.getContainer(WritableECContainerProvider.java:51)
    at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:59)
    at org.apache.hadoop.hdds.scm.block.BlockManagerImpl.allocateBlock(BlockManagerImpl.java:176)
    at org.apache.hadoop.hdds.scm.server.SCMBlockProtocolServer.allocateBlock(SCMBlockProtocolServer.java:194)
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.allocateScmBlock(ScmBlockLocationProtocolServerSideTranslatorPB.java:180)
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.processMessage(ScmBlockLocationProtocolServerSideTranslatorPB.java:130)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:112)
    at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:14202)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
The problem is this line:
org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to choose. TotalNode = 15 RequiredNode = 14 ExcludedNode = 2
Actually I only shut down 3 out of 17 DNs, so there should be 14 DNs left, which is enough for EC 10+4.
Here SCM reports 15 DNs (the log was captured right after a kill, so SCM had not yet received stale-node events for that node) and 2 excluded DNs, so at first glance there are not enough DNs. But remember we only killed 3 DNs, so there should still be enough left.
The root cause is that one of the two excluded DNs (the second DN that was killed, not the last one) is already marked stale/dead, so it is not part of the 15 DNs counted above. The placement check therefore cannot simply compute 15 - 2 = 13 < 14 and throw: the exclude list and the total node count only partially overlap, and the naive subtraction counts the already-dead DN twice. See the sketch below.
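A minimal sketch of the suspected pre-check, assuming the availability test in SCMContainerPlacementRackScatter.chooseDatanodes subtracts the raw exclude-list size from the healthy-node count (all names below are made up; the real code differs). Counting only the excluded DNs that are still present in the healthy list gives the right answer for this scenario:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reconstruction of the availability pre-check; the actual
// code in SCMContainerPlacementRackScatter differs in types and naming.
public class NodeCountSketch {

  // Buggy check: assumes every excluded node is still counted among the
  // healthy nodes, so an excluded DN that is already stale/dead gets
  // subtracted twice.
  static boolean enoughNodesBuggy(List<String> healthy,
      List<String> excluded, int required) {
    return healthy.size() - excluded.size() >= required;
  }

  // Fixed check: only subtract the excluded nodes that actually appear
  // in the healthy list.
  static boolean enoughNodesFixed(List<String> healthy,
      List<String> excluded, int required) {
    Set<String> healthySet = new HashSet<>(healthy);
    long overlap = excluded.stream().filter(healthySet::contains).count();
    return healthy.size() - overlap >= required;
  }

  public static void main(String[] args) {
    // 17 DNs dn1..dn17; dn1 and dn2 are already detected dead by SCM,
    // dn3 was killed last and is still counted healthy: TotalNode = 15.
    List<String> healthy = new ArrayList<>();
    for (int i = 3; i <= 17; i++) {
      healthy.add("dn" + i);
    }
    // The exclude list holds 2 DNs: dn2 (already dead in SCM's view)
    // and dn3 (still "healthy" in SCM's view).
    List<String> excluded = Arrays.asList("dn2", "dn3");
    int required = 14; // EC 10+4

    System.out.println(enoughNodesBuggy(healthy, excluded, required));
    // -> false: 15 - 2 = 13 < 14, the spurious SCMException above
    System.out.println(enoughNodesFixed(healthy, excluded, required));
    // -> true: only dn3 overlaps, so 15 - 1 = 14 >= 14
  }
}

With the numbers from this report, the buggy check computes 15 - 2 = 13 < 14 and throws, while the overlap-aware check computes 15 - 1 = 14 and allows the allocation to proceed.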