Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.2.0
-
None
Description
I found this bug while adding robot tests for the ozone debug CLI, and I was able to reproduce it locally. I had three datanodes and created a new pipeline with the ozone admin pipeline create command, which chose a datanode and created a STANDALONE/ONE pipeline on it. I then stopped a datanode and waited until it reached the DEAD state. After I started it again, no RATIS/THREE pipeline was created, even though there were three healthy datanodes and no RATIS/THREE pipeline existed.
In the docker-config, the ozone.scm.datanode.pipeline.limit property is set to 1 (the default is 2) because of multi-raft support. When a pipeline is created, we build a list of healthy datanodes and filter it by the pipeline limit. The current pipeline count on a datanode is calculated like this:
    int currentPipelineCount(DatanodeDetails datanodeDetails, int nodesRequired) {
      // Datanodes from pipeline in some states can also be considered available
      // for pipeline allocation. Thus the number of these pipeline shall be
      // deducted from total heaviness calculation.
      int pipelineNumDeductable = 0;
      Set<PipelineID> pipelines = nodeManager.getPipelines(datanodeDetails);
      for (PipelineID pid : pipelines) {
        Pipeline pipeline;
        try {
          pipeline = stateManager.getPipeline(pid);
        } catch (PipelineNotFoundException e) {
          LOG.debug("Pipeline not found in pipeline state manager during" +
              " pipeline creation. PipelineID: {}", pid, e);
          continue;
        }
        if (pipeline != null &&
            // single node pipeline are not accounted for while determining
            // the pipeline limit for dn
            pipeline.getType() == HddsProtos.ReplicationType.RATIS &&
            (RatisReplicationConfig
                .hasFactor(pipeline.getReplicationConfig(), ReplicationFactor.ONE) ||
                pipeline.getReplicationConfig().getRequiredNodes() == nodesRequired &&
                    pipeline.getPipelineState() == Pipeline.PipelineState.CLOSED)) {
          pipelineNumDeductable++;
        }
      }
      return pipelines.size() - pipelineNumDeductable;
    }
Only RATIS pipelines are deducted (because of the condition pipeline.getType() == HddsProtos.ReplicationType.RATIS), so the STANDALONE/ONE pipeline is counted toward the limit. That datanode therefore reaches the pipeline limit, and no RATIS/THREE pipeline is created.
We should deduct all single-node pipelines in this check, regardless of replication type.
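To make the counting problem concrete, here is a minimal, self-contained sketch. It does not use the real Ozone classes: Pipeline, Type, State, countBeforeFix and countAfterFix are simplified, hypothetical stand-ins, and countAfterFix is only one possible shape of the proposed change, not necessarily the committed fix. It models a datanode that hosts a single STANDALONE/ONE pipeline with the pipeline limit set to 1.

```java
import java.util.Arrays;
import java.util.List;

class PipelineCountSketch {

  enum Type { RATIS, STANDALONE }

  enum State { OPEN, CLOSED }

  // Hypothetical, simplified stand-in for Ozone's Pipeline class.
  static final class Pipeline {
    final Type type;
    final int requiredNodes;
    final State state;

    Pipeline(Type type, int requiredNodes, State state) {
      this.type = type;
      this.requiredNodes = requiredNodes;
      this.state = state;
    }
  }

  // Mirrors the condition quoted in the report: a pipeline is deducted
  // only if it is a RATIS pipeline.
  static int countBeforeFix(List<Pipeline> pipelines, int nodesRequired) {
    int deductible = 0;
    for (Pipeline p : pipelines) {
      if (p.type == Type.RATIS
          && (p.requiredNodes == 1
              || (p.requiredNodes == nodesRequired
                  && p.state == State.CLOSED))) {
        deductible++;
      }
    }
    return pipelines.size() - deductible;
  }

  // Assumed shape of the proposed change: every single-node pipeline is
  // deducted, whatever its replication type.
  static int countAfterFix(List<Pipeline> pipelines, int nodesRequired) {
    int deductible = 0;
    for (Pipeline p : pipelines) {
      boolean singleNode = p.requiredNodes == 1;
      boolean closedSameSizeRatis = p.type == Type.RATIS
          && p.requiredNodes == nodesRequired
          && p.state == State.CLOSED;
      if (singleNode || closedSameSizeRatis) {
        deductible++;
      }
    }
    return pipelines.size() - deductible;
  }

  public static void main(String[] args) {
    int limit = 1; // ozone.scm.datanode.pipeline.limit in the docker-config
    List<Pipeline> onDatanode = Arrays.asList(
        new Pipeline(Type.STANDALONE, 1, State.OPEN));

    int before = countBeforeFix(onDatanode, 3);
    int after = countAfterFix(onDatanode, 3);

    // A datanode is a candidate only while its count is below the limit.
    System.out.println("before fix: count=" + before
        + ", eligible=" + (before < limit));
    System.out.println("after fix:  count=" + after
        + ", eligible=" + (after < limit));
  }
}
```

With the original condition the STANDALONE/ONE pipeline gives a count of 1, which is not below the limit of 1, so the datanode is filtered out. Once single-node pipelines are deducted regardless of type, the count is 0 and the datanode stays eligible for a RATIS/THREE pipeline.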