I found this bug while adding robot tests for the ozone debug CLI, and I was able to reproduce it locally. I had three datanodes and created a new pipeline with the ozone admin pipeline create command, which picked a datanode and created a STANDALONE/ONE pipeline on it. I then stopped a datanode and waited until it was in the DEAD state. After I started it again, no RATIS/THREE pipeline was created, even though there were three healthy datanodes and no existing RATIS/THREE pipeline.
In the docker-config, the ozone.scm.datanode.pipeline.limit property is set to 1 (the default is 2) because of the multi-raft support. When we try to create a pipeline, we build a list of healthy datanodes and filter it based on the pipeline limit. The current pipeline count of a datanode is calculated as its total number of pipelines minus the pipelines we consider deductible.
We only deduct pipelines of the RATIS replication type (because of this condition: pipeline.getType() == HddsProtos.ReplicationType.RATIS), so the STANDALONE/ONE pipeline is still counted. Because of that, the datanode hosting it reaches the pipeline limit and is filtered out of the healthy datanode list, which leaves only two candidates, so no RATIS/THREE pipeline is created.
We should deduct all single-node pipelines in this check, regardless of their replication type.
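To make the difference concrete, here is a minimal, self-contained sketch of the two ways of counting. It is not the actual SCM code: the class PipelineLimitSketch, its nested types and the method names are hypothetical stand-ins. The factor ONE check in the first method and the count < limit comparison are my assumptions about how the check works.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the real Ozone types; not the actual SCM classes.
class PipelineLimitSketch {
  enum ReplicationType { RATIS, STANDALONE }
  enum ReplicationFactor { ONE, THREE }

  static final class Pipeline {
    final ReplicationType type;
    final ReplicationFactor factor;

    Pipeline(ReplicationType type, ReplicationFactor factor) {
      this.type = type;
      this.factor = factor;
    }
  }

  // Behaviour described in this report: a pipeline is deducted only if it is
  // of type RATIS (assumed here: RATIS with factor ONE).
  static int countDeductingRatisOnly(List<Pipeline> pipelines) {
    int deductible = 0;
    for (Pipeline p : pipelines) {
      if (p.type == ReplicationType.RATIS && p.factor == ReplicationFactor.ONE) {
        deductible++;
      }
    }
    return pipelines.size() - deductible;
  }

  // Proposed behaviour: every single-node pipeline is deducted,
  // whatever its replication type.
  static int countDeductingAllSingleNode(List<Pipeline> pipelines) {
    int deductible = 0;
    for (Pipeline p : pipelines) {
      if (p.factor == ReplicationFactor.ONE) {
        deductible++;
      }
    }
    return pipelines.size() - deductible;
  }

  // Assumption: a datanode can take another pipeline while it is below the limit.
  static boolean hasCapacity(int currentCount, int limit) {
    return currentCount < limit;
  }

  public static void main(String[] args) {
    // The datanode chosen by "ozone admin pipeline create" hosts one STANDALONE/ONE pipeline.
    List<Pipeline> onDatanode = Arrays.asList(
        new Pipeline(ReplicationType.STANDALONE, ReplicationFactor.ONE));
    int limit = 1; // ozone.scm.datanode.pipeline.limit in the docker-config

    int before = countDeductingRatisOnly(onDatanode);
    int after = countDeductingAllSingleNode(onDatanode);
    System.out.println("RATIS-only deduction:  count=" + before
        + ", hasCapacity=" + hasCapacity(before, limit));
    System.out.println("all single-node:       count=" + after
        + ", hasCapacity=" + hasCapacity(after, limit));
  }
}
```

With a limit of 1, the first count is 1, so the datanode is treated as full. The second count is 0, so the datanode stays eligible for the RATIS/THREE pipeline.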