I've attached whirr-167-1.patch. I'm sure it's not the final version, but I'd like to hear your opinions too.
I changed ClusterSpec and InstanceTemplate so that a minimum percentage of successfully started nodes can be specified.
If nothing is specified, it defaults to 100%, so a value like
whirr.instance-templates=1 jt+nn,4 dn+tt%60
means that the "jt+nn" roles pass only when 100% of their nodes start successfully, and
the "dn+tt" roles pass when at least 60% of their nodes start successfully.
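The optional percentage suffix could be parsed along these lines. This is just an illustrative sketch, not the code from the patch; the class and method names (TemplateSpec, parse) are made up here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one instance-template entry such as "4 dn+tt%60";
// a missing "%NN" suffix means 100% of the nodes must start successfully.
public class TemplateSpec {
  // "<count> <roles>[%<minPercent>]"
  private static final Pattern ENTRY =
      Pattern.compile("(\\d+)\\s+([^%\\s]+)(?:%(\\d+))?");

  public final int count;
  public final String roles;
  public final int minPercent; // 100 when no suffix is given

  public TemplateSpec(int count, String roles, int minPercent) {
    this.count = count;
    this.roles = roles;
    this.minPercent = minPercent;
  }

  public static TemplateSpec parse(String entry) {
    Matcher m = ENTRY.matcher(entry.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Bad template entry: " + entry);
    }
    int count = Integer.parseInt(m.group(1));
    int minPercent = m.group(3) == null ? 100 : Integer.parseInt(m.group(3));
    return new TemplateSpec(count, m.group(2), minPercent);
  }
}
```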
If any role fails to meet its minimum requirement, a retry phase is initiated in which the failed nodes of each role are replaced with new ones. This means that even a namenode startup problem no longer loses the whole cluster. Without any retries, a namenode failure would break the entire cluster even with many dn+tt nodes successfully started. I think it is worthwhile to minimize the chance of failing this way, so I introduced a retry cycle.
If there are failures only in dn+tt while the minimum limit is still met, the cluster starts up with just that number of nodes, without any retry.
A retry cycle gives both roles a chance to grow the number of nodes back toward the maximum.
At the moment I don't think more than one retry is worthwhile; the goal is just to paper over a few sporadic service problems.
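The minimum-percentage check and the size of the single retry round boil down to some simple arithmetic. A minimal sketch, with made-up method names rather than anything from the patch:

```java
// Illustrative helpers for the per-role bookkeeping described above.
public class MinimumCheck {

  /** True when started/requested reaches the role's minimum percentage. */
  public static boolean minimumMet(int requested, int started, int minPercent) {
    // Integer arithmetic avoids floating-point comparison issues.
    return started * 100 >= requested * minPercent;
  }

  /** Nodes to request in the one retry round: the shortfall up to the original maximum. */
  public static int retryCount(int requested, int started) {
    return Math.max(0, requested - started);
  }
}
```

So with "4 dn+tt%60", three started nodes pass (75% &gt;= 60%) while two do not, and the retry asks for the two missing nodes.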
My question is whether we should keep the retry in case of insufficient nodes, or make no-retry the default and add an extra parameter to enable it. Initially I didn't like the idea of adding more parameters.
About failing nodes, there are two different cases:
1. When the minimum required number of nodes cannot be satisfied even by the retry cycle, all of the lost nodes are left as they are; a full cluster destroy will remove them.
2. When the number of nodes is satisfied, either in the first round or after the retry, all failed nodes (from both the first round and the retry cycle) are destroyed automatically at the end of BootstrapClusterAction.doAction.
I ran into some difficulties destroying the nodes. Initially I used destroyNodesMatching(Predicate&lt;NodeMetadata&gt; filter), which terminates all of the enumerated nodes in parallel, but it also wants to delete the security group and placement group. So I had to fall back to the simple destroyNode(String id), which deletes the nodes sequentially, and with it I cannot control the KeyPair deletion either. In my opinion the jclouds library is missing convenience methods to revoke a set of nodes without the optional propagation of KeyPair, SecurityGroup and PlacementGroup cleanup. This is where I got stuck; I couldn't find an elegant solution that avoids the revoke process.
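For clarity, the sequential fallback looks roughly like this. The NodeDestroyer interface below is a pared-down stand-in for org.jclouds.compute.ComputeService (only the destroyNode(String) call is assumed), so the sketch stays self-contained:

```java
import java.util.List;

// Sketch of the sequential cleanup of failed nodes described above.
public class FailedNodeCleanup {

  /** Hypothetical stand-in for the single-node call on jclouds' ComputeService. */
  interface NodeDestroyer {
    void destroyNode(String id);
  }

  /**
   * Destroy each failed node one by one. Unlike destroyNodesMatching(...),
   * this leaves the security group and placement group alone, at the cost
   * of running sequentially.
   */
  public static void destroyFailedNodes(NodeDestroyer service, List<String> failedIds) {
    for (String id : failedIds) {
      service.destroyNode(id);
    }
  }
}
```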
About the Mockito simulation of the retry:
Unfortunately Mockito cannot mock the static ComputeServiceContextBuilder.build(clusterSpec) method, so I could not write a JUnit test for the retry. I could only test the retry and bad-node cleanup by temporarily hardcoding an exception and running a live integration test. If somebody has an idea how to mock all those static methods in BootstrapClusterAction, feel free to point me to a solution.