I'd really like the minicluster to stop defaulting to a startup race, where start() returns before the cluster has actually finished starting. Multiple tests are currently failing sporadically because of this, so I'd like the start() method not to return until the cluster is started. For non-HA setups this seems very straightforward.
However, for the HA minicluster it appears the intent is to have the RMs all come up in standby. The problem is that the NM start method will not return until it has successfully registered with an RM. Since all RMs are in standby, the NM start never completes, the minicluster start never completes, and we never get to the part of the test where it activates an RM. With a blocking start(), HA minicluster tests would therefore always time out.
I like Eric's proposal to have the minicluster activate the first RM during the start method of an HA cluster. That way we can bring the cluster up and return from the start method with no pending start processing, and therefore no race conditions in the test using the cluster. However, that could break some of the assumptions of those using the HA minicluster in their existing tests. For Hadoop tests we can simply fix up the tests accordingly, if necessary (most seem to activate the first RM anyway), but I don't know if there are other tests that use an HA minicluster and will break if the first RM is already active by default.
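To make the ordering problem concrete, here is a rough toy model of it. This is only a sketch: ToyRM, ToyNM, startCluster and the activateFirstRM flag are hypothetical stand-ins I made up for illustration, not the real MiniYARNCluster classes or API. The real NM start blocks indefinitely, while the toy version gives up after a bounded number of attempts so the example terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical stand-ins, not the real MiniYARNCluster API.
public class MiniClusterStartSketch {

    /** Stand-in for a ResourceManager that starts in standby. */
    static final class ToyRM {
        private boolean active;
        boolean isActive() { return active; }
        void transitionToActive() { active = true; }
    }

    /** Stand-in for a NodeManager that can only register with an active RM. */
    static final class ToyNM {
        // The real NM start blocks until registration succeeds. Here we poll a
        // bounded number of times so the sketch reports failure instead of hanging.
        boolean start(List<ToyRM> rms, int maxAttempts) {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                for (ToyRM rm : rms) {
                    if (rm.isActive()) {
                        return true; // registered with the active RM
                    }
                }
            }
            return false; // every RM stayed in standby
        }
    }

    /** Returns true if the toy cluster start completes, false if it would hang. */
    static boolean startCluster(int numRMs, boolean activateFirstRM) {
        List<ToyRM> rms = new ArrayList<>();
        for (int i = 0; i < numRMs; i++) {
            rms.add(new ToyRM());
        }
        // The proposal: activate the first RM inside start(), before the NMs start.
        if (activateFirstRM && !rms.isEmpty()) {
            rms.get(0).transitionToActive();
        }
        return new ToyNM().start(rms, 3);
    }

    public static void main(String[] args) {
        System.out.println("all standby: " + startCluster(2, false));
        System.out.println("first RM active: " + startCluster(2, true));
    }
}
```

With all RMs left in standby the toy start reports failure, which corresponds to the hang in the real cluster. With the first RM activated inside start, it completes.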
Karthik Kambatla, do you have an opinion on this?