1. Alternate between using haadmin and kill -9'ing the Namenode. We shouldn't see a difference here, but it would be nice to test both coordinated and automatic failover.
As I mentioned to Keith Turner in the review, doing this will require a heuristic. We can ask the HDFS admin tools for the hostname corresponding to the namenode id, but picking out the namenode process will be version dependent. I think he and I agreed that that sort of thing was better left to something like BigTop, since it attempts to work across projects.
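For the part the admin tools do support, something like this is what I had in mind (the function name and the "ns1"/"nn1" IDs below are illustrative, not what the script actually uses):

```shell
# Resolve the host for a given namenode ID by asking HDFS for its
# configured RPC address. The version-dependent part -- finding the
# actual namenode *process* on that host -- is deliberately left out.
nn_host_for_id() {
  local nameservice=$1 nnid=$2
  # dfs.namenode.rpc-address.<nameservice>.<nnid> is "host:port";
  # strip the port to get just the hostname.
  hdfs getconf -confKey "dfs.namenode.rpc-address.${nameservice}.${nnid}" \
    | cut -d: -f1
}

nn_host_for_id ns1 nn1
```

Killing the right process on that host is where the version-dependent heuristic would start, which is why punting to BigTop seems reasonable.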
2. Some more validation before anything else: Can the user sudo to the hdfs admin user as they claim?
Opened ACCUMULO-1982 to cover sudoing to other users generally.
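In the meantime, the up-front check could be as small as this (treating "hdfs" as the admin user name is an assumption; the helper name is illustrative):

```shell
# Non-interactively verify we can sudo to the HDFS admin user before
# doing anything destructive. -n fails instead of prompting for a
# password, so this is safe to run unattended.
can_sudo_to() {
  local user=$1
  sudo -n -u "$user" true 2>/dev/null
}

can_sudo_to hdfs || echo "cannot sudo to hdfs; aborting" >&2
```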
Do the executables (hdfs, sudo) exist?
The existing tests for executability should cover this, no? Or are you looking for more specific error messages?
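If the goal is a more specific message rather than a new check, a small wrapper like this sketch would do it (name and wording are illustrative):

```shell
# Fail fast with a named command in the error message instead of a
# generic executability failure later on. Accepts either an absolute
# path or a bare command name found on PATH.
require_executable() {
  local cmd=$1
  if [ ! -x "$cmd" ] && ! command -v "$cmd" >/dev/null 2>&1; then
    echo "ERROR: required command '$cmd' is missing or not executable" >&2
    return 1
  fi
}

require_executable hdfs && require_executable sudo
```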
Does the namespace provided exist (or can we find any namespaces if we're using all of them)?
Both of these cases are handled by the current error checking, though the error message for the former is confusing: it complains about a missing configuration value rather than a missing namespace.
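To make that message less confusing, the discovery step could say what it was actually looking for, along these lines (function name and error text are illustrative):

```shell
# List the configured HDFS nameservices, one per line, with an error
# message that names the missing setting instead of a generic
# "missing configuration value" complaint.
configured_nameservices() {
  local ns
  ns=$(hdfs getconf -confKey dfs.nameservices 2>/dev/null)
  if [ -z "$ns" ]; then
    echo "ERROR: no nameservices found; is dfs.nameservices set in hdfs-site.xml?" >&2
    return 1
  fi
  # dfs.nameservices is comma-separated.
  echo "$ns" | tr ',' '\n'
}
```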
Can we find namenodes for the namespaces configured?
This is covered in the current error handling.
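For reference, this lookup boils down to one more getconf call per nameservice; a sketch with an illustrative name and error message:

```shell
# List the namenode IDs configured for a single nameservice, failing
# with a specific message when the HA namenode list is absent.
namenodes_for() {
  local nameservice=$1 ids
  ids=$(hdfs getconf -confKey "dfs.ha.namenodes.${nameservice}" 2>/dev/null)
  if [ -z "$ids" ]; then
    echo "ERROR: no namenodes configured for nameservice '${nameservice}'" >&2
    return 1
  fi
  echo "$ids" | tr ',' '\n'
}
```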
The only other thing I'm curious about is when the script tries to choose a random namenode to make active: could we ever get into that block while ZKFC is in the middle of a transition? In other words, is it possible to have no active namenodes while automatic failover is happening, and to get an error because we try to force the transition?
Yes, this is certainly possible. As things currently are, we'll simply log a message that this happened and try again the next time around. I couldn't think of anything else worth doing in that case.
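The shape of that logic is roughly this (function name is illustrative; the real script differs in the details):

```shell
# Scan the given namenode IDs for one reporting "active". If none is
# active -- e.g. ZKFC is mid-transition -- log a warning and let the
# next iteration retry rather than forcing a transition now.
pick_active_namenode() {
  local nnid state
  for nnid in "$@"; do
    state=$(hdfs haadmin -getServiceState "$nnid" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$nnid"
      return 0
    fi
  done
  echo "WARN: no active namenode found (failover in progress?); will retry later" >&2
  return 1
}
```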
Note that it's also possible for an automatic failover to change which namenode is active while we are in the block that issues the failover command. In that case, if there are only 2 namenodes, we'll just do a no-op failover that reports success. If Hadoop ever supports more than 2 namenodes per nameservice, I don't know what the command will do, but I know we'll log the result and try again later.
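That failover-and-recheck step looks roughly like this sketch (helper name is illustrative; with exactly two namenodes, failing over toward an already-active node is the silent no-op described above):

```shell
# Issue a coordinated failover, then re-check which node is active.
# If the target didn't end up active (e.g. automatic failover moved
# things underneath us), warn and let a later iteration retry.
failover_and_verify() {
  local from=$1 to=$2 state
  hdfs haadmin -failover "$from" "$to"
  state=$(hdfs haadmin -getServiceState "$to" 2>/dev/null)
  if [ "$state" != "active" ]; then
    echo "WARN: ${to} not active after failover; will retry later" >&2
    return 1
  fi
}
```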