dfs.namenode.standby.checkpoints - perhaps include ".ha" in there to make it clear that this option is only applicable in an HA setup
renamed to dfs.ha.standby.checkpoints and DFS_HA_STANDBY_CHECKPOINTS_KEY
Might as well make the members of CheckpointConf final.
LOG.info("Counted txns in " + file + ": " + val.getNumTransactions()); - Either should be removed or should not be info level.
prepareStopStandbyServices is kind of a weird name. Perhaps "prepareToStopStandbyServices" ?
"// TODO interface audience" in TransferFsImage
"TODO: need to cancel the savenamespace operation if it's in flight" - I think this comment is no longer applicable to this patch, right?
LOG.info("Time for a checkpoint !"); - while strictly accurate, this doesn't seem to be the most helpful log message.
e.printStackTrace(); in CheckpointerThread should probably be tossed.
Nit: in CheckpointerThread#doWork: "if(UserGroupInformation.isSecurityEnabled())" - space between "if" and "(", and curly braces around body of "if".
You use "System.currentTimeMillis" in a bunch of places. How about replacing with "o.a.h.hdfs.server.common.Util#now" ?
fixed the above
Does it not seem strange to you that the order of operations when setting a state is "prepareExit -> prepareEnter -> exit -> enter," instead of "prepareExit -> exit -> prepareEnter -> enter
The point of the prepare* methods is that they have to happen before the lock is taken. So, prepareEnter can't happen after exit, because the lock already is held there. I clarified the javadoc a bit.
What's the point of the changes in EditLogTailer?
In order for the test to spy on saveNamespace, I had to move the getFSImage call down. Otherwise, the spy wasn't getting picked up properly and the test was failing.
Can we make CheckpointerThread a static inner class?
Currently it calls doCheckpoint in the outer class. I suppose it could be static, but it isn't really easy to test in isolation anyway, so I'm going to punt o this.
Does it make sense to explicitly disallow the SBN from allowing checkpoints to be uploaded to it?
Yes and no... I sort of see your point. But, people have also discussed an external tool which would perform checkpoints for many clusters and then upload them