Attaching a patch which should address all of the individual comments:
assertAllResultsEqual - for loop can just start with i = 1? Also if the collection objects is of size zero or one, the method can return early. Is there a need to do object.toArray() for these early checks? With that, perhaps the findbugs exclude may not be necessary.
Good thinking re: starting the loop at 1 and exiting early. Even with that, I don't think the findbugs exclude can be removed. The issue is this: "(currElement == null && currElement != lastElement)". In that code we know that currElement is equal to null, and then we compare the known-null value to something else. This is correct, and even though I could have explicitly compared lastElement to null, this seemed clearer to me. Let me know if you disagree.
Unit test can be added for methods isAtLeastOneActive, getRpcAddressesForNameserviceId and getProxiesForAllNameNodesInNameservice (I am okay if this is done in a separate jira)
Done. I added a single unit test which wouldn't work if any of the three of these were broken, since isAtLeastOneActive by definition uses the other two.
Finalizing upgrade is quite tricky. Consider the following scenarios:
See below for comments on this.
Why is throw new AssertionError("Unreachable code."); in QuorumJournalManager.java methods?
These methods have to have to either return something or unconditionally throw an exception. In fact this method will always either return a value (when all JNs are up and responsive) or else will throw an exception. However, the compiler can't discern this, so I put in these exceptions to placate it. It was Todd's suggestion to throw AssertionError specifically. See his comment on
HDFS-5138 on 2014/01/07.
FSImage#doRollBack() - when canRollBack is false after checking if non-share directories can rollback, an exception must be immediately thrown, instead of checking shared editlog.
I don't think I agree with this. Imagine a scenario where an earlier rollback attempt had succeeded for all the local storage dirs, but not the shared edit log. Shouldn't we continue the rollback in this case?
Also printing Log.info when storages can be rolled back will help in debugging.
Good thinking. Done.
FSEditlog#canRollBackSharedLog should accept StorageInfo instead of Storage
Good catch. Done.
QuorumJournalManager#canRollBack and getJournalCTime can throw AssertionError (from DFSUtil.assertAllResultsEqual()). Is that the right exception to expose or IOException?
I don't think it actually makes a functional difference, since in either case we bail out of the process, but I've changed it nonetheless since we tend to favor IOExceptions throughout the codebase.
Namenode startup throws AssertionError with -rollback option. I think we should throw IOException, which is how all the other failures are indicated.
Note that this code shouldn't be reachable anymore from actually running the command since NameNode#doRollback doesn't actually start the NN daemon anymore - just performs the rollback and shuts down. I only left this in there to make sure that in the future we don't inadvertently reintroduce the old behavior somehow. Given this, I think we should leave it as an AssertionError. Do you agree?
I've also updated the documentation as you suggested.
As for handling the partial upgrade failure as you've described, I'd like to add one more RPC call to the JournalManager to initiate analysis/recovery of the storage dirs upon first contact, and then refactor the contents of FSImage#recoverStorageDirs into NNUpgradeUtil just like was done with the other upgrade-related procedures. If this sounds OK to you, I'll go ahead and add that stuff and appropriate tests.