Test failures are unrelated and tracked elsewhere.
Patch looks pretty good, but there are some nits:
Not caused by this JIRA, but the node rejoin transition has the contains-followed-by-get problem. If another thread removes the node after the containsKey call then the get will return null but the code doesn't handle that and will NPE. It is safer and faster to just do the get then check for null rather than perform the key lookup twice. This also avoids the awkward declaring of the previousRMNode variable with no initial value.
Similarly we should not call get-then-remove when checking for the unknown node. Instead we can simply call remove on the unknown node ID and check if the remove returned anything to know if it was there.
In TestResourceTrackerService, wouldn't it be simpler to create an overloaded form of writeToHostsFile that takes the specified file rather than replacing the existing method? Then we can implement the original method in terms of the new method and cut out a large portion of this patch where it has to fixup all the original calls of the removed method.
4 seconds seems pretty aggressive for a test timeout, especially with multiple RM bringups and teardowns involved. If this runs on a sluggish jenkins machine that happens to pause at the wrong time then the test fails – i.e.: is it important that this test fails if it executes in 5 seconds instead? Seems like the timeout should be at least 20 seconds, if there is an explicit timeout specified at all (surefire has one built in).
Should we just change MockRM to use the drain dispatcher and expose a drain events method rather than fix a bunch of places to override which dispatcher to use?