> AM talking to the RM – the AM currently logs an exception and
> continues (RMCommunicator.startAllocatorThread()). This should
> be fixed in the MRAppMaster based on the kind of exception
> (temporary timeout versus some kind of a kill response from the RM).
If I'm debugging this correctly, the RMCommunicator (via the startAllocatorThread() method) sends a heartbeat to the RM. When the RM is down, this heartbeat catches an UndeclaredThrowableException caused by a ConnectException. The RMCommunicator sends the heartbeat about every second (depending on the config option), and this exception is thrown on each heartbeat for as long as the RM is down. When the RM comes back up, however, exceptions stop being thrown altogether.
I'm still investigating to see why no exception is thrown.
It seems that the "right" thing is for this communication mechanism between the RM and the AM to recognize that the AM is no longer valid and throw the appropriate exception so that the AM can exit cleanly.
It looks like when the "rogue" MRAM contacts the RM, the RM tells the AM to reboot, but the MRAM ignores it.
I would say that on the MRAM side, RMContainerAllocator.getResource() calls RMContainerRequestor.makeRemoteRequest() to get the response from the RM. At that point, RMContainerAllocator.getResource() should check the reboot flag in the response and throw an exception, which should cause the RMCommunicator thread to exit.
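
To make the proposal concrete, here is a minimal standalone sketch of the behavior I have in mind. It is not the actual Hadoop code: Response, ResourceManagerClient, RebootRequestedException, heartbeat() and runAllocatorLoop() are all hypothetical stand-ins, the ConnectException is thrown directly rather than wrapped in an UndeclaredThrowableException, and the sleep between heartbeats is left out. The point is only that a connection failure is treated as temporary (log and continue), while a reboot flag in the response ends the loop.

```java
import java.net.ConnectException;

public class HeartbeatSketch {
    /** Hypothetical stand-in for the RM's response; only the reboot flag is modeled. */
    static final class Response {
        final boolean reboot;
        Response(boolean reboot) { this.reboot = reboot; }
    }

    /** Hypothetical stand-in for the call that fetches the response from the RM. */
    interface ResourceManagerClient {
        Response makeRemoteRequest() throws ConnectException;
    }

    /** Hypothetical exception meaning the RM no longer considers this AM valid. */
    static final class RebootRequestedException extends RuntimeException {
        RebootRequestedException(String message) { super(message); }
    }

    /** One heartbeat: get the response and check the reboot flag before anything else. */
    static void heartbeat(ResourceManagerClient rm) throws ConnectException {
        Response response = rm.makeRemoteRequest();
        if (response.reboot) {
            throw new RebootRequestedException("RM asked this AM to reboot");
        }
        // Normal processing of the response would go here.
    }

    /** Heartbeat loop; returns a short description of why it stopped. */
    static String runAllocatorLoop(ResourceManagerClient rm, int maxHeartbeats) {
        int connectionFailures = 0;
        for (int i = 0; i < maxHeartbeats; i++) {
            try {
                heartbeat(rm);
            } catch (ConnectException e) {
                // Temporary failure (RM down): count it and keep heartbeating.
                connectionFailures++;
            } catch (RebootRequestedException e) {
                // Kill-style response: stop the loop so the AM can exit cleanly.
                return "exited after " + connectionFailures
                        + " connection failures: " + e.getMessage();
            }
        }
        return "still running after " + maxHeartbeats + " heartbeats";
    }

    public static void main(String[] args) {
        // Scripted RM: down for two heartbeats, then back up and asking for a reboot.
        final int[] calls = {0};
        ResourceManagerClient rm = () -> {
            calls[0]++;
            if (calls[0] <= 2) {
                throw new ConnectException("RM is down");
            }
            return new Response(true);
        };
        System.out.println(runAllocatorLoop(rm, 10));
    }
}
```

In the real code the equivalent check would sit where the response from makeRemoteRequest() is first examined, so that the exception propagates up to the heartbeat thread instead of being swallowed along with the connection errors.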