@Eli ... as Todd points out not all OOMs are unrecoverable ...
On the NN I'd rather see the critical threads all get uncaughtExceptionHandlers attached which abort the NN if they fail. So if an individual rpc handler OOMEs (eg by an invalid request making it try to allocate a 4G array or something) it won't take down the NN, whereas if the LeaseManager OOMEs it should.
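For reference, a minimal sketch of what the quoted proposal could look like: a critical thread gets an uncaught-exception handler that aborts the process, while ordinary handler threads are left alone. The names `CriticalThreads`, `newCriticalThread` and `haltJvm` are hypothetical and not taken from the NN code; the abort action is passed in so the behaviour can be exercised without actually killing a JVM.

```java
final class CriticalThreads {
    private CriticalThreads() {}

    /**
     * Creates a thread whose uncaught Throwable (including OutOfMemoryError)
     * triggers abortAction. Hypothetical helper, not an existing Hadoop API.
     */
    static Thread newCriticalThread(String name, Runnable body, Runnable abortAction) {
        Thread t = new Thread(body, name);
        t.setUncaughtExceptionHandler((thread, err) -> {
            // Do as little as possible here: the heap may already be exhausted.
            abortAction.run();
        });
        return t;
    }

    /** Abort action for production use: halt immediately, skipping shutdown hooks. */
    static void haltJvm() {
        Runtime.getRuntime().halt(1);
    }
}
```

A critical thread such as the one running the LeaseManager would then be created with something like `CriticalThreads.newCriticalThread("LeaseManager", monitor, CriticalThreads::haltJvm)`, where `monitor` stands in for whatever `Runnable` the thread executes.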
I don't think this is a good idea. In fact, I would say it is more important to shut down the NN when an RPC handler gets an OOME. Let's say an RPC handler has updated the in-memory namespace and was about to add the change to the editlog. The system really was running out of memory, and the handler got an OOME before the editlog could be written. If we do not shut down at this point, the in-memory namespace no longer matches the editlog, and we could end up with interesting data corruption issues.
Instead of trying to categorize which failures are safe and which are not, we should use the kill -9 option. In cases where the OOME is caused by the system trying to create a large object, we could add appropriate size/limit checks.
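A minimal sketch of such a size/limit check, assuming a request that carries a 4-byte length prefix followed by its payload. The class `BoundedRequestReader`, the method `readRequest` and the 64 MB cap are hypothetical illustrations, not the actual RPC code or an existing configuration value.

```java
import java.io.DataInputStream;
import java.io.IOException;

final class BoundedRequestReader {
    // Hypothetical default cap; a real limit would presumably be configurable.
    static final int MAX_REQUEST_BYTES = 64 * 1024 * 1024;

    private BoundedRequestReader() {}

    /**
     * Reads a length-prefixed payload, rejecting the request before any
     * allocation if the declared length is negative or exceeds maxBytes.
     */
    static byte[] readRequest(DataInputStream in, int maxBytes) throws IOException {
        int declared = in.readInt();
        if (declared < 0 || declared > maxBytes) {
            // Fail this one request instead of attempting a huge allocation.
            throw new IOException("Request length " + declared
                + " outside allowed range [0, " + maxBytes + "]");
        }
        byte[] buf = new byte[declared];
        in.readFully(buf);
        return buf;
    }
}
```

With a check like this, an invalid request that declares a multi-gigabyte length is rejected with an `IOException` on that call alone, so it never reaches the point of throwing an OOME.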