We recently encountered an issue on a large cluster (running 2.7.4) in which the NameNode killed itself because it was unable to communicate with the JournalNodes (JNs) via the Quorum Journal Manager (QJM). We discovered that this happened because the NameNode tried to send a huge batch of over 1 million transactions to the JNs in a single RPC.
The JournalNodes rejected the RPC because its size was well over the 64 MB default of ipc.maximum.data.length.
This was triggered by a huge number of files all hitting a hard lease timeout simultaneously, which caused the NameNode (NN) to force-close them all at once. This is a particularly nasty bug because the NN attempts to re-send the same huge RPC on restart: it loads an fsimage that still contains all of these open files, which again need to be force-closed.
To solve this, we propose modifying EditsDoubleBuffer to add a "hard limit" based on the value of ipc.maximum.data.length. When writeOp() or writeRaw() is called, it would first check the size of bufCurrent. If the size exceeds the hard limit, the writer blocks until the buffer is flipped and bufCurrent becomes bufReady. This provides some self-throttling that prevents the NameNode from killing itself in this way.
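
To make the proposal concrete, here is a minimal sketch of the blocking behavior. It is not the actual EditsDoubleBuffer code: BoundedDoubleBuffer, its constructor argument, currentSize(), and the use of intrinsic locks are all assumptions made for illustration. Only the names bufCurrent, bufReady, writeRaw(), setReadyToFlush(), and flushTo() mirror the real class, and how a hard limit would interact with FSEditLog's own locking is not addressed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical stand-in for EditsDoubleBuffer; NOT the actual Hadoop class. */
class BoundedDoubleBuffer {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();
  // Assumption: derived from ipc.maximum.data.length, with some headroom.
  private final int hardLimit;

  BoundedDoubleBuffer(int hardLimit) {
    this.hardLimit = hardLimit;
  }

  /** Appends serialized edits; blocks while bufCurrent has reached the hard limit. */
  synchronized void writeRaw(byte[] bytes) throws InterruptedException {
    while (bufCurrent.size() >= hardLimit) {
      wait(); // woken by setReadyToFlush() once the buffers are flipped
    }
    bufCurrent.write(bytes, 0, bytes.length);
  }

  /** Flips the buffers: the accumulated edits become bufReady. */
  synchronized void setReadyToFlush() {
    if (bufReady.size() != 0) {
      throw new IllegalStateException("previous batch has not been flushed");
    }
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
    notifyAll(); // release any writers blocked on the hard limit
  }

  /** Writes bufReady to out and clears it; returns the number of bytes written. */
  synchronized int flushTo(OutputStream out) throws IOException {
    int n = bufReady.size();
    bufReady.writeTo(out);
    bufReady.reset();
    return n;
  }

  /** Hypothetical helper for inspection; returns the size of bufCurrent. */
  synchronized int currentSize() {
    return bufCurrent.size();
  }
}
```

Because the check happens before the write, a batch can still overshoot the limit by up to one operation, so in this sketch the limit would need to sit somewhat below ipc.maximum.data.length rather than equal it.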