When the EditLogTailer thread calls rollEdits() on the active NN via RPC, it currently does so without a timeout. So, if the active NN has frozen (but not actually crashed), this call can hang forever. This can then potentially prevent the standby from becoming active.
This may actually be considered a side effect of HADOOP-6762 – if the RPC were interruptible, that would also fix the issue.
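
As a rough illustration of one way to bound the call, here is a minimal sketch that runs the blocking call on a separate thread and stops waiting after a deadline. It is not the actual EditLogTailer code: the class BoundedRollSketch, the helper callWithTimeout, and the thread name are hypothetical, and the Callable stands in for the real rollEdits() RPC.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedRollSketch {

    // Hypothetical helper: runs `call` on a worker thread and waits at most
    // timeoutMs for its result, so the calling thread is never blocked forever.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        // Daemon worker: if the underlying call ignores interrupts, the stuck
        // worker does not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "roll-edits-caller");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<T> future = executor.submit(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Best effort: this only interrupts the worker. A call that is
                // not interruptible stays blocked, but the caller moves on.
                future.cancel(true);
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a frozen active NN: the "RPC" blocks for a long time.
        Callable<Boolean> frozenRoll = () -> {
            Thread.sleep(60_000);
            return true;
        };
        try {
            callWithTimeout(frozenRoll, 200);
            System.out.println("roll succeeded");
        } catch (TimeoutException e) {
            System.out.println("roll timed out; tailer thread is free to continue");
        }
    }
}
```

With this shape the caller regains control after the timeout even when the underlying call cannot be interrupted, which is the case HADOOP-6762 describes. The cost is that a worker thread may remain blocked until the remote side responds or the connection drops.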