So what became of this error?
I am pretty sure that we have observed exactly this problem on one of our test clusters running the Cloudera 4.5 release (Hadoop 2.0.0-cdh4.5.0) in Quorum-based HA mode. As a test we intentionally destroyed one of the active NameNode's disks with the Linux dd command (yeah, it's ugly, but so is life). The poor thing got stuck in an IO operation while trying to close a file. The blocked thread held locks, which in turn blocked a lot of other threads (e.g. threads for incoming RPC calls). That had a fatal impact on the whole cluster, since everything stopped working at once: HBase, HDFS and all commands either came back with a timeout or simply hung forever. Unfortunately the health checks from the ZKFC seemed to work just fine, so the ZKFC did not detect a failure and hence did not trigger a failover.
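To illustrate the mechanism, here is a minimal, self-contained Java sketch (not actual NameNode code; `fsLock` and the thread roles are made up for illustration): one thread blocks forever while holding a shared lock, and any other thread needing that lock, like an RPC handler, piles up behind it in the BLOCKED state.

```java
import java.util.concurrent.CountDownLatch;

public class LockPileUp {
    // Stand-in for a global lock such as the NameNode's namesystem lock.
    private static final Object fsLock = new Object();

    static Thread.State demo() throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            synchronized (fsLock) {               // grab the lock ...
                lockHeld.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE); // ... then "hang in IO" forever
                } catch (InterruptedException ignored) { }
            }
        });
        writer.setDaemon(true);
        writer.start();
        lockHeld.await();                         // wait until the writer owns the lock

        Thread rpcHandler = new Thread(() -> {
            synchronized (fsLock) { }             // an incoming call needing the same lock
        });
        rpcHandler.setDaemon(true);
        rpcHandler.start();
        Thread.sleep(500);                        // give it time to park on the monitor
        return rpcHandler.getState();             // BLOCKED, as in our stack dumps
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rpc handler state: " + demo());
    }
}
```

Once one such handler is stuck, every further caller needing the lock joins the queue, which is why the whole cluster appeared to freeze at once.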
So we tried to stop it manually. After a kill -2 and then a kill -9 on the NameNode process, the ZKFC finally detected the error and tried to activate the standby NameNode on another machine. But this got stuck too. I have attached the pstack of this NameNode process as it tried to become active but never made it. As far as I can see it is not able to stop the EditLogTailerThread.
The root cause is probably that the formerly active NameNode was not really dead. After searching around for some time we found that it had left a zombie (defunct) process behind, which still held port 8020 open! You cannot kill such zombies in Linux without a reboot. So this is exactly the situation described here: the former NN was frozen but not really dead, and the standby could not go active.
Another sad part of the story is that even restarting this standby NameNode did not help. It became active, that's fine. But as long as that zombie was running and kept its port 8020 open, all clients got stuck: neither did HBase start properly, nor could we access HDFS with the dfs client commands. Only after we rebooted the former NN's machine did the cluster come up properly. But that is probably not part of this Jira. So working with interruptible RPC calls and using a timeout everywhere seems to be vital.
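To make that last point concrete, here is a small Java sketch using plain sockets (not Hadoop's RPC layer; the class and method names are hypothetical) of what "a timeout everywhere" buys you when the remote side holds the port open but never answers, exactly like the zombie on port 8020 did:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class RpcTimeoutSketch {
    // Returns true if the read timed out instead of hanging forever.
    static boolean readWithTimeout() throws IOException {
        // A server socket that completes the TCP handshake but never answers:
        // a rough stand-in for the frozen process still holding its port open.
        try (ServerSocket frozen = new ServerSocket(0);
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress("127.0.0.1", frozen.getLocalPort()),
                2000);                 // connect timeout (ms)
            client.setSoTimeout(500);  // read timeout (ms); without it, read() blocks forever
            try {
                client.getInputStream().read();
                return false;          // only reached if the "frozen" server replied
            } catch (SocketTimeoutException expected) {
                return true;           // the client gives up instead of hanging
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readWithTimeout());
    }
}
```

A client built like this would have failed fast against the zombie instead of hanging, which is the behavior we would have wished for from the dfs commands and HBase.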