Description
In an HDFS HA setup, if there is a temporary problem contacting the JournalNodes (e.g. a network interruption), the NameNode shuts down entirely. It should instead go into standby mode so that it stays online and can retry to achieve quorum later.
If both NameNodes shut themselves down like this, the entire cluster remains offline indefinitely, even after the temporary network outage is resolved, until an operator intervenes. The cluster could instead have repaired itself by re-contacting the JournalNodes and re-achieving quorum.
2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
        at java.lang.Thread.run(Thread.java:745)
2015-04-15 15:59:26,901 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
2015-04-15 15:59:26,904 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2015-04-15 15:59:27,001 INFO namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
************************************************************/
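To illustrate the behaviour requested in the description, here is a minimal, self-contained sketch of "drop to standby and retry" in place of the process exit seen in the log. It is not Hadoop code: the class `QuorumRetrySketch`, its `State` enum, the method `onQuorumLost` and all of its parameters are hypothetical names invented for this example, and the quorum check is passed in as a plain callback rather than calling any real QJM API.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; none of these names exist in Hadoop. */
class QuorumRetrySketch {
    public enum State { ACTIVE, STANDBY }

    private State state = State.ACTIVE;

    public State getState() {
        return state;
    }

    /**
     * Called when a flush to the journal quorum fails. Instead of exiting,
     * the node drops to STANDBY and probes for quorum with capped
     * exponential backoff.
     *
     * @param quorumReachable assumed callback that reports whether a quorum
     *                        of journal nodes currently responds
     * @return true if quorum came back within maxAttempts, false otherwise
     */
    public boolean onQuorumLost(BooleanSupplier quorumReachable,
                                int maxAttempts,
                                long initialDelayMs,
                                long maxDelayMs) {
        // Stop acting as the active node, but keep the process alive.
        state = State.STANDBY;
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (quorumReachable.getAsBoolean()) {
                // Quorum is back. Stay in STANDBY; whether this node becomes
                // active again is left to the normal failover mechanism.
                return true;
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up retrying.
                Thread.currentThread().interrupt();
                return false;
            }
            // Double the wait each round, up to the configured cap.
            delay = Math.min(delay * 2, maxDelayMs);
        }
        return false;
    }
}
```

In this sketch the node deliberately remains in standby after quorum returns, so that two NameNodes recovering at the same time do not both claim the active role. What should happen when `onQuorumLost` returns false (keep retrying indefinitely, or alert an operator) is left open.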
Hari Sekhon
http://www.linkedin.com/in/harisekhon