Hadoop HDFS / HDFS-8298

HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0, 2.7.3
    • Fix Version/s: None
    • Component/s: ha, namenode, qjm
    • Labels: None
    • Environment: multiple clients, HDP 2.2, HDP 2.5, CDH etc

    Description

      In an HDFS HA setup, if there is a temporary problem contacting the JournalNodes (e.g. a network interruption), the NameNode shuts down entirely. It should instead transition to standby so that it stays online and can retry to achieve quorum later.

      If both NameNodes shut themselves down this way, the entire cluster remains offline indefinitely until an operator intervenes, even after the temporary network outage is resolved. The cluster could instead have repaired itself by re-contacting the JournalNodes and re-achieving quorum.

      2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
      java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
              at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
              at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
              at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
              at java.lang.Thread.run(Thread.java:745)
      2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
      2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
      2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
      ************************************************************/
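
      For illustration only, the behaviour requested above could look roughly like the following Java sketch. This is not NameNode code and uses no Hadoop APIs: QuorumRetrySketch, Journal, Node, sync and awaitQuorum are hypothetical names, and the retry count and 30-second backoff cap are arbitrary assumptions. It only shows the idea of demoting to standby on quorum loss and retrying, rather than exiting the process.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class QuorumRetrySketch {
    enum State { ACTIVE, STANDBY }

    /** Hypothetical stand-in for a quorum journal flush; not a Hadoop API. */
    interface Journal {
        void flush() throws IOException;
    }

    static final class Node {
        private State state = State.ACTIVE;
        private final Journal journal;

        Node(Journal journal) { this.journal = journal; }

        State state() { return state; }

        /** On quorum loss, demote to standby instead of terminating. */
        void sync() {
            try {
                journal.flush();
            } catch (IOException e) {
                state = State.STANDBY;
            }
        }

        /**
         * Retry the journal with exponential backoff. Returns true once the
         * quorum responds again, false if attempts run out or the thread is
         * interrupted. The node stays in standby either way; promoting it
         * back to active is left to whatever performs failover.
         */
        boolean awaitQuorum(int maxAttempts, long initialBackoffMs) {
            long backoff = initialBackoffMs;
            for (int i = 0; i < maxAttempts; i++) {
                try {
                    journal.flush();
                    return true;
                } catch (IOException e) {
                    try {
                        Thread.sleep(backoff);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                    backoff = Math.min(backoff * 2, 30_000L); // assumed cap
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated outage: the first two flush attempts fail.
        AtomicInteger calls = new AtomicInteger();
        Journal flaky = () -> {
            if (calls.incrementAndGet() <= 2) {
                throw new IOException("no quorum");
            }
        };
        Node nn = new Node(flaky);
        nn.sync();
        System.out.println("state after failed sync: " + nn.state());
        System.out.println("quorum regained: " + nn.awaitQuorum(5, 10L));
    }
}
```

      In this sketch a failed flush never terminates the process, so once the network recovers the node is still running and can be made active again without operator intervention.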

      Hari Sekhon
      http://www.linkedin.com/in/harisekhon

            People

              Assignee: Unassigned
              Reporter: Hari Sekhon (harisekhon)
              Votes: 3
              Watchers: 30
