Hadoop HDFS / HDFS-8298

HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0, 2.7.3
    • Fix Version/s: None
    • Component/s: ha, namenode, qjm
    • Labels:
      None
    • Environment:

      multiple clients, HDP 2.2, HDP 2.5, CDH etc

      Description

      In an HDFS HA setup, if there is a temporary problem contacting the JournalNodes (e.g. a network interruption), the NameNode shuts down entirely. It should instead drop into standby mode so that it stays online and can retry to achieve quorum later.

      If both NameNodes shut themselves down like this, then even after the temporary network outage is resolved the entire cluster remains offline indefinitely until an operator intervenes, whereas it could have self-repaired by re-contacting the JournalNodes and re-achieving quorum.

      2015-04-15 15:59:26,900 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip>:8485, <ip>:8485, <ip>:8485], stream=QuorumOutputStream starting at txid 54270281))
      java.io.IOException: Interrupted waiting 20000ms for a quorum of nodes to respond.
              at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
              at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
              at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
              at java.lang.Thread.run(Thread.java:745)
      2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 54270281
      2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
      2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<ip>
      ************************************************************/
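
To make the requested behaviour concrete, here is a minimal sketch of "stay up and retry for quorum" instead of exiting. It is not NameNode code: the class `QuorumRetrySketch`, the methods `majority` and `waitForQuorum`, the probe supplier, and the backoff values are all hypothetical names and numbers chosen for illustration. The only thing taken from the report is that a write needs a quorum of the three JournalNodes, which the sketch treats as a simple majority.

```java
import java.util.function.IntSupplier;

public class QuorumRetrySketch {

    /** Simple majority of the configured journal nodes, e.g. 2 of 3. */
    static int majority(int journalNodes) {
        return journalNodes / 2 + 1;
    }

    /**
     * Hypothetical standby loop: rather than terminating the process when
     * quorum is lost, keep probing until a majority of journal nodes is
     * reachable again or the attempts run out.
     *
     * @param reachableNodes assumed probe returning how many journal nodes answered
     * @return true if quorum was regained, false otherwise
     */
    static boolean waitForQuorum(int journalNodes, IntSupplier reachableNodes,
                                 int maxAttempts, long initialBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (reachableNodes.getAsInt() >= majority(journalNodes)) {
                return true;
            }
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
            // Exponential backoff, capped at an arbitrary 30 seconds.
            backoffMs = Math.min(backoffMs * 2, 30_000L);
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated outage: no journal node answers twice, then 2 of 3 respond.
        int[] probes = {0, 0, 2};
        int[] next = {0};
        boolean regained = waitForQuorum(
                3, () -> probes[Math.min(next[0]++, probes.length - 1)], 5, 10L);
        System.out.println(regained
                ? "quorum regained; node could rejoin failover"
                : "quorum still unavailable; remaining in standby");
    }
}
```

In a real deployment the loop would presumably be unbounded or operator-configurable, and the node would have to refuse writes while waiting; the sketch only shows the control flow that would let a cluster recover once the network does.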

      Hari Sekhon
      http://www.linkedin.com/in/harisekhon

              People

              • Assignee:
                Unassigned
                Reporter:
                harisekhon Hari Sekhon
              • Votes:
                5
                Watchers:
                34

                Dates

                • Created:
                  Updated: