Ambari / AMBARI-10536

Ambari 2.0 HDP 2.2.4 => 2.2.0 stack rollback leaves one NameNode in inconsistent state, breaking HA and failover

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: ambari-server, stacks
    • Labels: None
    • Environment: HDP 2.2.0.0 <= rollback <= 2.2.4.0

    Description

      After a failed stack upgrade of HDP 2.2.0 => 2.2.4 (AMBARI-10519) and subsequent rollback, Ambari 2.0 leaves one of the HDFS HA NameNodes in an inconsistent state:

      2015-04-16 11:45:38,231 INFO  namenode.FSImage (FSEditLogLoader.java:loadFSEdits(138)) - Start loading edits file http://<custom_scrubbed>:8480/getJournal?jid=nameservice1&segmentTxId=54367965&storageInfo=-60%3A1459025177%3A1418910715375%3ACID-8055996a-b5ce-4b07-9b32-f2dbe9123edd, http://<custom_scrubbed>:8480/getJournal?jid=nameservice1&segmentTxId=54367965&storageInfo=-60%3A1459025177%3A1418910715375%3ACID-8055996a-b5ce-4b07-9b32-f2dbe9123edd
      2015-04-16 11:45:38,232 INFO  namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://<custom_scrubbed>:8480/getJournal?jid=nameservice1&segmentTxId=54367965&storageInfo=-60%3A1459025177%3A1418910715375%3ACID-8055996a-b5ce-4b07-9b32-f2dbe9123edd, http://<custom_scrubbed>:8480/getJournal?jid=nameservice1&segmentTxId=54367965&storageInfo=-60%3A1459025177%3A1418910715375%3ACID-8055996a-b5ce-4b07-9b32-f2dbe9123edd' to transaction ID 54367965
      2015-04-16 11:45:38,232 INFO  namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://<custom_scrubbed>:8480/getJournal?jid=nameservice1&segmentTxId=54367965&storageInfo=-60%3A1459025177%3A1418910715375%3ACID-8055996a-b5ce-4b07-9b32-f2dbe9123edd' to transaction ID 54367965
      2015-04-16 11:45:38,284 ERROR namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(238)) - Encountered exception on operation RollingUpgradeOp [START, time=1429181084342]
      org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data1/nn is in an inconsistent state: previous fs state should not exist during upgrade. Finalize or rollback first.
              at org.apache.hadoop.hdfs.server.namenode.FSImage.checkUpgrade(FSImage.java:348)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startRollingUpgradeInternal(FSNamesystem.java:8322)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:750)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:805)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:230)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:356)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1608)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
      2015-04-16 11:45:39,111 FATAL ha.EditLogTailer (EditLogTailer.java:doWork(331)) - Unknown error encountered while tailing edits. Shutting down standby NN.
      org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data1/nn is in an inconsistent state: previous fs state should not exist during upgrade. Finalize or rollback first.
              at org.apache.hadoop.hdfs.server.namenode.FSImage.checkUpgrade(FSImage.java:348)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startRollingUpgradeInternal(FSNamesystem.java:8322)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:750)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:805)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:230)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:356)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1608)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
      2015-04-16 11:45:39,114 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
      2015-04-16 11:45:39,115 INFO  namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NameNode at <custom_scrubbed>/<custom_scrubbed>
      ************************************************************/

      The NameNode shut down as a result. After a restart it still does not work properly: haadmin failover commands return similar exceptions complaining about this inconsistent state, which should be visible in the NameNode logs I've uploaded.
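
      For anyone triaging this: the exception text ("previous fs state should not exist during upgrade") suggests that the upgrade check rejects a storage directory that still contains a "previous" subdirectory. Below is a minimal sketch for checking that condition on a NameNode host. It assumes the usual HDFS name directory layout (previous/ alongside current/), and /data1/nn is the only path known from the log above. The helper name find_leftover_previous is hypothetical and not part of Hadoop or Ambari.

```python
import os
from typing import Iterable, List


def find_leftover_previous(name_dirs: Iterable[str]) -> List[str]:
    """Return the name directories that still contain a 'previous' subdirectory.

    Assumption: a leftover 'previous' directory is what triggers the
    InconsistentFSStateException shown in the log.
    """
    leftovers = []
    for name_dir in name_dirs:
        previous = os.path.join(name_dir, "previous")
        if os.path.isdir(previous):
            leftovers.append(name_dir)
    return leftovers


if __name__ == "__main__":
    # /data1/nn is the directory named in the exception; add any other
    # dfs.namenode.name.dir entries configured on the host.
    for name_dir in find_leftover_previous(["/data1/nn"]):
        print(f"{name_dir}: 'previous' directory present")
```

      This only reports the condition. It does not decide whether finalizing or rolling back is the right fix, and it should be run on each NameNode host separately.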

      Hari Sekhon
      http://www.linkedin.com/in/harisekhon

      Attachments

        1. remaining-namenode-nn2.log.bz2
          2.05 MB
          Hari Sekhon
        2. broken-namenode-nn1.log.bz2
          3.78 MB
          Hari Sekhon

          People

            Assignee: Unassigned
            Reporter: Hari Sekhon (harisekhon)