HDFS-3441 (Hadoop HDFS; sub-task of HDFS-3399, BookKeeper option support for NN HA)

Race condition between rolling logs at active NN and purging at standby

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      The standby NN fetches the ledger list, which contains all edit log ledgers including the in-progress one (say inprogress_val1).
      The active NN then finalizes that ledger and creates a new in-progress ledger.
      When the standby goes on to read the metadata of each entry in its list, the in-progress znode it listed no longer exists, so the read fails and the NN shuts down.
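
The interleaving above can be reproduced without a cluster. The following is a minimal sketch, not the code from the attached patches: `FakeStore` is a hypothetical in-memory stand-in for the znodes under /nnedits/ledgers, the `edits_...` node names are illustrative, and `NoSuchElementException` stands in for ZooKeeper's NoNode error. It contrasts a strict reader, which fails as in the log below, with a tolerant reader that skips entries removed between the list step and the read step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListMap;

public class LedgerListRace {

    /** Hypothetical in-memory stand-in for the znodes under /nnedits/ledgers. */
    public static final class FakeStore {
        private final ConcurrentSkipListMap<String, String> nodes =
                new ConcurrentSkipListMap<>();

        public void put(String name, String data) { nodes.put(name, data); }

        public void delete(String name) { nodes.remove(name); }

        /** Returns a snapshot of the child names at the time of the call. */
        public List<String> getChildren() { return new ArrayList<>(nodes.keySet()); }

        /** Throws if the node is gone, mimicking the NoNode error in the log. */
        public String getData(String name) {
            String data = nodes.get(name);
            if (data == null) {
                throw new NoSuchElementException("NoNode for " + name);
            }
            return data;
        }
    }

    /** Strict read: one vanished node aborts the whole operation. */
    public static List<String> readStrict(FakeStore store, List<String> listed) {
        List<String> out = new ArrayList<>();
        for (String name : listed) {
            out.add(store.getData(name));
        }
        return out;
    }

    /** Tolerant read: nodes removed after the listing was taken are skipped. */
    public static List<String> readTolerant(FakeStore store, List<String> listed) {
        List<String> out = new ArrayList<>();
        for (String name : listed) {
            try {
                out.add(store.getData(name));
            } catch (NoSuchElementException e) {
                // The node was finalized or removed concurrently; skip it.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.put("edits_1_71", "finalized");   // illustrative name
        store.put("inprogress_72", "open");

        // Step 1: the standby takes its listing.
        List<String> listed = store.getChildren();

        // Step 2: the active rolls the log before the standby reads anything.
        store.delete("inprogress_72");
        store.put("edits_72_73", "finalized");  // illustrative name
        store.put("inprogress_74", "open");

        // Step 3: the standby reads each listed entry.
        try {
            readStrict(store, listed);
            System.out.println("strict read succeeded");
        } catch (NoSuchElementException e) {
            System.out.println("strict read failed: " + e.getMessage());
        }
        System.out.println("tolerant read returned "
                + readTolerant(store, listed).size() + " entries");
    }
}
```

Skipping a vanished entry is only reasonable when the caller can live with a slightly stale view. For purging this seems acceptable, since a ledger that was just finalized would be picked up by the next purge; a caller that needs a complete list would have to re-list and retry instead.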

      NN Logs
      =========
      2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds.
      2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /xx.xx.xx.102:8020
      2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 111
      2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_0000000000000000109, cpktTxId=0000000000000000109)
      2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, stream=null))
      java.io.IOException: Exception reading ledger list from zk
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
      at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
      at org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
      ... 16 more
      2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

      ZK Data
      ================

      [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
      -40;59;116
      cZxid = 0x2be
      ctime = Thu May 17 22:15:03 IST 2012
      mZxid = 0x2be
      mtime = Thu May 17 22:15:03 IST 2012
      pZxid = 0x2be
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 10
      numChildren = 0

      Attachments

        1. HDFS-3441.1.patch
          6 kB
          Rakesh Radhakrishnan
        2. HDFS-3441.2.patch
          8 kB
          Rakesh Radhakrishnan
        3. HDFS-3441.3.patch
          8 kB
          Uma Maheswara Rao G
        4. HDFS-3441.3.patch
          8 kB
          Rakesh Radhakrishnan
        5. HDFS-3441.patch
          4 kB
          Rakesh Radhakrishnan

          People

            Assignee: Rakesh Radhakrishnan (rakeshr)
            Reporter: suja s (suja)