Hadoop HDFS / HDFS-3399 BookKeeper option support for NN HA / HDFS-3441

Race condition between rolling logs at active NN and purging at standby

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: 2.0.2-alpha
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The standby NN obtains the ledger list, which contains all entries including the in-progress one (say inprogress_val1).
      The active NN then finalizes that ledger and creates a new in-progress entry.
      When the standby proceeds, it finds that the in-progress entry from its list no longer exists, and the NN shuts down.
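
      The interleaving described above can be replayed without a cluster. The sketch below is a self-contained simulation, not the attached patch: an in-memory map stands in for the znodes under /nnedits/ledgers, `NoNodeException` is a hypothetical stand-in for ZooKeeper's `KeeperException.NoNodeException`, and the names `edits_1_71`, `edits_72_73` and the `ledger-N` values are made up for illustration. It shows one possible way to tolerate the race, assuming it is safe to skip an entry that disappeared between the listing and the read because the writer finalized it in the meantime.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.Map;
      import java.util.TreeMap;

      public class LedgerListRace {

          /** Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException. */
          static class NoNodeException extends Exception {
              NoNodeException(String path) {
                  super("NoNode for " + path);
              }
          }

          /** Reads one entry from the in-memory store; fails if the entry is gone. */
          static String getData(Map<String, String> znodes, String name) throws NoNodeException {
              String data = znodes.get(name);
              if (data == null) {
                  throw new NoNodeException(name);
              }
              return data;
          }

          /**
           * Reads the data for every name in a previously taken child listing.
           * A name that vanished since the listing is skipped instead of
           * aborting the whole read.
           */
          static List<String> readLedgerList(Map<String, String> znodes, List<String> listedNames) {
              List<String> ledgers = new ArrayList<>();
              for (String name : listedNames) {
                  try {
                      ledgers.add(name + " -> " + getData(znodes, name));
                  } catch (NoNodeException e) {
                      // Assumption: the writer finalized this entry after it was listed.
                  }
              }
              return ledgers;
          }

          /** Replays the interleaving from the description and returns what the standby reads. */
          static List<String> simulate() {
              Map<String, String> znodes = new TreeMap<>();
              znodes.put("edits_1_71", "ledger-1");      // hypothetical finalized entry
              znodes.put("inprogress_72", "ledger-2");

              // Standby, step 1: list the children.
              List<String> listed = new ArrayList<>(znodes.keySet());

              // Active: finalize the in-progress entry and start a new one.
              znodes.remove("inprogress_72");
              znodes.put("edits_72_73", "ledger-2");     // hypothetical finalized name
              znodes.put("inprogress_74", "ledger-3");

              // Standby, step 2: read the data for the names listed earlier.
              return readLedgerList(znodes, listed);
          }

          public static void main(String[] args) {
              System.out.println(simulate());
          }
      }
      ```

      Without the catch block, the second step fails on inprogress_72, which matches the NoNodeException in the stack trace below. Whether skipping is the right reaction depends on the caller; a purge can likely ignore the vanished entry, while other callers may prefer to re-list and retry.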

      NN Logs
      =========
      2012-05-17 22:15:03,867 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 201 saved in 0 seconds.
      2012-05-17 22:15:03,874 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode /xx.xx.xx.102:8020
      2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 111
      2012-05-17 22:15:03,923 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/May8/hadoop-3.0.0-SNAPSHOT/hadoop-root/dfs/name/current/fsimage_0000000000000000109, cpktTxId=0000000000000000109)
      2012-05-17 22:15:03,961 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: purgeLogsOlderThan 0 failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@142e6767, stream=null))
      java.io.IOException: Exception reading ledger list from zk
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:531)
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.purgeLogsOlderThan(BookKeeperJournalManager.java:444)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:541)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
      at org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:538)
      at org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1011)
      at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:98)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:900)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:885)
      at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:822)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:157)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$900(StandbyCheckpointer.java:52)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:279)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:200)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:220)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:512)
      at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:216)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /nnedits/ledgers/inprogress_72
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1113)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1142)
      at org.apache.hadoop.contrib.bkjournal.EditLogLedgerMetadata.read(EditLogLedgerMetadata.java:113)
      at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.getLedgerList(BookKeeperJournalManager.java:528)
      ... 16 more
      2012-05-17 22:15:03,963 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

      ZK Data
      ================

      [zk: xx.xx.xx.55:2182(CONNECTED) 9] get /nnedits/ledgers/inprogress_74
      -40;59;116
      cZxid = 0x2be
      ctime = Thu May 17 22:15:03 IST 2012
      mZxid = 0x2be
      mtime = Thu May 17 22:15:03 IST 2012
      pZxid = 0x2be
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 10
      numChildren = 0

      Attachments

      1. HDFS-3441.3.patch (8 kB, Uma Maheswara Rao G)
      2. HDFS-3441.3.patch (8 kB, Rakesh R)
      3. HDFS-3441.2.patch (8 kB, Rakesh R)
      4. HDFS-3441.1.patch (6 kB, Rakesh R)
      5. HDFS-3441.patch (4 kB, Rakesh R)

        Activity

        suja s created issue -
        suja s made changes -
        Field Original Value New Value
        Description (both values repeat the Description text and logs above; the new value masks the remote NameNode address in the EditLogTailer log line)
        Rakesh R made changes -
        Attachment HDFS-3441.patch [ 12528084 ]
        Uma Maheswara Rao G made changes -
        Assignee Rakesh R [ rakeshr ]
        Rakesh R made changes -
        Attachment HDFS-3441.1.patch [ 12530189 ]
        Rakesh R made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 2.0.0-alpha [ 12320353 ]
        Fix Version/s 3.0.0 [ 12320356 ]
        Fix Version/s 2.0.0-alpha [ 12320353 ]
        Uma Maheswara Rao G made changes -
        Fix Version/s 2.0.0-alpha [ 12320353 ]
        Fix Version/s 3.0.0 [ 12320356 ]
        Target Version/s 2.0.1-alpha, 3.0.0 [ 12321440, 12320356 ]
        Rakesh R made changes -
        Attachment HDFS-3441.2.patch [ 12530361 ]
        Rakesh R made changes -
        Attachment HDFS-3441.3.patch [ 12530389 ]
        Uma Maheswara Rao G made changes -
        Attachment HDFS-3441.3.patch [ 12530441 ]
        Uma Maheswara Rao G made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Fix Version/s 2.0.1-alpha [ 12321440 ]
        Fix Version/s 3.0.0 [ 12320356 ]
        Resolution Fixed [ 1 ]
        Arun C Murthy made changes -
        Fix Version/s 2.0.2-alpha [ 12322472 ]
        Fix Version/s 2.1.0-alpha [ 12321440 ]
        Arun C Murthy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Allen Wittenauer made changes -
        Fix Version/s 3.0.0 [ 12320356 ]

          People

          • Assignee: Rakesh R
          • Reporter: suja s
          • Votes: 0
          • Watchers: 10
