  1. Hadoop HDFS
  2. HDFS-15746

Standby NameNode crashes when replaying edit log

Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • None
    • None
    • namenode
    • None

    Description

      The Standby NameNode (SBN) hit an NPE and crashed while replaying the edit log. After digging through the logs and source code, I have not found the root cause, but the following information may be useful for this case.
      a. Before the SBN crashed, the Active NameNode (ANN) performed one lease recovery.

      2020-12-23 12:37:45,946 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: $PATH has not been closed. Lease recovery is in progress. RecoveryId = 21696709510 for block blk_*_21658833701
      

      b. Then, after the lease recovery, a volume failed on one DataNode that held a replica of blk_*_21658833701.
      c. About half an hour later, the SBN crashed with an NPE, as follows.

      2020-12-23 13:13:36,703 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=$PATH, replication=3, mtime=1608698268201, atime=1608343529481, blockSize=268435456, blocks=[blk_$i_$j], permissions=user:group:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=$txid]
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:360)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
      2020-12-23 13:13:36,703 ERROR org.apache.hadoop.ipc.Server: Error in Reader
      java.nio.channels.ClosedChannelException
              at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197)
              at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:1053)
              at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:1034)
      2020-12-23 13:13:36,703 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.16.39.26:50010 is added to blk_22374572883_21672067156 size 58762255
      2020-12-23 13:13:36,704 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
      java.io.IOException: java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:254)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:152)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:843)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:360)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1706)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:428)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
      Caused by: java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.setGenerationStampAndVerifyReplicas(BlockInfo.java:455)
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.commitBlock(BlockInfo.java:476)
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1248)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1065)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:244)
              ... 12 more
      

      The relationship between the lease recovery, the volume failure, and the SBN crash is not very clear to me, but I think we should check for null when removing stale replicas to avoid this fatal error.
      Our production version is 2.*, and IMO this issue also exists on trunk.
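
      To illustrate the kind of null check suggested above, here is a minimal, self-contained sketch. It is not the attached patch and does not use Hadoop classes: StaleReplicaSketch, Storage, Replica, and removeStaleReplicas are hypothetical stand-ins. The stack trace only shows that the NPE is thrown inside BlockInfo.setGenerationStampAndVerifyReplicas; which reference is actually null there is an assumption in this sketch (a stale replica whose storage reference is missing, for example after a volume failure).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT Hadoop classes.
final class StaleReplicaSketch {

    /** Hypothetical storage that remembers which block IDs it holds. */
    static final class Storage {
        final Set<Long> blocks = new HashSet<>();

        void removeBlock(long blockId) {
            blocks.remove(blockId);
        }
    }

    /** Hypothetical replica record; storage may be null (assumption). */
    static final class Replica {
        final long generationStamp;
        final Storage storage;

        Replica(long generationStamp, Storage storage) {
            this.generationStamp = generationStamp;
            this.storage = storage;
        }
    }

    /**
     * Removes the block from the storage of every replica whose generation
     * stamp differs from genStamp. Null replica lists, null replicas, and
     * null storages are skipped instead of being dereferenced.
     *
     * @return the number of stale replicas actually removed
     */
    static int removeStaleReplicas(long blockId, long genStamp, List<Replica> replicas) {
        if (replicas == null) {
            return 0;
        }
        int removed = 0;
        for (Replica r : replicas) {
            if (r == null || r.generationStamp == genStamp) {
                continue; // nothing to do for missing or up-to-date replicas
            }
            Storage s = r.storage;
            if (s == null) {
                // Without this guard, s.removeBlock(...) would throw an NPE.
                continue;
            }
            s.removeBlock(blockId);
            removed++;
        }
        return removed;
    }

    public static void main(String[] args) {
        Storage healthy = new Storage();
        healthy.blocks.add(1L);
        List<Replica> replicas = List.of(
            new Replica(5L, healthy), // stale, storage known
            new Replica(5L, null),    // stale, storage missing
            new Replica(7L, null));   // current generation stamp
        System.out.println("removed=" + removeStaleReplicas(1L, 7L, replicas));
    }
}
```

      Skipping a replica with a missing storage keeps the edit log replay going, but it may also hide the underlying inconsistency, so a real fix would probably want to log a warning at that point.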

      Attachments

        1. HDFS-15746.001.patch
          1 kB
          Xiaoqiao He

          People

            hexiaoqiao Xiaoqiao He
            hexiaoqiao Xiaoqiao He
            Votes: 0
            Watchers: 7
