HDFS-10729: Improve log message for edit loading failures caused by FS limit checks.


Details

    • Reviewed

    Description

      We encountered a bug where the Standby NameNode crashes due to an NPE when loading edits.

      2016-08-05 15:06:00,983 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, inodeId=789272719, path=[path], replication=3, mtime=1470379597935, atime=1470379597935, blockSize=134217728, blocks=[], permissions=<user>:supergroup:rw-r--r--, aclEntries=null, clientName=DFSClient_NONMAPREDUCE_1495395702_1, clientMachine=10.210.119.136, overwrite=true, RpcClientId=a1512eeb-65e4-43dc-8aa8-d7a1af37ed30, RpcCallId=417, storagePolicyId=0, opCode=OP_ADD, txid=4212503758]
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFileEncryptionInfo(FSDirectory.java:2914)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.createFileStatus(FSDirectory.java:2469)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:375)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:360)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
      

      The NameNode crashes and cannot be restarted. After some research, we turned on the debug log of org.apache.hadoop.hdfs.StateChange (a log4j sketch follows the trace below), restarted the NN, and saw the following exception, which induced the NPE:

      16/08/07 18:51:15 DEBUG hdfs.StateChange: DIR* FSDirectory.unprotectedAddFile: exception when add [path] to the file system
      org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException: The directory item limit of [path] is exceeded: limit=1048576 items=1049332
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2060)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2112)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:2081)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1900)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:368)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:365)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:445)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:426)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1205)
              at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1762)
              at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
              at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
              at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1635)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1351)
              at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
              at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
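
      For anyone reproducing this, the StateChange debug log mentioned above can be enabled via log4j; a sketch, assuming a standard log4j.properties setup:

        # log4j.properties: surface the debug-level StateChange message that reveals the root cause
        log4j.logger.org.apache.hadoop.hdfs.StateChange=DEBUG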
      

      The exception is thrown, caught, and logged at debug level in FSDirectory#unprotectedAddFile. Afterwards, null is returned, and the caller's use of that null result is what causes the NPE. HDFS-7567 reported a similar bug, but there it was deemed that the NPE could only occur if the edits were corrupt. Here, however, we see an example where the NPE is thrown without corrupt edits.
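
      Below is a minimal, self-contained sketch of that failure pattern, for illustration only; the class and member names are placeholders, not the actual Hadoop source:

        // Sketch of the pattern: a helper swallows the FS-limit exception,
        // logs it only at debug level, and returns null; the caller then
        // dereferences the null result and dies with an unexplained NPE.
        import java.io.IOException;

        public class NullReturnSketch {
          // Stands in for FSLimitException$MaxDirectoryItemsExceededException.
          static class MaxDirectoryItemsExceededException extends IOException {
            MaxDirectoryItemsExceededException(String msg) { super(msg); }
          }

          // Stands in for FSDirectory#unprotectedAddFile.
          static String unprotectedAddFile(String path, int items, int limit) {
            try {
              if (items >= limit) {
                throw new MaxDirectoryItemsExceededException(
                    "The directory item limit of " + path
                    + " is exceeded: limit=" + limit + " items=" + items);
              }
              return path; // stands in for the newly added INode
            } catch (IOException e) {
              // Logged at debug level only, so at the default INFO level the
              // root cause is invisible; null is silently returned instead.
              return null;
            }
          }

          public static void main(String[] args) {
            // Simulates applying an OP_ADD edit against an over-full directory.
            String inode = unprotectedAddFile("/some/dir/file", 1049332, 1048576);
            // The caller uses the result without a null check, so the NPE
            // surfaces here, far from where the real failure happened.
            System.out.println(inode.length());
          }
        }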

      It looks like the maximum number of items per directory was exceeded. This is similar to HDFS-6102 and HDFS-7482, but here it happens when loading edits rather than when loading the fsimage.

      A possible workaround is to increase the value of dfs.namenode.fs-limits.max-directory-items to 6400000, but I am not sure whether it would cause any side effects.
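
      For reference, a sketch of that workaround in hdfs-site.xml (6400000 is the maximum value this property accepts; whether raising it is safe for a given workload is the open question above):

        <property>
          <name>dfs.namenode.fs-limits.max-directory-items</name>
          <value>6400000</value>
        </property>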

      Attachments

        1. HDFS-10729.001.patch (2 kB, Wei-Chiu Chuang)


            People

              Assignee: Wei-Chiu Chuang
              Reporter: Wei-Chiu Chuang
              Votes: 0
              Watchers: 11
