Hadoop HDFS / HDFS-10729

Improve log message for edit loading failures caused by FS limit checks.

    Details

    • Hadoop Flags:
      Reviewed

      Description

      We encountered a bug where the Standby NameNode crashes due to an NPE when loading edits.

      2016-08-05 15:06:00,983 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation AddOp [length=0, inodeId=789272719, path=[path], replication=3, mtime=1470379597935, atime=1470379597935, blockSize=134217728, blocks=[], permissions=<user>:supergroup:rw-r--r--, aclEntries=null, clientName=DFSClient_NONMAPREDUCE_1495395702_1, clientMachine=10.210.119.136, overwrite=true, RpcClientId=a1512eeb-65e4-43dc-8aa8-d7a1af37ed30, RpcCallId=417, storagePolicyId=0, opCode=OP_ADD, txid=4212503758]
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFileEncryptionInfo(FSDirectory.java:2914)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.createFileStatus(FSDirectory.java:2469)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:375)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:360)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1651)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:410)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
      

      The NameNode crashes and cannot be restarted. After some research, we turned on the debug log of org.apache.hadoop.hdfs.StateChange, restarted the NN, and saw the following exception, which induced the NPE:

      16/08/07 18:51:15 DEBUG hdfs.StateChange: DIR* FSDirectory.unprotectedAddFile: exception when add [path] to the file system
      org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException: The directory item limit of [path] is exceeded: limit=1048576 items=1049332
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2060)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2112)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addLastINode(FSDirectory.java:2081)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addINode(FSDirectory.java:1900)
              at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:368)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:365)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:810)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:445)
              at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:426)
              at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.catchupDuringFailover(EditLogTailer.java:182)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1205)
              at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1762)
              at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
              at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
              at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1635)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1351)
              at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
              at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:415)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
      

      The exception is thrown, caught, and logged at debug level in FSDirectory#unprotectedAddFile. Afterwards, null is returned; that null is then used by the caller, causing the NPE. HDFS-7567 reported a similar bug, but it was deemed that the NPE could only be thrown if an edit is corrupt. However, here we see an example where the NPE can be thrown without corrupt edits.
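
      To make the failure mode concrete, here is a minimal, self-contained Java illustration of the pattern (hypothetical class and method names, not the actual Hadoop source): an exception is caught and logged at a level that is typically disabled, null is returned to signal failure, and the caller dereferences it.

      import java.io.IOException;
      import java.util.logging.Level;
      import java.util.logging.Logger;

      public class SwallowedExceptionDemo {
        private static final Logger LOG = Logger.getLogger("StateChange");

        // Mirrors the shape of FSDirectory#unprotectedAddFile: the FS-limit
        // violation is logged at a fine-grained level and null is returned.
        static String addFile(String path) {
          try {
            throw new IOException("The directory item limit of " + path + " is exceeded");
          } catch (IOException e) {
            LOG.log(Level.FINE, "exception when adding " + path, e); // invisible unless debug logging is on
            return null; // silent failure signal
          }
        }

        public static void main(String[] args) {
          String result = addFile("/some/dir/file");
          System.out.println(result.length()); // NullPointerException; the root cause was never surfaced
        }
      }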

      It looks like the maximum number of items per directory was exceeded. This is similar to HDFS-6102 and HDFS-7482, but here it happens while loading edits, rather than while loading the fsimage.

      A possible workaround is to increase the value of dfs.namenode.fs-limits.max-directory-items to 6400000, but I am not sure whether it would cause any side effects.
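
      For reference, a sketch of the workaround as it would appear in hdfs-site.xml on both NameNodes (6400000 is, as far as I know, the largest value HDFS accepts for this property):

      <property>
        <name>dfs.namenode.fs-limits.max-directory-items</name>
        <value>6400000</value>
      </property>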

    Attachments

      1. HDFS-10729.001.patch
        2 kB
        Wei-Chiu Chuang


          Activity

          jojochuang Wei-Chiu Chuang added a comment -

          Thanks Yongjun Zhang for suggesting the log message improvement, and thanks Kihwal Lee for reviewing and committing the patch!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10379 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10379/)
          HDFS-10729. Improve log message for edit loading failures caused by FS (kihwal: rev 01721dd88ee532d20eda841254437da4dfd69db5)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirWriteFileOp.java
          kihwal Kihwal Lee added a comment -

          Wei-Chiu Chuang, thanks for the patch. I've committed this to trunk, branch-2 and branch-2.8.

          kihwal Kihwal Lee added a comment -

          +1 the patch looks good.

          kihwal Kihwal Lee added a comment -

          They pass for me too.

          -------------------------------------------------------
           T E S T S
          -------------------------------------------------------
          Running org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
          Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 31.175 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
          Running org.apache.hadoop.hdfs.TestLeaseRecovery2
          Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 84.721 sec - in org.apache.hadoop.hdfs.TestLeaseRecovery2
          Running org.apache.hadoop.hdfs.TestFileCorruption
          Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.311 sec - in org.apache.hadoop.hdfs.TestFileCorruption
          Running org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
          Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.407 sec - in org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
          
          Results :
          
          Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
          
          jojochuang Wei-Chiu Chuang added a comment -

          All failed tests passed locally. Kihwal Lee, would you like to take another look? Thanks.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 7m 16s trunk passed
          +1 compile 0m 47s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 0m 54s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 56s trunk passed
          +1 javadoc 1m 1s trunk passed
          +1 mvninstall 1m 3s the patch passed
          +1 compile 1m 0s the patch passed
          +1 javac 1m 0s the patch passed
          +1 checkstyle 0m 27s the patch passed
          +1 mvnsite 1m 5s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 59s the patch passed
          +1 javadoc 0m 57s the patch passed
          -1 unit 89m 0s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          110m 21s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
            hadoop.hdfs.security.TestDelegationTokenForProxyUser
            hadoop.hdfs.TestFileCorruption
          Timed out junit tests org.apache.hadoop.hdfs.TestLeaseRecovery2



          Subsystem Report/Notes
          Docker Image: yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12823767/HDFS-10729.001.patch
          JIRA Issue HDFS-10729
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b09705f97c13 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ffe1fff
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/16439/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/16439/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/16439/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

          I just hit the submit patch button.

          jojochuang Wei-Chiu Chuang added a comment -

          Attached the v01 patch. It changes the log level from debug to warn: whenever this exception is thrown in FSDirWriteFileOp#addFileForEditLog, something really bad is happening and the NameNode will eventually abort, so it makes sense to me to log it at WARN. Additionally, the patch adds a hint to the log message naming the property that can be used to work around the error.
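
          A minimal, self-contained sketch of the behavior described (hypothetical class name and message text; not the committed diff):

          import java.util.logging.Logger;

          public class EditLoadWarnSketch {
            private static final Logger LOG = Logger.getLogger("StateChange");

            // Log the previously swallowed exception at WARN and name the
            // property an operator can raise to work around the FS limit.
            static void logAddFileFailure(String path, Exception e) {
              LOG.warning("DIR* FSDirectory.unprotectedAddFile: exception when adding "
                  + path + ": " + e.getMessage()
                  + ". Consider raising dfs.namenode.fs-limits.max-directory-items.");
            }

            public static void main(String[] args) {
              logAddFileFailure("/some/dir/file",
                  new IllegalStateException("The directory item limit of /some/dir is exceeded"));
            }
          }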

          jojochuang Wei-Chiu Chuang added a comment - - edited

          Hi Kihwal, thanks for the comment.
          The SbNN crashed when it started and loaded edits. There was a point in time when dfs.namenode.fs-limits.max-directory-items was increased, which explains why the number of items exceeds 1048576 by so much; but maybe it was only changed on the ANN and not the SbNN, or maybe it was increased and then decreased.

          I saw a tremendous number of rename operations before the MaxDirectoryItemsExceededException, but the version the cluster is running has the HDFS-6099 fix.

          At this point it looks more like an operational error, although I think the NPE handling should be improved. I don't know whether NameNodes should terminate when they see the max directory items limit violated, but at least they should terminate more gracefully.
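
          A quick way to compare the effective limit on each NameNode host (the command reads the local configuration; the output shown is illustrative):

          $ hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items
          1048576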

          kihwal Kihwal Lee added a comment -

          Do you know how it got into this state? If the serving namenode had the same config, this op should have failed and wouldn't have been logged in the first place. Do you suspect a bug in enforcement similar to HDFS-6099?

          yzhangal Yongjun Zhang added a comment -

          Hi Wei-Chiu Chuang,

          Thanks for reporting the issue and finding the cause. It sounds to me like a simple supportability fix would be to make the message clear about why the NPE occurs (exceeding the maximum number of files within a directory), and to state that the corresponding config param can be increased to get past it.

          Thanks.


            People

            • Assignee:
              jojochuang Wei-Chiu Chuang
              Reporter:
              jojochuang Wei-Chiu Chuang
            • Votes:
              0
              Watchers:
              11
