Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-11225

NameNode crashed because deleteSnapshot held FSNamesystem lock too long

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.4.0
    • 3.1.0
    • namenode
    • CDH5.8.2, HA

    Description

      The deleteSnapshot operation is synchronous. In certain situations this operation may hold FSNamesystem lock for too long, bringing almost every NameNode operation to a halt.

      We have observed one incidence where it took so long that ZKFC believes the NameNode is down. All other IPC threads were waiting to acquire FSNamesystem lock. This specific deleteSnapshot took ~70 seconds. ZKFC has connection timeout of 45 seconds by default, and if all IPC threads wait for FSNamesystem lock and can't accept new incoming connection, ZKFC times out, advances epoch and NameNode will therefore lose its active NN role and then fail.

      Relevant log:

      Thread 154 (IPC Server handler 86 on 8020):
        State: RUNNABLE
        Blocked count: 2753455
        Waited count: 89201773
        Stack:
          org.apache.hadoop.hdfs.server.namenode.INode$BlocksMapUpdateInfo.addDeleteBlock(INode.java:879)
          org.apache.hadoop.hdfs.server.namenode.INodeFile.destroyAndCollectBlocks(INodeFile.java:508)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.destroyAndCollectBlocks(INodeDirectory.java:763)
          org.apache.hadoop.hdfs.server.namenode.INodeReference.destroyAndCollectBlocks(INodeReference.java:339)
          org.apache.hadoop.hdfs.server.namenode.INodeReference$WithName.destroyAndCollectBlocks(INodeReference.java:606)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.destroyDeletedList(DirectoryWithSnapshotFeature.java:119)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$ChildrenDiff.access$400(DirectoryWithSnapshotFeature.java:61)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:319)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature$DirectoryDiff.destroyDiffAndCollectBlocks(DirectoryWithSnapshotFeature.java:167)
          org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.deleteSnapshotDiff(AbstractINodeDiffList.java:83)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:745)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
          org.apache.hadoop.hdfs.server.namenode.snapshot.DirectoryWithSnapshotFeature.cleanDirectory(DirectoryWithSnapshotFeature.java:747)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:776)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtreeRecursively(INodeDirectory.java:747)
          org.apache.hadoop.hdfs.server.namenode.INodeDirectory.cleanSubtree(INodeDirectory.java:789)
      

      After the ZKFC determined NameNode was down and advanced epoch, the NN finished deleting snapshot, and sent the edit to journal nodes, but it was rejected because epoch was updated. See the following stacktrace:

      10.0.16.21:8485: IPC's epoch 17 is less than the last promised epoch 18
              at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
              at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
              at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)
              at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
              at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
      
              at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
              at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:223)
              at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
              at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
              at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
              at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:8507)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1469)
              at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.deleteSnapshot(AuthorizationProviderProxyClientProtocol.java:717)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.deleteSnapshot(ClientNamenodeProtocolServerSideTranslatorPB.java:1061)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
              at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
              at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
      
      

      Finally NameNode shut itself down because it had too many quorum errors.
      Setting priority to critical because it resulted in NameNode crash.

      We think deleteSnapshot should be made asynchronous. It should delete the root of snapshot directory, and then put the rest of work into an asynchronous thread. Credit: yzhangal for proposing this idea.

      Attachments

        1. Snaphot_Deletion_Design_Proposal.pdf
          84 kB
          Shashikant Banerjee

        Issue Links

          Activity

            People

              shashikant Shashikant Banerjee
              weichiu Wei-Chiu Chuang
              Votes:
              2 Vote for this issue
              Watchers:
              31 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: