Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-12369

Edit log corruption due to hard lease recovery of not-closed file which has snapshots

VotersStop watchingWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.9.0, 3.0.0-beta1, 2.8.3
    • namenode
    • None

    Description

      HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.

      Recently, we have observed NN not able to start with the following exception:

      2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
      java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
      

      Quoting a nicely analysed edits:

      In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:

      OPEN
      ADD_BLOCK
      CLOSE
      ADD_BLOCK # perhaps this was an append
      DELETE
      (about 1 hour later) CLOSE

      It is interesting that there was no CLOSE logged before the delete.

      Grepping that file name, it turns out the close was triggered by LeaseManager, when the lease reaches hard limit.

      2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
        Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], 
        src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
      2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
        internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
        /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
      

      Attachments

        1. HDFS-12369.test.patch
          6 kB
          Xiao Chen
        2. HDFS-12369.03.patch
          7 kB
          Xiao Chen
        3. HDFS-12369.02.patch
          7 kB
          Xiao Chen
        4. HDFS-12369.01.patch
          7 kB
          Xiao Chen

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            xiaochen Xiao Chen
            xiaochen Xiao Chen
            Votes:
            0 Vote for this issue
            Watchers:
            9 Stop watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment