Hadoop HDFS / HDFS-12369

Edit log corruption due to hard lease recovery of not-closed file which has snapshots

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.8.3
    • Component/s: namenode
    • Labels: None

      Description

      HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.

      Recently, we have observed NN not able to start with the following exception:

      2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
      java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
      

      Quoting a nice analysis of the edits:

      In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:

      OPEN
      ADD_BLOCK
      CLOSE
      ADD_BLOCK # perhaps this was an append
      DELETE
      (about 1 hour later) CLOSE

      It is interesting that there was no CLOSE logged before the delete.
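The failure mode during replay can be sketched with a toy loop (hypothetical class and method names, not actual HDFS code): once DELETE removes the file from the namespace, a CLOSE replayed later has no inode to resolve, which is exactly where INodeFile.valueOf throws the FileNotFoundException in the stack trace above.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Toy edit-log replay (hypothetical sketch, not HDFS internals): applies ops
// in order and fails an op that references a path no longer in the namespace,
// mirroring the OP_CLOSE-after-DELETE failure described above.
class EditReplaySketch {
    // path -> under-construction flag
    private final Map<String, Boolean> namespace = new HashMap<>();

    void apply(String op, String path) throws FileNotFoundException {
        switch (op) {
            case "OPEN":
                namespace.put(path, true);
                break;
            case "ADD_BLOCK":
                requireFile(path);
                break;
            case "CLOSE":
                requireFile(path);
                namespace.put(path, false);
                break;
            case "DELETE":
                requireFile(path);
                namespace.remove(path);
                break;
        }
    }

    private void requireFile(String path) throws FileNotFoundException {
        if (!namespace.containsKey(path)) {
            // Same failure the NN hits in INodeFile.valueOf during loadEdits.
            throw new FileNotFoundException("File does not exist: " + path);
        }
    }
}
```

Replaying the sequence OPEN, ADD_BLOCK, CLOSE, ADD_BLOCK, DELETE succeeds, but the trailing CLOSE throws, so the NN cannot finish loading the edits.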

      Grepping for that file name, it turns out the CLOSE was triggered by the LeaseManager when the lease reached its hard limit.

      2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
        Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], 
        src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
      2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
        internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
        /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
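The roughly one-hour gap before the logged CLOSE matches the HDFS lease hard limit, which defaults to one hour. A minimal sketch of the hard-limit check, with hypothetical names rather than the actual LeaseManager internals:

```java
// Sketch of the hard-limit test a lease manager performs (hypothetical names,
// not the real LeaseManager API). In HDFS the hard limit defaults to one hour;
// a lease not renewed within it is forcibly recovered, which can log the
// CLOSE seen above even though the file was already deleted.
class LeaseSketch {
    static final long HARD_LIMIT_MS = 60L * 60 * 1000; // one hour

    private final long lastRenewedMs;

    LeaseSketch(long lastRenewedMs) {
        this.lastRenewedMs = lastRenewedMs;
    }

    boolean expiredHardLimit(long nowMs) {
        return nowMs - lastRenewedMs > HARD_LIMIT_MS;
    }
}
```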
      

        Attachments

        1. HDFS-12369.test.patch
          6 kB
          Xiao Chen
        2. HDFS-12369.03.patch
          7 kB
          Xiao Chen
        3. HDFS-12369.02.patch
          7 kB
          Xiao Chen
        4. HDFS-12369.01.patch
          7 kB
          Xiao Chen

            People

            • Assignee: Xiao Chen (xiaochen)
            • Reporter: Xiao Chen (xiaochen)
            • Votes: 0
            • Watchers: 9
