Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Hadoop Flags: Reviewed
Description
HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.
Recently, we have observed a NameNode that was unable to start, failing with the following exception:
2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
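For context, the FileNotFoundException above is raised when FSEditLogLoader.applyEditLogOp resolves the path for the CLOSE op and gets back no inode. A simplified, self-contained paraphrase of the check in INodeFile.valueOf (not the actual Hadoop source; ValueOfSketch and the stripped-down classes below are only illustrative stand-ins):
{code:java}
import java.io.FileNotFoundException;

// Illustrative stand-ins; the real classes live in
// org.apache.hadoop.hdfs.server.namenode and are far more involved.
class INode {
    boolean isFile() { return this instanceof INodeFile; }
}

class INodeFile extends INode {
    // Simplified paraphrase of INodeFile.valueOf: edit-log replay first
    // resolves the path to an inode, then asks for it as a file. If the path
    // no longer exists (e.g. it was deleted earlier in the same edit stream),
    // the resolved inode is null and replay aborts NN startup with the
    // exception shown above.
    static INodeFile valueOf(INode inode, String path) throws FileNotFoundException {
        if (inode == null) {
            throw new FileNotFoundException("File does not exist: " + path);
        }
        if (!inode.isFile()) {
            throw new FileNotFoundException("Path is not a file: " + path);
        }
        return (INodeFile) inode;
    }
}

public class ValueOfSketch {
    public static void main(String[] args) throws Exception {
        // Resolution returns null because the file was already deleted,
        // reproducing the startup failure seen in the log above.
        INodeFile.valueOf(null,
            "/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M");
    }
}
{code}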
Quoting a nicely analysed set of edits:
In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:
OPEN
ADD_BLOCK
CLOSE
ADD_BLOCK # perhaps this was an append
DELETE
(about 1 hour later) CLOSE
It is interesting that there was no CLOSE logged before the delete.
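To make the ordering problem concrete, here is a minimal, self-contained sketch (plain Java, not Hadoop code; the op handling is deliberately simplified and the path is a placeholder) that replays the sequence above against an in-memory namespace and fails at the trailing CLOSE precisely because the DELETE has already removed the path:
{code:java}
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EditReplaySketch {

    // A toy namespace: path -> block count. Stands in for the FSDirectory.
    private final Map<String, Integer> files = new HashMap<>();

    void apply(String op, String path) throws FileNotFoundException {
        switch (op) {
            case "OPEN":
                files.put(path, 0);
                break;
            case "ADD_BLOCK":
            case "CLOSE":
                // As in FSEditLogLoader.applyEditLogOp, these ops require the
                // file to still exist; otherwise replay aborts NN startup.
                if (!files.containsKey(path)) {
                    throw new FileNotFoundException("File does not exist: " + path);
                }
                if (op.equals("ADD_BLOCK")) {
                    files.merge(path, 1, Integer::sum);
                }
                break;
            case "DELETE":
                files.remove(path);
                break;
        }
    }

    public static void main(String[] args) throws Exception {
        String path = "/tmp/part-00000";  // illustrative placeholder path
        List<String> ops = List.of(
            "OPEN", "ADD_BLOCK", "CLOSE",
            "ADD_BLOCK",   // perhaps an append
            "DELETE",
            "CLOSE");      // logged ~1 hour later by lease recovery
        EditReplaySketch replay = new EditReplaySketch();
        for (String op : ops) {
            replay.apply(op, path);  // throws FileNotFoundException at the final CLOSE
        }
    }
}
{code}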
Grepping for that file name, it turns out the close was triggered by the LeaseManager when the lease reached its hard limit.
2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
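The hard lease limit in HDFS defaults to one hour, which matches the gap between the DELETE and the trailing CLOSE. Below is a schematic sketch of the suspected interaction, not the actual LeaseManager/FSNamesystem code (HARD_LIMIT_MS, pendingCreates and editLog are illustrative names): if the delete does not clear the path from the lease holder's open-file set, the lease monitor later closes it on the client's behalf and logs a CLOSE after the DELETE.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Schematic sketch of the suspected interaction (not actual Hadoop code).
public class LeaseRecoverySketch {

    // HDFS's default hard lease limit is one hour, matching the ~1 hour gap
    // between the DELETE and the trailing CLOSE in the edits above.
    static final long HARD_LIMIT_MS = 60L * 60 * 1000;

    // Paths still held open by a client ("pending creates"), keyed by path,
    // with the time the lease was last renewed.
    final Map<String, Long> pendingCreates = new ConcurrentHashMap<>();

    // Toy edit log: in the real NN this is where OP_CLOSE ends up.
    final StringBuilder editLog = new StringBuilder();

    // What the delete should also do but, in the reported scenario, apparently
    // does not: remove the path from the lease holder's open-file set.
    void delete(String path) {
        editLog.append("DELETE ").append(path).append('\n');
        // pendingCreates.remove(path);  // if this is skipped, the path leaks
    }

    // Periodic lease monitor: once the hard limit expires, the file is closed
    // on the client's behalf and a CLOSE op is logged -- even if the path was
    // deleted in the meantime, producing the CLOSE-after-DELETE ordering that
    // later breaks edit replay.
    void checkLeases(long now) {
        pendingCreates.forEach((path, lastRenewed) -> {
            if (now - lastRenewed > HARD_LIMIT_MS) {
                editLog.append("CLOSE ").append(path).append('\n');
                pendingCreates.remove(path);
            }
        });
    }

    public static void main(String[] args) {
        LeaseRecoverySketch nn = new LeaseRecoverySketch();
        long t0 = 0;
        nn.pendingCreates.put("/tmp/part-00000", t0);  // client opened the file
        nn.delete("/tmp/part-00000");                  // deleted while still open
        nn.checkLeases(t0 + HARD_LIMIT_MS + 1);        // hard limit expires ~1h later
        System.out.print(nn.editLog);                  // DELETE then CLOSE: the bad ordering
    }
}
{code}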