Hadoop HDFS / HDFS-12369

Edit log corruption due to hard lease recovery of not-closed file which has snapshots

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-beta1, 2.8.3
    • Component/s: namenode
    • Labels: None

      Description

      HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.

      Recently, we have observed NN not able to start with the following exception:

      2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
      java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
              at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
              at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
              at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
              at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
      

      Quoting a nice analysis of the edits:

      In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:

      OPEN
      ADD_BLOCK
      CLOSE
      ADD_BLOCK # perhaps this was an append
      DELETE
      (about 1 hour later) CLOSE

      It is interesting that there was no CLOSE logged before the delete.
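The failure mode during replay can be sketched with a toy loop (hypothetical class and method names, not actual HDFS code): once DELETE removes the file from the namespace, a CLOSE replayed later has no inode to resolve, which is exactly where INodeFile.valueOf throws the FileNotFoundException in the stack trace above.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Toy edit-log replay (hypothetical sketch, not HDFS internals): applies ops
// in order and fails an op that references a path no longer in the namespace,
// mirroring the OP_CLOSE-after-DELETE failure described above.
class EditReplaySketch {
    // path -> under-construction flag
    private final Map<String, Boolean> namespace = new HashMap<>();

    void apply(String op, String path) throws FileNotFoundException {
        switch (op) {
            case "OPEN":
                namespace.put(path, true);
                break;
            case "ADD_BLOCK":
                requireFile(path);
                break;
            case "CLOSE":
                requireFile(path);
                namespace.put(path, false);
                break;
            case "DELETE":
                requireFile(path);
                namespace.remove(path);
                break;
        }
    }

    private void requireFile(String path) throws FileNotFoundException {
        if (!namespace.containsKey(path)) {
            // Same failure the NN hits in INodeFile.valueOf during loadEdits.
            throw new FileNotFoundException("File does not exist: " + path);
        }
    }
}
```

Replaying the sequence OPEN, ADD_BLOCK, CLOSE, ADD_BLOCK, DELETE succeeds, but the trailing CLOSE throws, so the NN cannot finish loading the edits.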

      Grepping for that file name, it turns out the CLOSE was triggered by the LeaseManager when the lease reached its hard limit.

      2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
        Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], 
        src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
      2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
        internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
        /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
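The roughly one-hour gap before the logged CLOSE matches the HDFS lease hard limit, which defaults to one hour. A minimal sketch of the hard-limit check, with hypothetical names rather than the actual LeaseManager internals:

```java
// Sketch of the hard-limit test a lease manager performs (hypothetical names,
// not the real LeaseManager API). In HDFS the hard limit defaults to one hour;
// a lease not renewed within it is forcibly recovered, which can log the
// CLOSE seen above even though the file was already deleted.
class LeaseSketch {
    static final long HARD_LIMIT_MS = 60L * 60 * 1000; // one hour

    private final long lastRenewedMs;

    LeaseSketch(long lastRenewedMs) {
        this.lastRenewedMs = lastRenewedMs;
    }

    boolean expiredHardLimit(long nowMs) {
        return nowMs - lastRenewedMs > HARD_LIMIT_MS;
    }
}
```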
      

        Attachments

        1. HDFS-12369.test.patch
          6 kB
          Xiao Chen
        2. HDFS-12369.03.patch
          7 kB
          Xiao Chen
        3. HDFS-12369.02.patch
          7 kB
          Xiao Chen
        4. HDFS-12369.01.patch
          7 kB
          Xiao Chen

            People

            • Assignee: Xiao Chen (xiaochen)
            • Reporter: Xiao Chen (xiaochen)
            • Votes: 0
            • Watchers: 9
