[HDFS-7809] Block and lease recovery failure caused by snapshot issue - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.5.0
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 2.7.0

Description

On a cluster running 2.5, we observed a decommissioning failure caused by a file that had been under construction for 3 days. It turned out that the file had been abandoned and that the name node had carried out a lease recovery 3 days earlier.

The block recovery failed because the name node threw a quota exception while serving commitBlockSynchronization(). After this failure, no further recovery attempt was made, leaving the file in the under-construction state forever.

Furthermore, the nature of the recovery failure is very strange. Even though snapshots were never used in the cluster, the name node tried to record a snapshot diff, which required incrementing the nsquota usage by 1. The user happened to have run out of nsquota at that time, so the update failed and caused commitBlockSynchronization() to fail. We do see quota discrepancies occasionally. Were those perhaps caused by something like this all along?

A few observations:

Lease recovery did not complete, yet it was not retried.
No snapshot was in use, but the recovery somehow went through a snapshot-related code path.
The quota update during commitBlockSynchronization() should be done unconditionally.
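
The failure mode described in this report can be illustrated with a toy model. This is only a sketch under stated assumptions, not HDFS code: the names `Directory`, `File`, `QuotaExceeded`, `commit_block_sync` and `recover_once` are hypothetical, and the `verify_quota` flag reflects one reading of "unconditionally" (apply the usage update without checking it against the quota).

```python
class QuotaExceeded(Exception):
    """Hypothetical stand-in for a namespace quota violation."""


class Directory:
    """Toy directory that tracks namespace quota and usage."""

    def __init__(self, ns_quota, ns_used):
        self.ns_quota = ns_quota
        self.ns_used = ns_used

    def add_namespace_usage(self, delta, verify=True):
        # With verify=True the update is rejected once the quota is exhausted.
        if verify and self.ns_used + delta > self.ns_quota:
            raise QuotaExceeded(f"namespace quota {self.ns_quota} exceeded")
        self.ns_used += delta


class File:
    """Toy file that starts in the under-construction state."""

    def __init__(self):
        self.under_construction = True


def commit_block_sync(directory, file, verify_quota):
    # Assumption taken from the report: the commit path charges one
    # namespace unit (for a recorded diff) before the file is closed.
    directory.add_namespace_usage(1, verify=verify_quota)
    file.under_construction = False


def recover_once(directory, file, verify_quota):
    """Make a single recovery attempt with no retry, as observed."""
    try:
        commit_block_sync(directory, file, verify_quota)
        return True
    except QuotaExceeded:
        return False
```

In this model, a directory created with `Directory(ns_quota=10, ns_used=10)` makes `recover_once(..., verify_quota=True)` return `False` and leaves the file under construction, while `verify_quota=False` lets the commit finish and the file close. A retry would not help in the first case unless the quota situation changed in between.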

Issue Links

is duplicated by: HDFS-7056 Snapshot support for truncate (Closed)

People

Assignee: Unassigned

Reporter: Kihwal Lee

Votes: 0

Watchers: 7

Dates

Created: 18/Feb/15 15:40

Updated: 10/Apr/15 20:30

Resolved: 18/Feb/15 20:39