Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 2.5.0
Description
On a cluster running 2.5, we observed a decommissioning failure caused by a file that had been under construction for 3 days. It turned out that the file had been abandoned and that the name node had started lease recovery for it 3 days earlier.
The block recovery failed because the name node threw a quota exception while serving commitBlockSynchronization(). After this failure, no further recovery attempt was made, leaving the file in the under-construction state forever.
Furthermore, the nature of the recovery failure is very strange. Even though snapshots were never used in the cluster, the name node tried to record a snapshot diff, which required incrementing nsquota by 1. The user happened to have run out of nsquota at that time, so the update failed and caused commitBlockSynchronization() to fail. We do see quota discrepancies occasionally. Were those perhaps caused by something like this all along?
A few observations:
- Lease recovery did not complete, yet it was never retried.
- No snapshot was in use, but the recovery somehow went through a snapshot-related code path.
- The quota update during commitBlockSynchronization() should be done unconditionally.
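The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `LeaseRecoverySketch`, `Dir`, `FileUnderConstruction`, and `commitSync` are hypothetical names invented for illustration and are not HDFS classes or the actual name node code. It shows how a quota-verified usage update can abort the commit and leave the file under construction, while an unconditional update lets the file close.

```java
/** Toy model of the reported failure. None of these types are HDFS classes. */
public class LeaseRecoverySketch {

    /** Hypothetical stand-in for a quota violation. */
    static class QuotaExceededException extends Exception {
        QuotaExceededException(String msg) { super(msg); }
    }

    /** Hypothetical directory that tracks namespace usage against a quota. */
    static class Dir {
        final long nsQuota;
        long nsUsed;

        Dir(long nsQuota, long nsUsed) {
            this.nsQuota = nsQuota;
            this.nsUsed = nsUsed;
        }

        /** Verified update: refuses to go over the quota. */
        void addNamespaceVerified(long delta) throws QuotaExceededException {
            if (nsUsed + delta > nsQuota) {
                throw new QuotaExceededException(
                    "nsquota " + nsQuota + " exceeded: " + (nsUsed + delta));
            }
            nsUsed += delta;
        }

        /** Unconditional update: records the usage even if it is over quota. */
        void addNamespaceUnchecked(long delta) {
            nsUsed += delta;
        }
    }

    /** Hypothetical file that stays open until the commit succeeds. */
    static class FileUnderConstruction {
        boolean underConstruction = true;
    }

    /**
     * Models the commit step of block recovery.
     * Returns true if the file was closed, false if the commit was aborted.
     */
    static boolean commitSync(Dir dir, FileUnderConstruction file,
                              long nsDelta, boolean verifyQuota) {
        try {
            if (verifyQuota) {
                dir.addNamespaceVerified(nsDelta);
            } else {
                dir.addNamespaceUnchecked(nsDelta);
            }
        } catch (QuotaExceededException e) {
            // Commit aborted: the file stays under construction,
            // and nothing in this model schedules a retry.
            return false;
        }
        file.underConstruction = false;
        return true;
    }

    public static void main(String[] args) {
        // User is exactly at the namespace quota.
        Dir verifiedDir = new Dir(10, 10);
        FileUnderConstruction stuck = new FileUnderConstruction();
        System.out.println("verified commit closed file: "
            + commitSync(verifiedDir, stuck, 1, true));

        Dir uncheckedDir = new Dir(10, 10);
        FileUnderConstruction closed = new FileUnderConstruction();
        System.out.println("unconditional commit closed file: "
            + commitSync(uncheckedDir, closed, 1, false));
    }
}
```

In this model the unconditional path leaves the directory one entry over its quota, which is the trade-off implied by the third observation: completing recovery is preferred over strictly enforcing the quota at that point.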
Issue Links
- is duplicated by: HDFS-7056 Snapshot support for truncate (Closed)