Description
Lease recovery incorrectly handles under-construction (UC) files if the last block is COMPLETE but the penultimate block is only COMMITTED. "Incorrectly handles" is a euphemism for "loops infinitely for days and leaves all abandoned streams open until customers complain."
The problem may manifest when:
- Block1 is committed but seemingly never completed
- Block2 is allocated
- Lease recovery is initiated for block2
- Commit block synchronization invokes FSNamesystem#closeFileCommitBlocks, causing:
  - commitOrCompleteLastBlock to mark block2 as complete
  - finalizeINodeFileUnderConstruction/INodeFile.assertAllBlocksComplete to throw IllegalStateException because the penultimate block1 is "COMMITTED but not COMPLETE"
- The next lease recovery results in an infinite loop.
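The sequence above can be sketched with a toy model. This is only an illustration under stated assumptions: `CloseFileSketch`, `Block`, `BlockState`, `completeLastBlock`, and `closeFile` are hypothetical stand-ins, not the real HDFS classes or signatures. The sketch shows only that completing the last block and then asserting on every block leaves the file open with block2 COMPLETE and block1 still COMMITTED.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the failure sequence; not the real HDFS code.
class CloseFileSketch {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static final class Block {
        final String name;
        BlockState state;

        Block(String name, BlockState state) {
            this.name = name;
            this.state = state;
        }
    }

    // Stand-in for commitOrCompleteLastBlock: only the last block is touched.
    static void completeLastBlock(List<Block> blocks) {
        blocks.get(blocks.size() - 1).state = BlockState.COMPLETE;
    }

    // Stand-in for INodeFile.assertAllBlocksComplete.
    static void assertAllBlocksComplete(List<Block> blocks) {
        for (Block b : blocks) {
            if (b.state != BlockState.COMPLETE) {
                throw new IllegalStateException(
                    b.name + " is " + b.state + " but not COMPLETE");
            }
        }
    }

    // Stand-in for closeFileCommitBlocks: the last block is completed
    // before the check on all blocks, so a failure leaves it COMPLETE.
    static void closeFile(List<Block> blocks) {
        completeLastBlock(blocks);
        assertAllBlocksComplete(blocks);
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("block1", BlockState.COMMITTED));
        blocks.add(new Block("block2", BlockState.UNDER_CONSTRUCTION));
        try {
            closeFile(blocks);
        } catch (IllegalStateException e) {
            System.out.println("close failed: " + e.getMessage());
        }
        for (Block b : blocks) {
            System.out.println(b.name + " -> " + b.state);
        }
    }
}
```

After the failed close, the model is in exactly the state the description names: last block complete, penultimate block committed, file still under construction.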
The LeaseManager expects that FSNamesystem#internalReleaseLease will either initiate recovery and renew the lease, or remove the lease. In the described state it does neither: the switch case simply breaks out if the last block is complete. (Ironically, the case statement contains an assert.) Since nothing changed, the lease is still the “next” lease to be processed. The lease monitor loops on the same lease for 25 ms, sleeps for 2 s, then loops on it again.
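The stuck monitor can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `LeaseMonitorSketch`, `Outcome`, and the method signatures are hypothetical, and a fixed iteration budget stands in for the 25 ms time slice so the result is deterministic. It is not the LeaseManager implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the lease monitor contract; not the real LeaseManager.
class LeaseMonitorSketch {
    enum Outcome { RECOVERY_STARTED_AND_RENEWED, LEASE_REMOVED, NOTHING_CHANGED }

    // Stand-in for internalReleaseLease, reduced to the two block states
    // that matter for this issue.
    static Outcome internalReleaseLease(boolean lastComplete,
                                        boolean penultimateComplete) {
        if (lastComplete && penultimateComplete) {
            return Outcome.LEASE_REMOVED;          // file closes normally
        }
        if (lastComplete) {
            return Outcome.NOTHING_CHANGED;        // the problematic break
        }
        return Outcome.RECOVERY_STARTED_AND_RENEWED;
    }

    // One monitor pass over the expired leases, oldest first. Returns how
    // many times a lease was processed before the queue emptied or the
    // budget (a stand-in for the time slice) ran out.
    static int checkLeases(Deque<String> expiredLeases, boolean lastComplete,
                           boolean penultimateComplete, int budget) {
        int attempts = 0;
        while (!expiredLeases.isEmpty() && attempts < budget) {
            attempts++;
            switch (internalReleaseLease(lastComplete, penultimateComplete)) {
                case LEASE_REMOVED:
                case RECOVERY_STARTED_AND_RENEWED:
                    // Either way the lease is no longer expired.
                    expiredLeases.pollFirst();
                    break;
                case NOTHING_CHANGED:
                    // Same lease stays at the head of the queue.
                    break;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        Deque<String> stuck = new ArrayDeque<>();
        stuck.add("lease-1");
        int attempts = checkLeases(stuck, true, false, 5);
        System.out.println("attempts=" + attempts + " remaining=" + stuck.size());
    }
}
```

In this model the pass with a complete last block and a committed penultimate block consumes its whole budget and leaves the lease in place, so every later pass starts on the same lease.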
Issue Links
- is broken by
  - HDFS-11445 FSCK shows overall health status as corrupt even one replica is corrupt (Resolved)
- is caused by
  - HDFS-14429 Block remain in COMMITTED but not COMPLETE caused by Decommission (Resolved)
- is related to
  - HDFS-11499 Decommissioning stuck because of failing recovery (Resolved)