Resolution: Not a Problem
Affects Version/s: 0.20-append
Fix Version/s: None
- Component: data node
- Version: 0.20-append
- Summary: we found a case where, when a block is truncated during updateBlock,
the length recorded in ongoingCreates is not updated, which leads to a failed append.
- disks / datanode = 3
- failures = 2
failure type = crash
When/where failure happens = (see below)
1) Client writes to dn1-dn2-dn3. The write succeeds.
2) Now the client tries to append. It first calls dn1.recoverBlock(). This recoverBlock succeeds.
3) Suppose the pipeline is dn3-dn2-dn1. Client sends packet to dn3.
dn3 forwards the packet to dn2 and writes it to its own disk (i.e., dn3's disk).
Now dn2 crashes, so dn1 has not received this packet.
4) Client calls dn1.recoverBlock() again, this time with dn3-dn1 in the pipeline.
dn1 then calls dn3.startBlockRecovery() to terminate the writer thread in dn3,
get the in-memory metadata info of the block, and verify that info against
the real file on disk.
dn3 maintains an in-memory data structure called ongoingCreates to record
information about blocks that are currently being created. Once a block is finalized,
its info is removed from ongoingCreates.
Now suppose that at the time dn3 receives the startBlockRecovery() request from dn1,
dn3's writer thread has:
+ finished writing data to disk (hence, the block length on disk is 1024)
+ set the visible in-memory length (hence, the in-memory length is also 1024)
but it has not finalized the block, hence the block info is still in ongoingCreates.
(Note: because the writer thread is interrupted, the finalization never happens.)
As a result, dn3 reports the block to dn1 with length 1024.
5. Now dn1 calls its own startBlockRecovery() successfully (because the on-disk
file length and the in-memory file length match; both are 512 bytes).
6. Now dn1 has a sync list (block_X_GS1 at dn1 with length 512, block_X_GS1 at dn3 with length 1024).
It needs to make sure all datanodes in the pipeline agree on the new GS and length.
dn1 calls NN.nextGS() to get the new GS2. It forms the new block_X_GS2 with length 512, and
calls updateBlock on dn3 and itself.
7. dn3, receiving updateBlock request from dn1, does:
+ rename the block from block_X_GS1 ==> block_X_GS2
+ truncate the block file length from 1024 to 512
But here is the key: it does not update the length of the block kept in ongoingCreates
+ return to dn1 successfully
8. Now dn1 calls its own updateBlock and crashes.
9. From the client's point of view, dn1.recoverBlock fails.
It retries dn1.recoverBlock six times, then declares dn1 bad.
10. Client now calls dn3.recoverBlock()
11. dn3 in turn calls its own startBlockRecovery() to
+ interrupt block writer threads if any
+ getBlockMetadataInfo (as part of forming the syncList, and updateBlock later)
> it first looks in ongoingCreates to see whether the block info is there,
and finds it (because the block is not finalized).
Hence, the in-memory length is 1024 (even though truncateBlock was called earlier)
> it verifies the in-memory length (1024) against the on-disk length (512)
Hence, the unmatched file length exception
12. From the client's point of view, recoverBlock fails because all datanodes are bad.
The client retries dn3.recoverBlock five more times and gets the same exception.
Hence, the append fails.
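The core of steps 7 and 11 can be reproduced with a small model. This is only a sketch under assumptions: the class StaleLengthDemo and its two maps are hypothetical stand-ins for dn3's ongoingCreates and its on-disk block files, the method names merely echo the ones used in this report, and the generation-stamp rename is left out. It is not the actual datanode code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of dn3; not the actual datanode code.
class StaleLengthDemo {
    // Stand-in for ongoingCreates: block id -> in-memory visible length.
    final Map<String, Long> ongoingCreates = new HashMap<>();
    // Stand-in for the block files on disk: block id -> on-disk length.
    final Map<String, Long> onDisk = new HashMap<>();

    // The writer wrote `length` bytes and set the visible length,
    // but was interrupted before the block could be finalized.
    void writeUnfinalized(String blockId, long length) {
        onDisk.put(blockId, length);
        ongoingCreates.put(blockId, length);
    }

    // Models the reported updateBlock behavior: only the file is truncated.
    void updateBlock(String blockId, long newLength) {
        onDisk.put(blockId, newLength);
        // ongoingCreates is deliberately left untouched (the reported bug).
    }

    // Models the length check performed during startBlockRecovery.
    long getBlockMetadataInfo(String blockId) {
        long diskLen = onDisk.get(blockId);
        // An unfinalized block is looked up in ongoingCreates first.
        long memLen = ongoingCreates.containsKey(blockId)
                ? ongoingCreates.get(blockId)
                : diskLen;
        if (memLen != diskLen) {
            throw new IllegalStateException("unmatched file length: in-memory="
                    + memLen + ", on-disk=" + diskLen);
        }
        return memLen;
    }
}
```

In this model, calling writeUnfinalized("block_X", 1024), then updateBlock("block_X", 512), then getBlockMetadataInfo("block_X") throws, because the in-memory length is still 1024 while the on-disk length is 512.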
- To fix it, I think that when truncating the file, we need to update ongoingCreates too
(but I am not sure whether such a fix would affect any other workload)
- Interestingly, NN.leaseRecovery fails because of the exact same exception at dn3.
- Until the dead node restarts and NN.leaseRecovery is triggered again, NN is still the lease holder of the file
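The suggested fix (updating ongoingCreates when the file is truncated) might look roughly like the sketch below. Everything here is an assumption for illustration: TruncateFixSketch and its maps are hypothetical stand-ins for the datanode's in-memory and on-disk state, and whether the real change would affect other workloads remains open, as noted above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix; not the actual datanode code.
class TruncateFixSketch {
    // Stand-in for ongoingCreates: block id -> in-memory visible length.
    final Map<String, Long> ongoingCreates = new HashMap<>();
    // Stand-in for the block files on disk: block id -> on-disk length.
    final Map<String, Long> onDisk = new HashMap<>();

    // The writer wrote `length` bytes but never finalized the block.
    void writeUnfinalized(String blockId, long length) {
        onDisk.put(blockId, length);
        ongoingCreates.put(blockId, length);
    }

    // Truncate the file and, if the block is still being created,
    // bring the in-memory length in line with the new on-disk length.
    void updateBlock(String blockId, long newLength) {
        onDisk.put(blockId, newLength);
        if (ongoingCreates.containsKey(blockId)) {
            ongoingCreates.put(blockId, newLength);
        }
    }

    // Same length check as in recovery: in-memory vs. on-disk.
    long getBlockMetadataInfo(String blockId) {
        long diskLen = onDisk.get(blockId);
        long memLen = ongoingCreates.containsKey(blockId)
                ? ongoingCreates.get(blockId)
                : diskLen;
        if (memLen != diskLen) {
            throw new IllegalStateException("unmatched file length: in-memory="
                    + memLen + ", on-disk=" + diskLen);
        }
        return memLen;
    }
}
```

With this change, a later recovery attempt on the same unfinalized block sees matching lengths (512 in the scenario above) instead of raising the mismatch exception.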
This bug was found by our Failure Testing Service framework:
For questions, please email us: Thanh Do (firstname.lastname@example.org) and
Haryadi Gunawi (email@example.com)