Konstantin Shvachko, thanks for the helpful discussion. I took a further look and found more details:
but I don't know what is the reason for postponing replica deletion. Postponing should probably be avoided in this case, since the commitBlockSync() is as good as block report for the particular block.
The results are as follows:
- The postponing log in invalidateBlock (called by markBlockAsCorrupt) is not a result of block recovery; it is a result of the block report from dn0 after it restarts.
- As discussed, we call setDataNodeDead after dn0 shuts down, so NN's heartbeat check finds dn0 dead and removes it. Later we truncate, block recovery happens, and the genStamp of the last block is increased. Then dn0 restarts successfully and sends a block report, but its replica of the last block has the old genStamp, so it is marked as corrupt. When NN invalidates that replica, nr.replicasOnStaleNodes() > 0 is true, because DatanodeStorageInfo#blockContentsStale for dn0 in NN is still true at this time; blockContentsStale is set to false in DatanodeStorageInfo#receivedBlockReport, which is called after BlockManager finishes processing the block report.
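To illustrate the timing above, here is a toy model (not actual HDFS code; the class is stripped down to the one flag discussed) of how a restarted DN's storage stays stale, and therefore keeps replica deletion postponed, until its first block report is fully processed:

```java
// Toy model of DatanodeStorageInfo staleness after a DN restart.
class StorageInfo {
    // Set to true when the DN (re-)registers; the NN cannot yet trust
    // its view of block contents.
    boolean blockContentsStale = true;

    // In real HDFS this is invoked after BlockManager finishes
    // processing the storage's block report.
    void receivedBlockReport() { blockContentsStale = false; }
}

public class PostponeDemo {
    // Simplified stand-in for the nr.replicasOnStaleNodes() > 0 check
    // in invalidateBlock, for a single replica on this storage.
    static boolean shouldPostpone(StorageInfo s) {
        return s.blockContentsStale;
    }

    public static void main(String[] args) {
        StorageInfo dn0 = new StorageInfo();      // dn0 just restarted
        System.out.println(shouldPostpone(dn0));  // true: deletion postponed
        dn0.receivedBlockReport();                // report fully processed
        System.out.println(shouldPostpone(dn0));  // false: NN may invalidate
    }
}
```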
otherwise NN can schedule recovery (with 1/3 probability) on the dead DN as a primary. It will wait for the heartbeat from it, but will never receive it because it is down.
Yes, the checkBlockRecovery() timeout is because of this. Currently, when starting block recovery, NN chooses a DN (say dn0) as the primary; dn0 receives the BlockRecoveryCommand through the response to its heartbeat to NN. But if dn0 is dead, block recovery will not happen until expiredHardLimit is true in LeaseManager#checkLeases, at which point a new DN is chosen as the primary and a new block recovery is scheduled for the last block.
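A toy sketch (not the real BlockManager code; names and the trivial "pick the first replica" rule are simplifications) of why recovery stalls when the chosen primary is dead:

```java
import java.util.Arrays;
import java.util.List;

public class RecoveryPrimaryDemo {
    static class DataNode {
        final String name;
        final boolean alive;
        DataNode(String name, boolean alive) { this.name = name; this.alive = alive; }
    }

    // NN picks one replica holder as primary; here simply the first one.
    // (The real code rotates among the replicas, hence the ~1/3 chance
    // of landing on the dead one in a 3-replica setup.)
    static DataNode choosePrimary(List<DataNode> replicas) {
        return replicas.get(0);
    }

    public static void main(String[] args) {
        List<DataNode> replicas = Arrays.asList(
            new DataNode("dn0", false),   // shut down, not yet restarted
            new DataNode("dn1", true),
            new DataNode("dn2", true));
        DataNode primary = choosePrimary(replicas);
        if (!primary.alive) {
            // The BlockRecoveryCommand is delivered in the primary's
            // heartbeat response, so a dead primary never receives it;
            // recovery stalls until LeaseManager#checkLeases hits the
            // hard limit and schedules a fresh recovery with a new primary.
            System.out.println("recovery stalled on dead primary " + primary.name);
        }
    }
}
```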
You need to call setDataNodeDead() after shutdown()
Our modification to testTruncateWithDataNodesRestart asserts that the replica count decreases by 1, because the replica of the last block on dn0 should be marked as corrupt in NN after block recovery.
If we call setDataNodeDead after dn0 shuts down but before truncate, the replica count is already decreased by 1 at that point, so the newly added assertion would not make sense.
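To make the ordering argument concrete, here is a minimal sketch (the counts are illustrative, not taken from the actual test run) of the replica count the assertion sees under the two orderings:

```java
public class AssertionOrderingDemo {
    public static void main(String[] args) {
        final int replication = 3;

        // Ordering used by the test: shutdown dn0 -> truncate -> recovery.
        // dn0's old-genStamp replica is marked corrupt only after block
        // recovery, so the count drops from 3 to 2 exactly where the new
        // assertion checks it.
        int afterRecovery = replication - 1;
        System.out.println("assert replicas == " + afterRecovery);

        // If setDataNodeDead ran before truncate instead, NN would remove
        // dn0 first and the count would already be 2 before recovery, so
        // the assertion would no longer exercise the corrupt-replica path.
    }
}
```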
BTW, your change completely eliminates the failure of testTruncateWithDataNodesRestartImmediately() from HDFS-7886, which I ran without triggerBlockReports().