In my test, it looks like the cause of the timeout is a race in the block recovery process: the second DN sends its block report after the block truncation has finished, so its replica is marked as corrupt. However, the ReplicationMonitor cannot schedule an extra replica because there are only 3 DataNodes in the test.
You are right. I knew one replica was corrupted, but I didn't know it was the second one. Thank you for the thorough analysis!
What I'm doing in the 01 patch is triggering a second block report so the corrupted replica gets deleted on dn1, which lets the ReplicationMonitor schedule copying the block back to dn1.
To trigger the block report or not before restarting the DataNodes...
That's not what I do. In the 01 patch, checkBlockRecovery(p) makes sure the truncation has completed; triggerBlockReports() is for the second block report (sketched below).
oldBlock.getBlock().getGenerationStamp() + 1);
DFSTestUtil.waitReplication(fs, p, REPLICATION);
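For context, a minimal sketch of how I expect that sequence to look in the test; the exact placement of triggerBlockReports() relative to the quoted lines is my assumption:
checkBlockRecovery(p);                            // wait until the truncation recovery has completed
cluster.triggerBlockReports();                    // second block report: the corrupt replica on dn1 is reported and invalidated
DFSTestUtil.waitReplication(fs, p, REPLICATION);  // ReplicationMonitor can now copy the block back to dn1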
It is very possible that the truncate is done before the restart.
That's very unlikely, because fs.truncate(p, newLength) is non-blocking.
boolean isReady = fs.truncate(p, newLength);
assertFalse(isReady);
cluster.restartDataNode(dn0, true, true);
cluster.restartDataNode(dn1, true, true);
cluster.waitActive();
So maybe a quick fix is to change the total number of DNs in the test from 3 to 4.
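For comparison, that alternative would only touch the cluster setup, e.g. something like the following (hypothetical snippet, not from any patch):
cluster = new MiniDFSCluster.Builder(conf).numDataNodes(4).build();  // one spare DN so the ReplicationMonitor always has a target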
That works too, but I prefer my approach, even though it makes the time spent in DFSTestUtil.waitReplication(..) 4-6 seconds longer (waiting for the deletion and the copy).
It's worth it, because the purpose of the test case is to schedule block recovery to dn0/dn1, which were restarted; increasing the number of DNs would lower the chance of that happening.
Uploaded the 02 patch. It adds Thread.sleep(2000) to make sure it's the second block report.
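Roughly, the relevant part of the 02 patch would look like this (my sketch of the intent; the exact code may differ):
Thread.sleep(2000);                               // let the first (post-restart) block reports go out
cluster.triggerBlockReports();                    // force the second block report
DFSTestUtil.waitReplication(fs, p, REPLICATION);  // wait for the re-replication to dn1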