Per Andy's comment on
TestDatanodeBlockScanner still fails about 1/5 runs in testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also uncovered by
The failure scenario for this one is a bit trickier. I think the sequence below captures it:
- The test corrupts 2/3 replicas.
- The client reports a bad block.
- NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
- DN notices the incoming replica is corrupt and reports it as a bad block, but does not inform the NN that re-replication failed.
- NN keeps the block on pendingReplications.
- The BP scanner wakes up on both DNs with corrupt replicas, and both report corruption. NN reports both as duplicates: one duplicates the client's report, the other duplicates the DN report above.
- Since the block is still on pendingReplications, NN does not schedule another replication.
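
To make the sequence above easier to follow, here is a minimal toy model of the bookkeeping involved. It is only a sketch and not HDFS code: ToyNameNode, StuckReplicationDemo, the dn1/dn2 names and the single pending flag are all hypothetical stand-ins, and the model deliberately leaves out any timeout handling for pending entries.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the NN bookkeeping in the scenario; none of this is real HDFS code. */
class ToyNameNode {
    // Datanodes the NN already knows hold a corrupt replica (hypothetical structure).
    private final Set<String> corruptReplicas = new HashSet<>();
    // True while a re-replication is outstanding; stands in for pendingReplications.
    private boolean pending = false;
    private int scheduled = 0;

    /** Records a bad-block report; returns false if it duplicates an earlier report. */
    boolean reportBadBlock(String datanode) {
        boolean isNew = corruptReplicas.add(datanode);
        // A new replication is scheduled only if none is already pending.
        if (!pending) {
            pending = true;
            scheduled++;
        }
        return isNew;
    }

    boolean isPending() { return pending; }

    int scheduledReplications() { return scheduled; }
}

class StuckReplicationDemo {
    /** Replays the reports from the scenario, assuming dn1 and dn2 hold the corrupt replicas. */
    static ToyNameNode runScenario() {
        ToyNameNode nn = new ToyNameNode();
        nn.reportBadBlock("dn1"); // client report; NN schedules a re-replication
        // NN picks dn2 (also corrupt) as the source. The receiving DN reports the
        // bad block, but nothing tells the NN that the re-replication failed.
        nn.reportBadBlock("dn2");
        nn.reportBadBlock("dn1"); // scanner on dn1: duplicate report
        nn.reportBadBlock("dn2"); // scanner on dn2: duplicate report
        return nn;
    }

    public static void main(String[] args) {
        ToyNameNode nn = runScenario();
        System.out.println("pending=" + nn.isPending()
                + ", scheduled=" + nn.scheduledReplications());
    }
}
```

In this model the block stays pending after all four reports and only one replication is ever scheduled, which matches the stuck state described in the last step.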