We found the following problem: when a storage device on a DN fails, the NN continues to believe the replicas on that storage are valid and does not schedule re-replication of the affected blocks.
A DN has 12 storage disks, and the DN sends one block report per storage, so when a disk fails the number of block reports from that DN drops from 12 to 11. Because dfs.datanode.failed.volumes.tolerated is configured to be > 0, the NN still considers the DN healthy.
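For context, the DN-side reporting looks roughly like the sketch below, modeled on BPServiceActor in Hadoop 2.6 (signatures differ in other releases, and the method name here is illustrative):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.hdfs.protocol.BlockListAsLongs;
    import org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol;
    import org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration;
    import org.apache.hadoop.hdfs.server.protocol.DatanodeStorage;
    import org.apache.hadoop.hdfs.server.protocol.StorageBlockReport;

    // One StorageBlockReport per storage, all sent in a single
    // DatanodeProtocol.blockReport() RPC. A failed volume is simply absent
    // from the per-volume map, so the NN receives 11 reports instead of 12
    // but never an explicit "this storage is gone" report.
    void sendBlockReports(DatanodeProtocol namenode, DatanodeRegistration reg,
        String blockPoolId,
        Map<DatanodeStorage, BlockListAsLongs> perVolumeBlockLists)
        throws IOException {
      StorageBlockReport[] reports =
          new StorageBlockReport[perVolumeBlockLists.size()];
      int i = 0;
      for (Map.Entry<DatanodeStorage, BlockListAsLongs> e :
          perVolumeBlockLists.entrySet()) {
        reports[i++] = new StorageBlockReport(
            e.getKey(), e.getValue().getBlockListAsLongs());
      }
      namenode.blockReport(reg, blockPoolId, reports);
    }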
1. A disk fails. All blocks on that disk are removed from the DN's dataset.
2. NN receives DatanodeProtocol.DISK_ERROR, but that alone isn't enough for the NN to remove the failed storage and its replicas from the BlocksMap (see the sketch after this list). In addition, the block report doesn't convey the loss as a diff, since reporting is done per storage: the report for the failed storage simply stops arriving.
3. Run fsck on an affected file and confirm the NN's BlocksMap still has that replica (an example invocation follows the list).
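The DISK_ERROR report in step 2 carries only a free-form message. A minimal sketch of the DN-side call, assuming the branch-2 DatanodeProtocol; the message text and volume path are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol;
    import org.apache.hadoop.hdfs.server.protocol.DatanodeRegistration;

    // DISK_ERROR includes no storage ID and no block list, so the NN has
    // nothing it could use to prune the stale replicas from the BlocksMap.
    void reportDiskError(DatanodeProtocol namenode, DatanodeRegistration reg)
        throws IOException {
      namenode.errorReport(reg, DatanodeProtocol.DISK_ERROR,
          "DataNode failed volumes: /data/3/dfs/dn"); // path is illustrative
    }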
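For step 3, an fsck run like the following (path illustrative) still lists the replica that lived on the failed storage:

    hdfs fsck /path/to/file -files -blocks -locations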