Two bugs in DFS contributed to the problem:
(1). DataNode does not sync on modification to the counter "xmitsInProgress", which keeps track of the number of replication in progress. When two threads update the counter concurrently, race condition may occurs. The counter may change to be a non-zero value when no replication is going on.
(2). Each DN is configured to have at most 2 replications in progress. When DN notifies NN that it has 1 replication in progress, NN should be able to send one block replication request to DN. But NN wrongly interprets the counter as the number of targets. When it sees that the block is scheduled to 2 targets but DN can only take 1, it sends an empty replication request to DN. As a result, blocking all replications from this DataNode. If the DataNode is the only source of an under-replicated block, the block will never get replicated.
Fixing either (1) or (2) could fix the problem. I think (1) is more fundamental so I will fix (1) in this jira and file a different jira to fix (2).