Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
Reviewed
Description
We have a datanode that did not get decommission done for days. It turned out that there was an under replicated block that was never placed in the neededReplication queue and therefore did not get replicated. The following debug line showed the problem:
" DEBUG org.apache.hadoop.dfs.StateChange: UnderReplicationBlocks.update blk_-7437651423871278837_0 curReplicas 8
curExpectedReplicas 10 oldReplicas 9 oldExpectedReplicas 10 curPri 2 oldPri 2"
The block was not in the neededReplication queue, but the update method concluded that the block was under replicated and the priority level did not change, so it did not add the block to the needReplication queue.
The solution is that in stead of using the update method, the name node should use the add method to add the block to the neededReplication queue. The add method guarantees success if the block is indeed under replicated.