I am currently experiencing a looping situation:
The namenode spends approximately 1:50 (min:sec) logging a massive number of lines stating that some blocks don't belong to any file. During this time it is unresponsive to any requests from the datanodes, and if ZooKeeper had been running, it would have taken the namenode down (ssh-fencing: kill).
When it has finished the 'round', it starts doing some normal work, which among other things means telling the datanode to delete the blocks. But by the time the datanode has deleted the blocks and is about to report back, the namenode has already started on the next round of reporting the same blocks that don't belong to any file. Thus, the datanode gets a timeout when reporting block updates for the deleted blocks. And this, of course, repeats itself over and over again...
There are actually two issues, I think:
1 - the namenode becomes totally unresponsive while reporting the blocks (could this be a DEBUG line instead of an INFO line?)
2 - the namenode seems to 'forget' that it has already reported those blocks just 2-3 minutes ago...
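
To make the timing concrete, here is a rough sketch of the cycle described above as a toy model. Only the 110-second busy period (1:50) comes from the observation above; the idle window, the deletion time and the RPC timeout are made-up values, and the function names are hypothetical, not Hadoop APIs. It also assumes that a timed-out report leaves the namenode's view of the blocks unchanged, which is what the behaviour suggests but has not been confirmed.

```python
def report_outcome(busy_s, idle_s, delete_s, rpc_timeout_s):
    """Outcome of the datanode's deletion report for one round.

    Time is measured from the moment the namenode becomes responsive and
    hands out the delete command. The namenode is then responsive for
    idle_s seconds, busy logging for busy_s seconds, and so on.
    """
    period = idle_s + busy_s
    phase = delete_s % period      # where in the cycle the report arrives
    if phase < idle_s:
        return "acknowledged"      # namenode is responsive right away
    wait = period - phase          # seconds until it is responsive again
    return "acknowledged" if wait <= rpc_timeout_s else "timeout"


def rounds_until_cleared(busy_s, idle_s, delete_s, rpc_timeout_s, max_rounds=10):
    """Number of rounds until the report gets through, or None if it never does."""
    for round_no in range(1, max_rounds + 1):
        if report_outcome(busy_s, idle_s, delete_s, rpc_timeout_s) == "acknowledged":
            return round_no
        # Assumption: after a timeout nothing has changed on the namenode,
        # so the next round starts from exactly the same state.
    return None


# 110 s busy (observed); 20 s idle, 30 s to delete, 60 s timeout (all hypothetical)
print(report_outcome(110, 20, 30, 60))        # timeout
print(rounds_until_cleared(110, 20, 30, 60))  # None
```

In this model the loop only breaks if the report arrives inside the idle window or the busy period becomes shorter than the RPC timeout, which is why the two issues above look like the places to attack.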