When you do the following:
- Decommission a single node.
- The under-replicated block count jumps to include every block stored on that node.
- Stop the decommission.
- The under-replicated count slowly drains back toward 0.
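The count an administrator watches during this sequence comes from the fsck summary. A minimal sketch of extracting it, using a canned sample file since the real command (`hdfs fsck /`) needs a running cluster; the sample format and numbers here are illustrative, not taken from a real report:

```shell
# Illustrative excerpt of an `hdfs fsck /` summary (sample data, not real output)
cat <<'EOF' > /tmp/fsck.out
 Total blocks (validated):      1000000
 Under-replicated blocks:       250000 (25.0 %)
EOF

# Pull out the raw under-replicated block count that spikes during a decommission
awk '/Under-replicated blocks:/ {print $3}' /tmp/fsck.out
```

Polling this number over time shows the slow drain back to 0 described above, but nothing in the output says *why* it spiked, which is the core of the complaint.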
This is expected HDFS behavior, but while it is happening, utilities like dfshealth.jsp and fsck report large numbers of under-replicated blocks, and the node does not appear on the dead or decommissioned node lists. It is therefore unclear to novice administrators whether this is a failure condition that needs administrative attention.
Administrators find themselves repeatedly explaining the under-replication number when they could be doing better things with their time, and they keep receiving alarms that can safely be disregarded. That raises a "cry wolf" problem: a real issue gets lost in the noise.
A direct quote from such an administrator:
"When a datanode fails, it's not considered a 'decommissioning', so it does not show up in that list, it just simply kicks on the underrep and we have to hunt through the LIVE list and attempt to find out which node caused the issue. Obviously, we (the community) are not being told on the DEAD list when a node appears (why this information has to be withheld has always been an issue with me, how hard is it to put a date field in the DEAD list?)"
In any case, we should surface more information about a dying node instead of watching the under-replicated count jump from 0 to millions with no obvious cause. Even an extra column flagging the node as a 'DYING NODE' would help.