For future jira explorers:
After we pulled HDFS-11445 into CDH, our internal testing caught a regression in it. After tracing the code, I realized the regression is fixed by HDFS-11755.
Specifically, we found bogus missing-file warnings when running the Solr example application via Hue.
What's interesting is that JMX shows MissingBlocks > 0, but there are no missing file names; the NameNode Web UI also warns about missing blocks. Yet the fsck result is healthy.
Steps to reproduce:
1. Install a fresh CDH + CM cluster, 4 nodes (with
2. Go to Hue UI, install Solr example.
3. Restart CDH (all services)
In more detail, this bug seems to happen after writing through a data pipeline. A DFSClient calls FSNamesystem#updatePipeline after it gets acks from the pipeline. However, if FSNamesystem#updatePipeline completes before the DataNodes report IBRs (incremental block reports), the block sees zero live replicas, because the NameNode incorrectly considers all existing replicas stale once the generation stamp is updated.
I checked and compared the code in BlockManager#removeStoredBlock and BlockManager#addStoredBlock.
In removeStoredBlock, in addition to removing a DataNode from a stored block, BlockManager also updates the under-replication queue (updateNeededReplications); but in addStoredBlock, it does not update the under-replication queue after adding a DataNode to a stored block if the file is under construction.
The fix in HDFS-11755 adds an additional check in BlockManager#removeStoredBlock to skip updating the under-replication queue when the file is under construction, which fixes the problem.
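To make the race concrete, here is a toy simulation (not actual HDFS code; all class and method names below are hypothetical stand-ins). It shows how bumping the generation stamp before IBRs arrive makes every stored replica look stale, and how skipping the queue update for under-construction files avoids the spurious "missing" state:

```java
import java.util.*;

public class PipelineRaceSketch {
    // Simplified stand-in for a stored replica on a DataNode.
    static class Replica {
        long genStamp;
        Replica(long gs) { genStamp = gs; }
    }

    // Simplified stand-in for a block tracked by the NameNode.
    static class Block {
        long genStamp = 1;
        boolean underConstruction = true;
        List<Replica> replicas = new ArrayList<>();
    }

    // Replicas whose genstamp lags the block's current genstamp are
    // treated as stale, mirroring the behavior described above.
    static int liveReplicas(Block b) {
        int live = 0;
        for (Replica r : b.replicas)
            if (r.genStamp == b.genStamp) live++;
        return live;
    }

    // Sketch of the HDFS-11755 idea: skip flagging a block whose file
    // is still under construction, since its IBRs are simply not in yet.
    static boolean flaggedMissing(Block b, boolean withFix) {
        if (withFix && b.underConstruction) return false;
        return liveReplicas(b) == 0;
    }

    public static void main(String[] args) {
        Block b = new Block();
        // Three replicas reported with the original genstamp 1.
        b.replicas.add(new Replica(1));
        b.replicas.add(new Replica(1));
        b.replicas.add(new Replica(1));

        // updatePipeline bumps the genstamp before any IBR arrives,
        // so all three replicas now look stale.
        b.genStamp = 2;

        System.out.println("without fix, missing = " + flaggedMissing(b, false));
        System.out.println("with fix, missing = " + flaggedMissing(b, true));
    }
}
```

Without the fix, the block is reported missing even though all three replicas exist and will catch up once the IBRs land; with the fix, the under-construction file is left out of the queue.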
Because HDFS-11755 was committed to branch-2.8 through trunk before HDFS-11445, this bug is not seen in those branches. But we should backport HDFS-11755 to branch-2.7 to address the regression.