daryn found a corner case where removing failed volumes can cause an NPE in FsDatasetImpl.getBlockReports().
- Inside Datanode#handleVolumeFailures(), removing a failed volume is a two-step process.
- First, the volume is removed from the volumes list.
- Only later are its replicas scrubbed from the volume map.
- Between those two steps, a concurrent thread generating block reports may read the replicaMap and look up a volume ID that no longer exists.
He made a fix for that and we have been using it on our clusters since Hadoop-2.7.
From analyzing the code, the bug still applies to trunk.
- The path Datanode#removeVolumes() is safe because the two-step process in FsDatasetImpl.removeVolumes() is protected by datasetWriteLock.
- The path Datanode#handleVolumeFailures() is not safe because the failed volume is removed from the list without acquiring datasetWriteLock (see FsVolumeList#239).
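To illustrate why the locked path is safe, here is a minimal sketch in which both removal steps happen under one write lock, so a reader holding the read lock can never observe the state in between. The class VolumeRemovalSketch and its fields are hypothetical stand-ins, not the real FsDatasetImpl or FsVolumeList code, and this is not necessarily how the actual fix is written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the dataset; names are assumptions for illustration only.
class VolumeRemovalSketch {
    private final ReentrantReadWriteLock datasetLock = new ReentrantReadWriteLock();
    private final List<String> volumes = new ArrayList<>();
    private final Map<String, String> replicaToVolume = new HashMap<>();

    void addReplica(String replicaId, String volumeId) {
        datasetLock.writeLock().lock();
        try {
            if (!volumes.contains(volumeId)) {
                volumes.add(volumeId);
            }
            replicaToVolume.put(replicaId, volumeId);
        } finally {
            datasetLock.writeLock().unlock();
        }
    }

    // Both steps run under the same write lock, so no reader sees step 1 without step 2.
    void removeFailedVolume(String volumeId) {
        datasetLock.writeLock().lock();
        try {
            volumes.remove(volumeId);                            // step 1: drop from the volumes list
            replicaToVolume.values().removeIf(volumeId::equals); // step 2: scrub its replicas
        } finally {
            datasetLock.writeLock().unlock();
        }
    }

    // Reader-side check: every replica must point at a volume that is still listed.
    boolean isConsistent() {
        datasetLock.readLock().lock();
        try {
            return volumes.containsAll(replicaToVolume.values());
        } finally {
            datasetLock.readLock().unlock();
        }
    }
}
```

If step 1 ran outside the lock, as described for handleVolumeFailures(), a reader could run between the two steps and find replicas that point at a volume missing from the list.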
The race condition can cause the caller of getBlockReports() to throw an NPE if the RUR (replica under recovery) refers to a volume that has already been removed:
    case RUR:
      ReplicaInfo orig = b.getOriginalReplica();
      builders.get(volStorageID).add(orig);
      break;
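The failure mode in that snippet is that builders.get(volStorageID) returns null once the volume is gone, and the chained add() then throws. A defensive variant, sketched below with hypothetical names (BlockReportSketch, buildReports) and plain strings in place of the real replica and builder types, skips replicas whose volume no longer has a builder. Whether the fix in use on our clusters takes this form is not stated here, so treat this as one possible mitigation only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real getBlockReports() uses different types.
class BlockReportSketch {
    static Map<String, List<String>> buildReports(List<String> liveVolumes,
                                                  Map<String, String> replicaToVolume) {
        // One report builder per volume that is still in the volumes list.
        Map<String, List<String>> builders = new HashMap<>();
        for (String volume : liveVolumes) {
            builders.put(volume, new ArrayList<>());
        }
        for (Map.Entry<String, String> e : replicaToVolume.entrySet()) {
            List<String> builder = builders.get(e.getValue());
            if (builder == null) {
                // The volume was removed but its replicas are not scrubbed yet: skip instead of NPE.
                continue;
            }
            builder.add(e.getKey());
        }
        return builders;
    }
}
```

The null check only hides the stale replica from the report. It does not remove the window between the two removal steps, so taking datasetWriteLock around both steps is still the more complete remedy.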