[HDFS-15644] Failed volumes can cause DNs to stop block reporting - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.2.2, 3.3.1, 3.4.0, 2.10.2, 3.2.3
Component/s: block placement, datanode
Labels:
- refactor

Description

Daryn found a corner case where removing failed volumes can cause an NPE in FsDatasetImpl.getBlockReports().

Scenario:

Inside Datanode#handleVolumeFailures(), removing a failed volume is a two-step process:
- First, the volume is removed from the volumes list.
- Later, its replicas are scrubbed from the volume map.
Between the two steps, a concurrent thread generating block reports may read the replica map and look up a volume ID that no longer exists.
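The window between the two steps can be reproduced deterministically with a simplified model. The sketch below is not Hadoop code: the class VolumeRemovalRace, its fields, and the storage IDs are hypothetical stand-ins for the volumes list and the replica map, and the two steps are run by hand on one thread instead of by competing threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the DataNode state; not Hadoop code.
final class VolumeRemovalRace {
    // Stand-in for the volumes list.
    final List<String> volumes = new ArrayList<>(List.of("vol-1", "vol-2"));
    // Stand-in for the replica map: block ID -> storage ID of its volume.
    final Map<Long, String> replicaToVolume =
            new HashMap<>(Map.of(100L, "vol-1", 200L, "vol-2"));

    // Same shape as a block report: one builder per live volume,
    // then each replica is added to the builder of its volume.
    Map<String, List<Long>> buildReports() {
        Map<String, List<Long>> builders = new HashMap<>();
        for (String v : volumes) {
            builders.put(v, new ArrayList<>());
        }
        for (Map.Entry<Long, String> e : replicaToVolume.entrySet()) {
            // builders.get(...) returns null for a volume that was already
            // removed from the list, so this call throws an NPE.
            builders.get(e.getValue()).add(e.getKey());
        }
        return builders;
    }

    public static void main(String[] args) {
        VolumeRemovalRace dn = new VolumeRemovalRace();
        dn.volumes.remove("vol-2");                              // step 1
        try {
            dn.buildReports();                                   // runs inside the window
        } catch (NullPointerException e) {
            System.out.println("NPE: replica still points at a removed volume");
        }
        dn.replicaToVolume.values().removeIf("vol-2"::equals);   // step 2
        System.out.println(dn.buildReports());                   // safe again
    }
}
```

Once step 2 has run, the report succeeds again, which is why the failure only shows up when a report happens to land inside the window.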

He made a fix for that and we have been using it on our clusters since Hadoop-2.7.

Code analysis shows that the bug still applies to trunk.

The path Datanode#removeVolumes() is safe because the two-step process in FsDatasetImpl.removeVolumes() (FsDatasetImpl.java#L577) is protected by datasetWriteLock.
The path Datanode#handleVolumeFailures() is not safe because the failed volume is removed from the list without acquiring datasetWriteLock (FsVolumeList#239).
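The difference between the two paths comes down to whether both steps run under one write lock. A minimal sketch of the safe pattern follows, assuming that the report reader takes the matching read lock. GuardedDataset and its methods are hypothetical, and java.util.concurrent.locks.ReentrantReadWriteLock stands in for datasetWriteLock; this is not the Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model: both removal steps happen under one write lock.
final class GuardedDataset {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> volumes = new HashSet<>();
    private final Map<Long, String> replicaToVolume = new HashMap<>();

    void addReplica(long blockId, String storageId) {
        lock.writeLock().lock();
        try {
            volumes.add(storageId);
            replicaToVolume.put(blockId, storageId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Step 1 and step 2 are atomic with respect to readers.
    void removeFailedVolume(String storageId) {
        lock.writeLock().lock();
        try {
            volumes.remove(storageId);                             // step 1
            replicaToVolume.values().removeIf(storageId::equals);  // step 2
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers never observe a replica whose volume is already gone.
    Map<String, List<Long>> blockReports() {
        lock.readLock().lock();
        try {
            Map<String, List<Long>> builders = new HashMap<>();
            for (String v : volumes) {
                builders.put(v, new ArrayList<>());
            }
            replicaToVolume.forEach((id, v) -> builders.get(v).add(id));
            return builders;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this layout a report sees the state either before step 1 or after step 2, never in between.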

The race condition can cause the caller of getBlockReports() to throw an NPE if a replica in the RUR (replica under recovery) state refers to a volume that has already been removed (FsDatasetImpl.java#L1976).

        case RUR:
          ReplicaInfo orig = b.getOriginalReplica();
          builders.get(volStorageID).add(orig);
          break;
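A reader can also defend itself by checking the builder lookup before using it. The sketch below shows one possible guard under the same simplified model; BlockReportGuard and its parameters are hypothetical, and it is not a claim about what the attached patches change.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: skips replicas whose volume has no builder.
final class BlockReportGuard {
    static Map<String, List<Long>> buildReports(
            Collection<String> liveVolumes, Map<Long, String> replicaToVolume) {
        Map<String, List<Long>> builders = new HashMap<>();
        for (String v : liveVolumes) {
            builders.put(v, new ArrayList<>());
        }
        for (Map.Entry<Long, String> e : replicaToVolume.entrySet()) {
            List<Long> builder = builders.get(e.getValue());
            if (builder == null) {
                // The volume was removed from the list but its replicas
                // have not been scrubbed yet; leave them out of the report.
                continue;
            }
            builder.add(e.getKey());
        }
        return builders;
    }
}
```

The trade-off is that replicas of the failed volume are silently omitted from that report, which matches what the report would contain once the scrub completes.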

Attachments

- HDFS-15644.001.patch (20/Oct/20 20:03, 2 kB, Ahmed Hussein)
- HDFS-15644.002.patch (21/Oct/20 16:50, 2 kB, Ahmed Hussein)
- HDFS-15644-branch-2.10.002.patch (26/Oct/20 14:56, 2 kB, Ahmed Hussein)

People

Assignee: Ahmed Hussein

Reporter: Ahmed Hussein

Votes: 0

Watchers: 6

Dates

Created: 20/Oct/20 18:41

Updated: 10/Jun/21 07:44

Resolved: 28/Oct/20 16:24