[HDFS-7208] NN doesn't schedule replication when a DN storage fails - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.6.0
Component/s: namenode
Labels: None
Hadoop Flags: Reviewed

Description

When a storage device on a DN fails, the NN continues to treat the replicas on that storage as valid and doesn't schedule replication for them.

In our case a DN has 12 storage disks, so it sends one blockReport per storage. When a disk fails, the number of blockReports from that DN drops from 12 to 11. Because dfs.datanode.failed.volumes.tolerated is configured to be > 0, the NN still considers that DN healthy.

1. A disk failed. All blocks of that disk are removed from DN dataset.

2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current

2. NN receives DatanodeProtocol.DISK_ERROR, but that isn't enough to make the NN remove the DN and its replicas from the BlocksMap. In addition, blockReport processing doesn't produce a diff for the failed storage, because the diff is computed per storage and the failed storage no longer sends a report.

2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current

3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.
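
The sequence above can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: the class PerStorageReportSketch, its methods, and the storage names "disk5" and "disk6" are all hypothetical, and markStorageFailed is only one conceivable remedy, not necessarily what the attached patches do. It assumes, as described above, that each block report replaces the NN's view of a single storage, so a storage that stops reporting is never diffed and its replicas stay in the map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (not Hadoop code) of the NN's view of one DN, keyed by storage ID. */
public class PerStorageReportSketch {
    // storage ID -> block IDs the NN believes are stored there
    private final Map<String, Set<Long>> blocksByStorage = new HashMap<>();

    /** A per-storage report replaces the NN's view of that one storage only. */
    public void processReport(String storageId, Set<Long> reportedBlocks) {
        blocksByStorage.put(storageId, new HashSet<>(reportedBlocks));
    }

    /** Hypothetical remedy: drop every replica recorded for a failed storage. */
    public void markStorageFailed(String storageId) {
        blocksByStorage.remove(storageId);
    }

    /** True if any storage of this DN is still believed to hold the block. */
    public boolean hasReplica(long blockId) {
        for (Set<Long> blocks : blocksByStorage.values()) {
            if (blocks.contains(blockId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PerStorageReportSketch nn = new PerStorageReportSketch();
        nn.processReport("disk5", Collections.singleton(1L));
        nn.processReport("disk6", Collections.singleton(1121568886L));

        // disk6 fails: the next round of reports covers disk5 only,
        // so nothing ever touches the entry for disk6.
        nn.processReport("disk5", Collections.singleton(1L));
        System.out.println(nn.hasReplica(1121568886L)); // prints true (stale replica)

        // Explicitly handling the failed storage removes the stale replica.
        nn.markStorageFailed("disk6");
        System.out.println(nn.hasReplica(1121568886L)); // prints false
    }
}
```

In this model the stale entry survives any number of later reports from the healthy storages, which matches the fsck observation in step 3.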

Attachments

- HDFS-7208.patch (13/Oct/14 23:08, 19 kB, Ming Ma)
- HDFS-7208-2.patch (15/Oct/14 04:08, 19 kB, Ming Ma)
- HDFS-7208-3.patch (15/Oct/14 22:39, 19 kB, Ming Ma)
- HDFS-7208-AdMaster.patch (25/Nov/15 07:31, 1.0 kB, Liu Zhe)

Issue Links

is duplicated by:
- HDFS-7453 Namenode does not recognize block is missing on a datanode (Resolved)

is related to:
- HDFS-15274 NN doesn't remove the blocks from the failed DatanodeStorageInfo (Patch Available)

relates to:
- HDFS-13945 TestDataNodeVolumeFailure is Flaky (Resolved)
- HDFS-7355 TestDataNodeVolumeFailure#testUnderReplicationAfterVolFailure fails on Windows, because we cannot deny access to the file owner. (Closed)

People

Assignee: Ming Ma

Reporter: Ming Ma

Votes: 0

Watchers: 11

Dates

Created: 08/Oct/14 06:16

Updated: 13/Apr/20 03:53

Resolved: 16/Oct/14 03:52