Hadoop HDFS / HDFS-10777

DataNode should report & remove volume failures if DU cannot access files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.8.0
    • Fix Version/s: None
    • Component/s: datanode
    • Labels: None

    Description

      HADOOP-12973 refactored DU and made it pluggable. The refactoring has a side effect: if DU encounters an exception, the exception is caught, logged, and ignored. This essentially fixes HDFS-9908, in which runaway exceptions prevented DataNodes from handshaking with NameNodes.
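      A minimal sketch of the swallow-and-log pattern described above, assuming a background refresh thread like the pluggable DU that HADOOP-12973 introduced. The class and method names here are hypothetical illustrations, not the actual Hadoop sources:

      {code:java}
      import java.io.IOException;
      import java.util.concurrent.atomic.AtomicLong;

      // Hypothetical stand-in for the pluggable DU refresh loop.
      public class CachingSpaceUsageRefresher implements Runnable {
        private final AtomicLong used = new AtomicLong();
        private final long refreshIntervalMs;

        public CachingSpaceUsageRefresher(long refreshIntervalMs) {
          this.refreshIntervalMs = refreshIntervalMs;
        }

        public long getUsed() {
          return used.get();
        }

        // Stand-in for shelling out to du(1); a failing disk surfaces
        // here as an IOException such as "Input/output error".
        protected long runDu() throws IOException {
          throw new IOException("du: cannot access file: Input/output error");
        }

        @Override
        public void run() {
          while (!Thread.currentThread().isInterrupted()) {
            try {
              used.set(runDu());
            } catch (IOException e) {
              // The side effect described above: the exception is logged
              // and swallowed, so the DataNode takes no further action.
              System.err.println("failed to refresh space usage: " + e);
            }
            try {
              Thread.sleep(refreshIntervalMs);
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
            }
          }
        }
      }
      {code}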

      However, this "fix" is not a good one: if the disk is bad, the DataNode takes no immediate action beyond logging the exception. The existing FsDatasetSpi#checkDataDir has been reduced to blindly checking only a small number of directories. When a disk starts to go bad, often only a few of its files are affected at first, so a spot check of a handful of directories can easily overlook the degraded disk.

      I propose that, in addition to logging the exception, the DataNode proactively verify that the files are inaccessible, remove the volume, and make the failure visible through JMX, so that administrators can spot it via their monitoring systems.
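      A rough sketch of the proposed handling, under the assumption that the DU failure can hand the DataNode the path(s) it could not access. VolumeRegistry and all method names below are hypothetical placeholders for the real FsDatasetSpi and JMX plumbing, not an actual API:

      {code:java}
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.List;

      public class DuFailureHandler {

        // Hypothetical hook for volume bookkeeping; the real change would
        // plug into FsDatasetSpi and the volume-failure JMX metrics.
        public interface VolumeRegistry {
          void removeVolume(Path volumeRoot, String reason);
          int getNumFailedVolumes(); // value to expose through JMX
        }

        // Called when DU throws while scanning a volume: verify that the
        // suspect files really are inaccessible before removing the volume.
        public void onDuFailure(Path volumeRoot, List<Path> suspectFiles,
                                VolumeRegistry registry) {
          for (Path f : suspectFiles) {
            try (InputStream in = Files.newInputStream(f)) {
              in.read(); // a one-byte read is enough to surface an I/O error
            } catch (IOException e) {
              registry.removeVolume(volumeRoot,
                  "DU could not access " + f + ": " + e.getMessage());
              return; // one confirmed bad file is enough to fail the volume
            }
          }
        }
      }
      {code}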

      A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.

      Attachments

        1. HDFS-10777.01.patch (11 kB, Wei-Chiu Chuang)


      People

        Assignee: Wei-Chiu Chuang (weichiu)
        Reporter: Wei-Chiu Chuang (weichiu)
        Votes: 0
        Watchers: 5
