[HDFS-9923] Datanode disk failure handling is not consistent - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: datanode
Labels:
- failure-handling
- supportability

Description

Disk failures are hard to handle. This JIRA is for discussing how to make disk failure handling in the DataNode better and more consistent.

For one thing, disks can fail in many different ways: the hardware may be failing, the disk may be full, a checksum error may occur, and so on. For another, the hardware abstracts away the details, so it is hard for software to handle them.

As far as I know, there are currently three disk check mechanisms in HDFS: BlockScanner, BlockPoolSlice#checkDirs and DU. Each handles disk errors differently.

This JIRA focuses on DU error handling. DU may emit errors like this:

2016-02-18 02:23:36,224 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data/8/dfs/dn/current. Will throw later.
ExitCodeException exitCode=1: du: cannot access `/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir228/subdir11/blk_1088686909': Input/output error
du: cannot access `/data/8/dfs/dn/current/BP-1018136951-49.4.167.110-1403564146510/current/finalized/subdir228/subdir11/blk_1088686909_14954023.meta': Input/output error
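
The du output already names the files that could not be read, which is information a directory-level check does not have. As a rough illustration only, the sketch below pulls those paths out of du's error text. DuErrorParser and DuError are hypothetical names, not existing HDFS classes, and the pattern assumes the quoting style shown in this log; other du versions may format the message differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HDFS: extracts the paths that du could not read.
final class DuErrorParser {
    // Matches messages of the form: du: cannot access `PATH': REASON
    // Assumption: the quoting style seen in the log; other du versions may differ.
    private static final Pattern CANNOT_ACCESS =
        Pattern.compile("du: cannot access [`'](.+?)': (.+)$");

    // One unreadable path together with the reason du reported for it.
    static final class DuError {
        final String path;
        final String reason;

        DuError(String path, String reason) {
            this.path = path;
            this.reason = reason;
        }
    }

    // Scans du's error text line by line and collects every "cannot access" entry.
    static List<DuError> parse(String duOutput) {
        List<DuError> errors = new ArrayList<>();
        for (String line : duOutput.split("\\R")) {
            Matcher m = CANNOT_ACCESS.matcher(line);
            if (m.find()) {
                errors.add(new DuError(m.group(1), m.group(2)));
            }
        }
        return errors;
    }
}
```

A caller could pass the message of the caught exception to DuErrorParser.parse and then check the affected files directly instead of checking a fixed set of directories.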

While working on HDFS-9908 (Datanode should tolerate disk scan failure during NN handshake), I found that DU errors are not handled consistently; the outcome depends entirely on who catches the exception.

For example:

- if DU returns an error during the NN handshake, the DN cannot join the cluster at all (HDFS-9908);
- however, if the same exception is caught in BlockPoolSlice#saveDfsUsed, the DataNode only logs a warning and does nothing else (HDFS-5498);
- in some cases, such as the BlockReceiver constructor, the exception handler invokes BlockPoolSlice#checkDirs, but since that checks only three directories, it is very unlikely to find the files that have the error.

So my question is: should the error be handled in a consistent manner? Should the DataNode report disk failures to the NameNodes (the BlockScanner approach), and should the DataNode take the volume offline automatically if DU returns an error (the checkDirs approach)?
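
To make the question concrete, here is a minimal sketch of what a single shared policy could look like, so that every caller that catches a DU exception gets the same decision. Everything in it is an assumption for discussion: DuFailurePolicy, Action and the failure threshold are hypothetical and do not exist in HDFS, and the choice to report first and take the volume offline only after repeated failures is just one possible answer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not an existing HDFS class: one place that decides
// what to do when DU fails on a volume, regardless of who caught the error.
final class DuFailurePolicy {
    enum Action { REPORT_TO_NAMENODE, TAKE_VOLUME_OFFLINE }

    // Assumed tunable: consecutive DU failures tolerated before going offline.
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> failuresByVolume = new HashMap<>();

    DuFailurePolicy(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Called by any code path that catches a DU error for the given volume.
    synchronized Action onDuFailure(String volume) {
        int count = failuresByVolume.merge(volume, 1, Integer::sum);
        return count >= maxConsecutiveFailures
            ? Action.TAKE_VOLUME_OFFLINE
            : Action.REPORT_TO_NAMENODE;
    }

    // A successful DU run clears the failure history, so transient errors are tolerated.
    synchronized void onDuSuccess(String volume) {
        failuresByVolume.remove(volume);
    }
}
```

With a threshold of 3, the first two consecutive failures on a volume would only be reported and the third would take the volume offline, while a successful scan in between resets the count.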

Issue Links

relates to

HADOOP-8640 DU thread transient failures propagate to callers (Resolved)

HDFS-9819 FsVolume should tolerate few times check-dir failed due to deletion by mistake (Resolved)

HDFS-9908 Datanode should tolerate disk scan failure during NN handshake (Resolved)

HDFS-8845 DiskChecker should not traverse the entire tree (Closed)

People

Assignee: Unassigned

Reporter: Wei-Chiu Chuang

Votes: 1

Watchers: 9

Dates

Created: 08/Mar/16 20:15

Updated: 19/May/16 22:01