[HADOOP-8640] DU thread transient failures propagate to callers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 2.0.0-alpha, 1.2.1
Fix Version/s: None
Component/s: fs, io
Labels: None

Description

When running some stress tests, I saw a failure where the DURefreshThread failed due to the filesystem changing underneath it:

org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory

(the block was probably finalized while the du process was running, which caused it to fail)

The next block write then called getUsed(), and the exception was propagated to that caller, causing the write to fail. Because this was a pseudo-distributed cluster, the client had no other node to write to, so the whole write failed.

The current behavior of propagating the exception to the next (and only the next) caller doesn't seem well-thought-out.
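
One possible alternative, shown here only as a minimal sketch and not as Hadoop's actual DU code: keep serving the last successful measurement and fail callers only after several consecutive refresh failures. The class TolerantUsage, the UsageProbe interface, and the maxConsecutiveFailures threshold are all hypothetical names introduced for illustration, and the sketch throws an unchecked exception from getUsed() purely to keep the example short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; this is not Hadoop's DU class. */
final class TolerantUsage {
    /** Stand-in for the external "du" invocation, which may fail transiently. */
    interface UsageProbe {
        long measure() throws IOException;
    }

    private final UsageProbe probe;
    private final int maxConsecutiveFailures; // assumed tuning knob
    private long lastGoodUsed;
    private int consecutiveFailures;
    private IOException lastError;

    TolerantUsage(UsageProbe probe, long initialUsed, int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.probe = probe;
        this.lastGoodUsed = initialUsed;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Called by the refresh thread on each interval. */
    synchronized void refresh() {
        try {
            lastGoodUsed = probe.measure();
            consecutiveFailures = 0;
            lastError = null;
        } catch (IOException e) {
            // Record the failure but keep the previous value available.
            consecutiveFailures++;
            lastError = e;
        }
    }

    /** Returns the last good value unless failures have persisted. */
    synchronized long getUsed() {
        if (consecutiveFailures >= maxConsecutiveFailures) {
            throw new UncheckedIOException(
                "usage probe failed " + consecutiveFailures + " times in a row", lastError);
        }
        return lastGoodUsed;
    }
}
```

With this shape, a single failed scan (such as a block being finalized mid-scan) leaves getUsed() returning a slightly stale number, while a persistently failing disk still surfaces an error to every caller rather than to only the next one.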

Issue Links

is related to

HDFS-9923 Datanode disk failure handling is not consistent (Open)

HDFS-9908 Datanode should tolerate disk scan failure during NN handshake (Resolved)

People

Assignee: Unassigned

Reporter: Todd Lipcon

Votes: 3

Watchers: 16

Dates

Created: 02/Aug/12 07:28

Updated: 12/May/18 05:42

Resolved: 12/May/18 05:42