[HDFS-9106] Transfer failure during pipeline recovery causes permanent write failures - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 2.7.2, 3.0.0-alpha1
Component/s: None
Labels: None
Target Version/s: 2.8.0, 2.6.5
Hadoop Flags: Reviewed

Description

When a new node is added to a write pipeline during flush/sync, if the partial block transfer fails, the write will fail permanently without retrying or continuing with whatever is in the pipeline.

The transfer often fails in busy clusters due to timeout. There is no per-packet ACK between client and datanode or between source and target datanodes. If the total transfer time exceeds the configured timeout + 10 seconds (2 * 5 seconds slack), it is considered failed. Naturally, the failure rate is higher with bigger block sizes.
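
To illustrate why larger blocks fail more often, here is a rough sketch of the deadline arithmetic described above. Only the 10 second slack (2 * 5 seconds) comes from this issue. The class and method names are hypothetical, and the 60 second timeout and 2 MiB/s throughput are assumed values chosen for illustration, not measurements or HDFS defaults taken from this report.

```java
public class TransferDeadline {
    // Slack described in this issue: 2 * 5 seconds on top of the configured timeout.
    public static final long SLACK_MS = 2 * 5_000L;

    /** Total time allowed for the whole transfer before it is considered failed. */
    public static long deadlineMs(long configuredTimeoutMs) {
        return configuredTimeoutMs + SLACK_MS;
    }

    /** Estimated transfer time at an assumed sustained throughput (hypothetical model). */
    public static long estimatedTransferMs(long bytes, long bytesPerSecond) {
        return bytes * 1000L / bytesPerSecond;
    }

    /** True if the estimated transfer time exceeds the deadline. */
    public static boolean wouldTimeOut(long bytes, long bytesPerSecond, long configuredTimeoutMs) {
        return estimatedTransferMs(bytes, bytesPerSecond) > deadlineMs(configuredTimeoutMs);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long timeoutMs = 60_000L;   // assumed configured timeout
        long throughput = 2 * mib;  // assumed 2 MiB/s on a busy cluster

        // 128 MiB takes about 64 s, which is under the 70 s deadline.
        System.out.println(wouldTimeOut(128 * mib, throughput, timeoutMs));
        // 512 MiB takes about 256 s, which is well over the 70 s deadline.
        System.out.println(wouldTimeOut(512 * mib, throughput, timeoutMs));
    }
}
```

Because nothing is acknowledged per packet, the deadline applies to the transfer as a whole, so under this model the estimated time grows with block size while the allowed time stays fixed.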

I propose the following changes:

- The transfer timeout needs to be different from the per-packet timeout.
- The transfer should be retried if it fails.
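
The two proposals can be sketched together as follows. This is a minimal illustration, not the attached patch: `TransferRetry`, the `Transfer` interface, and the `maxAttempts` parameter are hypothetical names, and the real client code in HDFS is structured differently.

```java
import java.io.IOException;

public class TransferRetry {
    /** Hypothetical stand-in for one partial-block transfer attempt. */
    public interface Transfer {
        void run(long timeoutMs) throws IOException;
    }

    /**
     * Runs the transfer with its own timeout, separate from any per-packet
     * timeout, and retries on failure. Returns the number of attempts used.
     */
    public static int transferWithRetry(Transfer transfer, long transferTimeoutMs, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                transfer.run(transferTimeoutMs);
                return attempt; // transfer succeeded
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed; surface the last error
    }
}
```

A caller would pass a transfer-specific timeout here instead of reusing the per-packet value, so a slow but progressing transfer of a large block is not treated as a failure as early.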

Attachments

- HDFS-9106-poc.patch (4 kB, Kihwal Lee, 18/Sep/15 17:38)
- HDFS-9106.patch (4 kB, Kihwal Lee, 23/Sep/15 20:13)
- HDFS-9106.branch-2.7.patch (4 kB, Kihwal Lee, 28/Sep/15 20:18)

Issue Links

is related to

- HDFS-1675 Transfer RBW between datanodes (Closed)
- HDFS-8311 DataStreamer.transfer() should timeout the socket InputStream. (Resolved)

People

Assignee: Kihwal Lee

Reporter: Kihwal Lee

Votes: 0

Watchers: 15

Dates

Created: 18/Sep/15 16:27

Updated: 06/Jan/17 00:55

Resolved: 28/Sep/15 20:17