[HDFS-2378] recoverBlock timeout in DFSClient should be longer - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Duplicate
Affects Version/s: 0.23.0, 1.1.0
Fix Version/s: None
Component/s: hdfs-client
Labels: None

Description

In a failure scenario where one of the datanodes in a pipeline has "frozen" (e.g., hard swapping or disk controller issues), we sometimes see timeouts in the call to recoverBlock(). This is because recoverBlock's implementation sends several RPCs internally (to the NN and to other nodes in the pipeline), each with the same timeout as the outer call. Since the timeouts are equal, the "outer" call times out first. The retry then fails because recovery is already in progress or has already finished.

The best fix would be to make recoverBlock idempotent so that the retry doesn't fail. In the absence of that, we can likely fix this issue by increasing the recoverBlock timeout to equal the sum of the timeouts of the underlying recovery calls.
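
The timing argument above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not code from DFSClient: the class name, method names, and the 60-second value are all hypothetical, and the model assumes the inner RPCs run one after another and can each take up to their full timeout.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names exist in Hadoop.
class RecoveryTimeoutSketch {

    // Worst-case duration of the nested recovery RPCs, assuming they run
    // sequentially and each may block for its full timeout.
    static long worstCaseInnerMillis(long... innerTimeoutsMillis) {
        long total = 0;
        for (long t : innerTimeoutsMillis) {
            total = Math.addExact(total, t); // throws on overflow instead of wrapping
        }
        return total;
    }

    // True when the outer call can expire while the inner calls are still
    // legitimately waiting, which is the failure described in this issue.
    static boolean outerCanExpireFirst(long outerTimeoutMillis, long... innerTimeoutsMillis) {
        return outerTimeoutMillis < worstCaseInnerMillis(innerTimeoutsMillis);
    }

    public static void main(String[] args) {
        long rpcTimeout = TimeUnit.SECONDS.toMillis(60); // assumed value, for illustration only

        // Current behaviour: outer call and both inner calls share one timeout.
        System.out.println("equal timeouts, outer can expire first: "
                + outerCanExpireFirst(rpcTimeout, rpcTimeout, rpcTimeout));

        // Proposed workaround: outer timeout equals the sum of the inner ones.
        long outer = worstCaseInnerMillis(rpcTimeout, rpcTimeout);
        System.out.println("summed outer timeout (ms): " + outer);
        System.out.println("summed timeout, outer can expire first: "
                + outerCanExpireFirst(outer, rpcTimeout, rpcTimeout));
    }
}
```

The sketch only covers the timeout workaround. Making recoverBlock idempotent would remove the need for this budgeting altogether, since a retry after an early timeout would then succeed.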

Issue Links

is duplicated by: HDFS-2637 The rpc timeout for block recovery is too low (Closed)

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 7

Dates

Created: 28/Sep/11 04:15

Updated: 28/Sep/15 21:08

Resolved: 07/Dec/11 03:10