Issue Details (XML | Word | Printable)

Key: HADOOP-1263
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Hairong Kuang
Reporter: Christian Kunz
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

retry logic when dfs exist or open fails temporarily, e.g because of timeout

Created: 17/Apr/07 05:27 AM   Updated: 08/Jul/09 04:42 PM
Return to search
Component/s: None
Affects Version/s: 0.12.3
Fix Version/s: 0.13.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works retry.patch 2007-05-02 12:43 AM Hairong Kuang 8 kB
Text File Licensed for inclusion in ASF works retry1.patch 2007-05-03 07:14 PM Hairong Kuang 9 kB

Resolution Date: 07/May/07 07:43 PM


 Description  « Hide
Sometimes, when many (e.g. 1000+) map jobs start at about the same time and require supporting files from filecache, it happens that some map tasks fail because of rpc timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see Hadoop-1182). It is much better with a higher number of handlers, but some map tasks still fail.

This could be avoided if rpc clients did retry when encountering a timeout before throwing an exception.

Examples of exceptions:

java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:473)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.(ChecksumFileSystem.java:110)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:473)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hairong Kuang made changes - 24/Apr/07 10:52 PM
Field Original Value New Value
Assignee Hairong Kuang [ hairong ]
Hairong Kuang added a comment - 24/Apr/07 11:17 PM
The rpc time out problem is not specific to dfs. Shall we fix it in the ipc package? I am thinking to use an exponential backoff algorithm to reduce the load on the name server.

dhruba borthakur added a comment - 24/Apr/07 11:23 PM
I like the idea of having a small timeout at first and then having exponential backoff so as to not load the server. This is definitely a good thing for "server scalability". If you decide to do it in the ipc package, then all the retry loops in the DFSClient might need to go away.

Tom White added a comment - 25/Apr/07 08:12 AM
We should be able to use the org.apache.hadoop.io.retry package for this. See http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/package-summary.html. There are a number of policies available (http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/RetryPolicies.html), including one that sleeps by an amount proportional to the number of retries. Writing one with exponential backoff would be easy too.

See also HADOOP-601.


Hairong Kuang added a comment - 25/Apr/07 06:50 PM
Tome, thank you so much. Owen also mentioned the retry package to me yesterday. I start to take a look at it and try to figure out how to use it. The info that you provided is definitely helpful.

Hairong Kuang added a comment - 25/Apr/07 09:37 PM
On a second thought, I feel that simply adding a retry mechanism to ipc may not work because not all the server operations are idempotent. One option is to add an additional parameter to each call indicating if the call should be retried or not when time out. But it seems that our current rpc framework is hard to support this feature.

Hairong Kuang added a comment - 25/Apr/07 11:31 PM
The annotation method proposed in HADOOP-601 to provide a general retry framework in rpc seems to be a simple solution, but since it is not implemented, for this jira, I plan to implement the retry mechanism for only ClientProtocol using the retry framework implemented in HADOOP-997. Here are what I plan to do:

1. Add an exponential backoff policy to RetryPolicies.
2. Create a retry proxy for the dfs client using the following method-to-RetryPolicy hashmap:

  • TRY-ONCE-THEN-FAIL: create, addBlock, complete
  • EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock, abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints, renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
  • I still have not decided which retry policy to use for
    (1) rename, delete, mkdirs because a retry following a successful operation at the server side will return false instead of true;
    (2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade, metaSave because I still need time to read the code for these methods.

Any suggestion is welcome!


dhruba borthakur added a comment - 28/Apr/07 12:05 AM
The think the create()and cleanup() RPCs can be retried too. In fact, the DFSClient, in it current incantation, does retry the create() RPC three times and the cleanup() RPC indefinitely. I believe that it is safe to retry them because:

1. The first attempt did not even reach the namenode. The retried second attempt can reach the namenode and successfully complete.
2. The first attempt was processed by the namenode successfully but the response did not reach the dfsclient. The dfsclient can retry and the retries will fail. The entire operation fails anyway.

Either way, it is safe to retry the create() & cleanup() RPCs.


Hairong Kuang added a comment - 01/May/07 08:30 PM
> The think the create()and cleanup() RPCs can be retried too.
I believe that Dhruba meant create() and complete(). +1 on his suggestion.

Hairong Kuang added a comment - 01/May/07 10:00 PM
This patch mainly builds a retry mechanism for DFSClient. For all the proposed operations, it retries at most a configurable times using an exponential backoff algorithm when receiving SocketTimeoutException. It retries at most a configurable times using a fixed interval when receiving AlreadyBeingCreatedException.

It also modifies the io.retry package a litle bit.
1. It allows RetryProxy to take a non-public interface. Thank Tom for making the change.
2. It adds an ExponentialBackoff retry policy to RetryPolicies.


Hairong Kuang made changes - 01/May/07 10:00 PM
Attachment retry.patch [ 12356579 ]
Hairong Kuang made changes - 01/May/07 10:11 PM
Attachment retry.patch [ 12356579 ]
Hairong Kuang made changes - 01/May/07 10:12 PM
Attachment retry.patch [ 12356582 ]
Raghu Angadi added a comment - 01/May/07 10:18 PM
regd namenode.complete() :

Many times this returns "false" to indicate that not all block have been reported and the client is expected to retry complete(). This patch seems to remove this part.


Hairong Kuang added a comment - 02/May/07 12:23 AM
Raghu is right. This patch incorporates his comment.

Hairong Kuang made changes - 02/May/07 12:23 AM
Attachment retry.patch [ 12356591 ]
Hairong Kuang made changes - 02/May/07 12:26 AM
Attachment retry.patch [ 12356582 ]
Hairong Kuang made changes - 02/May/07 12:43 AM
Attachment retry.patch [ 12356592 ]
Hairong Kuang made changes - 02/May/07 12:43 AM
Attachment retry.patch [ 12356591 ]
Tom White added a comment - 02/May/07 04:24 PM
It is worth reviewing the logging, especially regarding the log level/retry count combination. In particular RetryInvocationHandler always logs exceptions at warn level, which is probably wrong. If the operation is to be retried then log at info, and only log at warn when the operation will not be retried again. (Also, it might be nice to change the last log message to say how many times the operation was retried before it finally failed.)

Also, could you add a unit test for ExponentialBackoffRetry please?

I notice that ExponentialBackoffRetry uses a delay that incorporates a random number. This is probably a good idea to avoid the problem you are trying to fix, but since the other policies are deterministic I would make this difference clear in the name: e.g. FuzzyExponentialBackoffRetry.


Doug Cutting added a comment - 02/May/07 04:39 PM
> FuzzyExponentialBackoffRetry

I think that's overkill. "Exponential backoff" is still the standard term, even when randomness is involved (e.g., http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff).


Sameer Paranjpye added a comment - 02/May/07 05:19 PM
Should the number of attempts and sleep time be configurable?

Maybe we should start with reasonable defaults in code and make these configurable if we discover situations where they don't work. Let's not introduce config variables unless they are required.


dhruba borthakur added a comment - 02/May/07 08:57 PM
Code looks good. One question: what do we do for addBlock()? Is there a way to cleanup the dfslcient retry code for addBlock() and make it part of the IPC-retry mechanism. Especially in the case when this RPC encounters SocketTimeouts.

Also, it would be nice if we can remove these configuration variables from hadoop-default.sh so that the user does not know how to change the timeout/retry settings.


Hairong Kuang added a comment - 03/May/07 07:04 PM
AddBlock is not idempotent so we can not simply retry when getting a timeout.

One solution is to make addBlock idempotent by adding an additional parameter block# indicating which block of the file is requested. FSNamesystem keeps AddBlock history in FileUnderConstruction.

Another solution is to provide a general framework to support retry in ipc. An ipc client assigns a unique id to each request. A retry is sent with the same request id. Server maintains a table keeping track of the recent operation result history per client. When the server receives a request, check the table to see if the request has been serveed or not. If yes, simply return the result; Otherwise, server the request. This solution works whether a request is idempotent or not. But it takes more effort to make it work.

Either way, I do not plan to solve the addBlock problem in this jira.


Hairong Kuang added a comment - 03/May/07 07:14 PM
This patch incorporates the following review comments:
1. Add a junit test case for the exponential backoff retry plicy;
2. Change the logging level in RetryInvocationHandler as described by Tom;
3. Remove configuration variable in hadoop-default.xml. I set the intial retry interval to be 200 milliseconds and the max number of retries to be 5.

Hairong Kuang made changes - 03/May/07 07:14 PM
Attachment retry1.patch [ 12356731 ]
Tom White added a comment - 03/May/07 09:55 PM
+1 (for the latest patch)

> I think that's overkill. "Exponential backoff" is still the standard term

I agree. Thanks for the pointer.


Hairong Kuang added a comment - 03/May/07 10:37 PM
Thank Tom for reviewing it.

Hairong Kuang made changes - 03/May/07 10:37 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Fix Version/s 0.13.0 [ 12312348 ]
dhruba borthakur added a comment - 03/May/07 10:37 PM
+1 Code looks good.


Repository Revision Date User Message
ASF #535963 Mon May 07 19:43:37 UTC 2007 cutting HADOOP-1263. Change DFSClient to retry certain namenode calls with an exponential backoff. Contributed by Hairong.
Files Changed
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/dfs/DFSClient.java
MODIFY /lucene/hadoop/trunk/src/test/org/apache/hadoop/io/retry/TestRetryProxy.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/io/retry/RetryPolicies.java
MODIFY /lucene/hadoop/trunk/src/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java
MODIFY /lucene/hadoop/trunk/CHANGES.txt

Doug Cutting added a comment - 07/May/07 07:43 PM
I just committed this. Thanks, Hairong!

Doug Cutting made changes - 07/May/07 07:43 PM
Status Patch Available [ 10002 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Hadoop QA added a comment - 08/May/07 11:24 AM

Doug Cutting made changes - 08/Jun/07 08:40 PM
Status Resolved [ 5 ] Closed [ 6 ]
Raghu Angadi added a comment - 13/Jun/07 06:24 AM
I am planning to use this framework for some new RPCs I am adding. I just want to confirm if my understanding is correct: This patch adds a random exponential back off timeout starting with 400 milliseconds for 5 times. In all 5 retries, this add a max of 12 seconds. Since client RPC timeout is 60sec, time it takes for such RPC to fail takes between 300-312 seconds over 6 attempts. Is this expected?, because it is not exponential back off but essentially constant timeout of around 60sec for each retry.

Raghu Angadi added a comment - 13/Jun/07 07:34 AM
> between 300-312 seconds over 6 attempts.
It should be 360-385 sec over 7 attempts. ( It retries maxRetries(5)+1 times starting with timeout of "initialTimeout(200) * 2". ).

Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]