Hadoop Common / HADOOP-1263

Retry logic when DFS exists() or open() fails temporarily, e.g. because of a timeout

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

      Description

      Sometimes, when many (e.g. 1000+) map tasks start at about the same time and require supporting files from the filecache, some of them fail because of RPC timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see HADOOP-1182). The situation is much better with a higher number of handlers, but some map tasks still fail.

      This could be avoided if RPC clients retried after a timeout instead of immediately throwing an exception.
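      For illustration only, a minimal sketch of the kind of client-side retry loop this suggests, wrapped around a FileSystem.exists() call; the retry count and sleep values below are placeholders, not values from the eventual patch:

      import java.io.IOException;
      import java.net.SocketTimeoutException;

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class RetryingExists {
        // Retry exists() a few times on RPC timeouts before giving up.
        public static boolean existsWithRetry(FileSystem fs, Path p) throws IOException {
          final int maxRetries = 5;     // placeholder value
          long sleepMillis = 200;       // placeholder initial backoff
          for (int attempt = 0; ; attempt++) {
            try {
              return fs.exists(p);      // the underlying namenode RPC may time out
            } catch (SocketTimeoutException e) {
              if (attempt >= maxRetries) {
                throw e;                // out of retries, surface the timeout
              }
              try {
                Thread.sleep(sleepMillis);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while retrying", ie);
              }
              sleepMillis *= 2;         // simple exponential backoff
            }
          }
        }
      }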

      Examples of exceptions:

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
      at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
      at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
      at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
      at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
      at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
      at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
      at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
      at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
      at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
      at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
      at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

      Attachments

      1. retry1.patch (9 kB, Hairong Kuang)
      2. retry.patch (8 kB, Hairong Kuang)

        Activity

        Raghu Angadi added a comment -

        > between 300-312 seconds over 6 attempts.
        It should be 360-385 sec over 7 attempts. ( It retries maxRetries(5)+1 times starting with timeout of "initialTimeout(200) * 2". ).
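        As a back-of-the-envelope check of these figures, a small sketch that reproduces the 360-385 second range under one possible reading (each timed-out attempt costs the full 60 sec client RPC timeout and is followed by a randomized sleep of at most initialTimeout * 2^k); the exact accounting in the patch may differ:

        public class RetryTimeEstimate {
          // Rough model only: assumes every attempt waits out the full client RPC
          // timeout and is then followed by a randomized sleep of at most
          // initialTimeout * 2^k for k = 1..maxRetries+1. Not taken from the patch.
          public static void main(String[] args) {
            final long rpcTimeoutMs = 60_000;   // client RPC timeout per attempt
            final long initialTimeoutMs = 200;  // initialTimeout in the patch
            final int maxRetries = 5;           // maxRetries in the patch

            long minTotalMs = 0;
            long maxTotalMs = 0;
            for (int k = 1; k <= maxRetries + 1; k++) {
              minTotalMs += rpcTimeoutMs;       // the randomized sleep can be close to 0
              maxTotalMs += rpcTimeoutMs + initialTimeoutMs * (1L << k);
            }
            // Prints roughly 360.0 .. 385.2 seconds.
            System.out.printf("between %.1f and %.1f seconds%n",
                minTotalMs / 1000.0, maxTotalMs / 1000.0);
          }
        }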

        Raghu Angadi added a comment -

        I am planning to use this framework for some new RPCs I am adding. I just want to confirm that my understanding is correct: this patch adds a random exponential backoff timeout starting with 400 milliseconds, for 5 retries. Over all 5 retries, this adds a maximum of about 12 seconds. Since the client RPC timeout is 60 sec, such an RPC takes between 300-312 seconds over 6 attempts to fail. Is this expected? Because it is not really exponential backoff but essentially a constant timeout of around 60 sec for each retry.

        Hadoop QA added a comment -

        Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)

        Doug Cutting added a comment -

        I just committed this. Thanks, Hairong!

        Hadoop QA added a comment -

        +1 http://issues.apache.org/jira/secure/attachment/12356731/retry1.patch applied and successfully tested against trunk revision r534975.
        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/114/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/114/console

        dhruba borthakur added a comment -

        +1 Code looks good.

        Hairong Kuang added a comment -

        Thanks, Tom, for reviewing it.

        Tom White added a comment -

        +1 (for the latest patch)

        > I think that's overkill. "Exponential backoff" is still the standard term

        I agree. Thanks for the pointer.

        Hairong Kuang added a comment -

        This patch incorporates the following review comments:
        1. Add a JUnit test case for the exponential backoff retry policy;
        2. Change the logging level in RetryInvocationHandler as described by Tom;
        3. Remove the configuration variables from hadoop-default.xml. I set the initial retry interval to 200 milliseconds and the maximum number of retries to 5.

        Hairong Kuang added a comment -

        AddBlock is not idempotent, so we cannot simply retry when getting a timeout.

        One solution is to make addBlock idempotent by adding an additional parameter, the block number, indicating which block of the file is requested. FSNamesystem keeps the addBlock history in FileUnderConstruction.

        Another solution is to provide a general framework to support retry in IPC. An IPC client assigns a unique id to each request, and a retry is sent with the same request id. The server maintains a table keeping track of the recent operation result history per client. When the server receives a request, it checks the table to see whether the request has already been served. If yes, it simply returns the cached result; otherwise, it serves the request. This solution works whether a request is idempotent or not, but it takes more effort to make it work.

        Either way, I do not plan to solve the addBlock problem in this jira.
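        For illustration only, a minimal sketch of the second idea: a server-side table keyed by client and request id, so that a retried non-idempotent request replays the cached result instead of being executed twice. All names here are hypothetical and are not part of the Hadoop IPC code:

        import java.util.LinkedHashMap;
        import java.util.Map;

        public class RetryCache {
          private static final int MAX_ENTRIES = 10_000;   // bound on remembered history

          // Most-recently-used map of (clientId:requestId) -> result.
          private final Map<String, Object> results =
              new LinkedHashMap<String, Object>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                  return size() > MAX_ENTRIES;   // keep only recent results
                }
              };

          // Hypothetical callback standing in for the real server-side operation.
          public interface Invocation {
            Object invoke();
          }

          public synchronized Object handle(String clientId, long requestId, Invocation call) {
            String key = clientId + ":" + requestId;
            if (results.containsKey(key)) {
              return results.get(key);           // already served: replay the result
            }
            Object result = call.invoke();       // first time: actually serve the request
            results.put(key, result);
            return result;
          }
        }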

        dhruba borthakur added a comment -

        Code looks good. One question: what do we do for addBlock()? Is there a way to clean up the DFSClient retry code for addBlock() and make it part of the IPC-retry mechanism, especially in the case where this RPC encounters SocketTimeouts?

        Also, it would be nice if we could remove these configuration variables from hadoop-default.xml so that the user does not know how to change the timeout/retry settings.

        Sameer Paranjpye added a comment -

        Should the number of attempts and sleep time be configurable?

        Maybe we should start with reasonable defaults in code and make these configurable if we discover situations where they don't work. Let's not introduce config variables unless they are required.

        Doug Cutting added a comment -

        > FuzzyExponentialBackoffRetry

        I think that's overkill. "Exponential backoff" is still the standard term, even when randomness is involved (e.g., http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff).

        Tom White added a comment -

        It is worth reviewing the logging, especially regarding the log level/retry count combination. In particular RetryInvocationHandler always logs exceptions at warn level, which is probably wrong. If the operation is to be retried then log at info, and only log at warn when the operation will not be retried again. (Also, it might be nice to change the last log message to say how many times the operation was retried before it finally failed.)

        Also, could you add a unit test for ExponentialBackoffRetry please?

        I notice that ExponentialBackoffRetry uses a delay that incorporates a random number. This is probably a good idea to avoid the problem you are trying to fix, but since the other policies are deterministic I would make this difference clear in the name: e.g. FuzzyExponentialBackoffRetry.
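        As a rough illustration of the difference being discussed (a deterministic policy versus one whose delay incorporates a random factor), a standalone sketch of an exponential backoff policy with jitter; the class and method names are placeholders, not the actual io.retry API:

        import java.util.Random;
        import java.util.concurrent.TimeUnit;

        public class BackoffWithJitterSketch {
          private final long baseSleepMillis;
          private final int maxRetries;
          private final Random random = new Random();

          public BackoffWithJitterSketch(long baseSleepMillis, int maxRetries) {
            this.baseSleepMillis = baseSleepMillis;
            this.maxRetries = maxRetries;
          }

          // Decide whether to retry, sleeping a randomized, exponentially growing delay.
          public boolean shouldRetry(int retriesSoFar) throws InterruptedException {
            if (retriesSoFar >= maxRetries) {
              return false;                      // give up after maxRetries
            }
            // Delay drawn from [0, baseSleep * 2^(retries+1)): grows exponentially on average.
            long bound = baseSleepMillis * (1L << (retriesSoFar + 1));
            long sleep = (long) (random.nextDouble() * bound);
            TimeUnit.MILLISECONDS.sleep(sleep);
            return true;
          }
        }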

        Hairong Kuang added a comment -

        Raghu is right. This patch incorporates his comment.

        Raghu Angadi added a comment -

        Regarding namenode.complete():

        Often this returns "false" to indicate that not all blocks have been reported and the client is expected to retry complete(). This patch seems to remove this part.

        Hairong Kuang added a comment -

        This patch mainly builds a retry mechanism for DFSClient. For all the proposed operations, it retries at most a configurable number of times using an exponential backoff algorithm when receiving a SocketTimeoutException, and at most a configurable number of times using a fixed interval when receiving an AlreadyBeingCreatedException.

        It also modifies the io.retry package a little bit:
        1. It allows RetryProxy to take a non-public interface. Thanks to Tom for making the change.
        2. It adds an ExponentialBackoff retry policy to RetryPolicies.
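        A hedged sketch of how such an exception-to-policy mapping can be expressed with the io.retry package; the RetryPolicies signatures shown follow the current API and may differ from the 0.13-era code, and the AlreadyBeingCreatedException mapping is only indicated in a comment to keep the example self-contained:

        import java.net.SocketTimeoutException;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.TimeUnit;

        import org.apache.hadoop.io.retry.RetryPolicies;
        import org.apache.hadoop.io.retry.RetryPolicy;

        public class ClientRetryPolicySketch {
          // Exponential backoff for RPC timeouts; everything else fails immediately.
          // A fixed-interval policy for AlreadyBeingCreatedException would be added
          // to the same map in the spirit of the patch.
          public static RetryPolicy buildPolicy() {
            Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy =
                new HashMap<Class<? extends Exception>, RetryPolicy>();
            exceptionToPolicy.put(SocketTimeoutException.class,
                RetryPolicies.exponentialBackoffRetry(5, 200, TimeUnit.MILLISECONDS));
            return RetryPolicies.retryByException(
                RetryPolicies.TRY_ONCE_THEN_FAIL, exceptionToPolicy);
          }
        }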

        Hairong Kuang added a comment -

        > I think the create() and cleanup() RPCs can be retried too.
        I believe that Dhruba meant create() and complete(). +1 on his suggestion.

        dhruba borthakur added a comment -

        I think the create() and cleanup() RPCs can be retried too. In fact, the DFSClient, in its current incarnation, does retry the create() RPC three times and the cleanup() RPC indefinitely. I believe that it is safe to retry them because:

        1. The first attempt did not even reach the namenode. The retried second attempt can reach the namenode and successfully complete.
        2. The first attempt was processed by the namenode successfully but the response did not reach the dfsclient. The dfsclient can retry and the retries will fail. The entire operation fails anyway.

        Either way, it is safe to retry the create() & cleanup() RPCs.

        Hairong Kuang added a comment -

        The annotation method proposed in HADOOP-601 to provide a general retry framework in RPC seems to be a simple solution, but since it is not implemented, for this JIRA I plan to implement the retry mechanism only for ClientProtocol, using the retry framework implemented in HADOOP-997. Here is what I plan to do:

        1. Add an exponential backoff policy to RetryPolicies.
        2. Create a retry proxy for the DFS client using the following method-to-RetryPolicy map (a sketch of this wiring follows after this comment):

        • TRY-ONCE-THEN-FAIL: create, addBlock, complete
        • EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock, abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints, renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
        • I still have not decided which retry policy to use for
          (1) rename, delete, mkdirs because a retry following a successful operation at the server side will return false instead of true;
          (2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade, metaSave because I still need time to read the code for these methods.

        Any suggestion is welcome!
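        As a rough sketch of the wiring described in step 2 above: building a method-name-to-policy map and handing it to RetryProxy. The method list and values are abbreviated and illustrative, and the RetryProxy/RetryPolicies signatures follow the current io.retry API, which may differ from the 0.13-era version:

        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.TimeUnit;

        import org.apache.hadoop.io.retry.RetryPolicies;
        import org.apache.hadoop.io.retry.RetryPolicy;
        import org.apache.hadoop.io.retry.RetryProxy;

        public class MethodPolicyMapSketch {
          @SuppressWarnings("unchecked")
          public static <T> T wrapWithRetries(Class<T> protocol, T rawProxy) {
            RetryPolicy tryOnce = RetryPolicies.TRY_ONCE_THEN_FAIL;
            RetryPolicy backoff =
                RetryPolicies.exponentialBackoffRetry(5, 200, TimeUnit.MILLISECONDS);

            Map<String, RetryPolicy> methodToPolicy = new HashMap<String, RetryPolicy>();
            // Non-idempotent calls: never retried automatically.
            for (String m : new String[] {"create", "addBlock", "complete"}) {
              methodToPolicy.put(m, tryOnce);
            }
            // Idempotent calls (abbreviated list): retried with exponential backoff.
            for (String m : new String[] {"open", "exists", "getListing", "renewLease"}) {
              methodToPolicy.put(m, backoff);
            }
            return (T) RetryProxy.create(protocol, rawProxy, methodToPolicy);
          }
        }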

        Hairong Kuang added a comment -

        On second thought, I feel that simply adding a retry mechanism to IPC may not work, because not all the server operations are idempotent. One option is to add an additional parameter to each call indicating whether the call should be retried when it times out, but it seems that our current RPC framework cannot easily support this feature.

        Hairong Kuang added a comment -

        Tom, thank you so much. Owen also mentioned the retry package to me yesterday. I have started to take a look at it and am trying to figure out how to use it. The info that you provided is definitely helpful.

        Tom White added a comment -

        We should be able to use the org.apache.hadoop.io.retry package for this. See http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/package-summary.html. There are a number of policies available (http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/RetryPolicies.html), including one that sleeps by an amount proportional to the number of retries. Writing one with exponential backoff would be easy too.

        See also HADOOP-601.

        dhruba borthakur added a comment -

        I like the idea of having a small timeout at first and then having exponential backoff so as to not load the server. This is definitely a good thing for "server scalability". If you decide to do it in the ipc package, then all the retry loops in the DFSClient might need to go away.

        Hairong Kuang added a comment -

        The RPC timeout problem is not specific to DFS. Shall we fix it in the ipc package? I am thinking of using an exponential backoff algorithm to reduce the load on the name server.


          People

          • Assignee: Hairong Kuang
          • Reporter: Christian Kunz
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
