Hadoop Common / HADOOP-1263

Retry logic when dfs exists or open fails temporarily, e.g. because of a timeout


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

    Description

      Sometimes, when many (e.g. 1000+) map tasks start at about the same time and require supporting files from the filecache, some of them fail because of RPC timeouts. With only the default of 10 handlers on the namenode, there is a high probability that the whole job fails (see HADOOP-1182). A higher number of handlers helps considerably, but some map tasks still fail.

      This could be avoided if RPC clients retried after a timeout instead of throwing an exception immediately.

      Examples of exceptions:

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
      at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
      at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
      at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
      at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
      at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
      at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
      at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
      at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
      at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
      at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
      at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)
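
      The retry behaviour requested above could look roughly like the following. This is only a minimal sketch, not the attached patch: the class RetryOnTimeout, the method callWithRetry, and the parameters maxRetries and sleepMillis are hypothetical names chosen for illustration, and a fixed sleep between attempts is an assumption. It retries only on java.net.SocketTimeoutException, the exception shown in both traces, and rethrows it once the retries are used up.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical helper; not part of Hadoop. */
public class RetryOnTimeout {

    /**
     * Runs op and retries it when it times out.
     *
     * @param op          the call to attempt, e.g. an exists or open request
     * @param maxRetries  number of retries after the first attempt (assumed knob)
     * @param sleepMillis fixed pause between attempts (assumed knob)
     */
    public static <T> T callWithRetry(Callable<T> op, int maxRetries, long sleepMillis)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        int retriesDone = 0;
        while (true) {
            try {
                return op.call();
            } catch (SocketTimeoutException e) {
                // Only timeouts are retried; any other exception propagates at once.
                if (retriesDone >= maxRetries) {
                    throw e; // retries exhausted: surface the original timeout
                }
                retriesDone++;
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

      A caller would wrap the failing operation, for example callWithRetry(() -> fs.exists(path), 3, 1000L), where fs and path stand for whatever file system handle and path the caller already has. Retrying is only safe for operations that can be repeated without side effects, which holds for exists and open.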

      Attachments

        1. retry.patch
          8 kB
          Hairong Kuang
        2. retry1.patch
          9 kB
          Hairong Kuang


          People

            Assignee: Hairong Kuang
            Reporter: Christian Kunz
            Votes: 0
            Watchers: 1
