Hadoop Common / HADOOP-1263

Retry logic when DFS exists() or open() fails temporarily, e.g. because of a timeout

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

      Description

      Sometimes, when many (e.g. 1000+) map tasks start at about the same time and require supporting files from the filecache, some of them fail because of RPC timeouts. With only the default number of 10 handlers on the namenode, the probability is high that the whole job fails (see HADOOP-1182). The situation is much better with a higher number of handlers, but some map tasks still fail.

      This could be avoided if RPC clients retried after a timeout instead of immediately throwing an exception.
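      For illustration only, a minimal sketch of the kind of client-side retry loop this suggests, wrapped around a FileSystem.exists() call; the retry count and sleep values below are placeholders, not values from the eventual patch:

      import java.io.IOException;
      import java.net.SocketTimeoutException;

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class RetryingExists {
        // Retry exists() a few times on RPC timeouts before giving up.
        public static boolean existsWithRetry(FileSystem fs, Path p) throws IOException {
          final int maxRetries = 5;     // placeholder value
          long sleepMillis = 200;       // placeholder initial backoff
          for (int attempt = 0; ; attempt++) {
            try {
              return fs.exists(p);      // the underlying namenode RPC may time out
            } catch (SocketTimeoutException e) {
              if (attempt >= maxRetries) {
                throw e;                // out of retries, surface the timeout
              }
              try {
                Thread.sleep(sleepMillis);
              } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while retrying", ie);
              }
              sleepMillis *= 2;         // simple exponential backoff
            }
          }
        }
      }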

      Examples of exceptions:

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
      at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
      at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
      at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

      java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:473)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
      at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
      at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
      at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
      at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
      at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
      at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
      at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
      at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
      at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
      at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
      at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
      at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
      at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)

      Attachments

      1. retry1.patch (9 kB, Hairong Kuang)
      2. retry.patch (8 kB, Hairong Kuang)

        Activity

        Raghu Angadi added a comment -

        > between 300-312 seconds over 6 attempts.
        It should be 360-385 sec over 7 attempts. ( It retries maxRetries(5)+1 times starting with timeout of "initialTimeout(200) * 2". ).
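        As a back-of-the-envelope check of these figures, a small sketch that reproduces the 360-385 second range under one possible reading (each timed-out attempt costs the full 60 sec client RPC timeout and is followed by a randomized sleep of at most initialTimeout * 2^k); the exact accounting in the patch may differ:

        public class RetryTimeEstimate {
          // Rough model only: assumes every attempt waits out the full client RPC
          // timeout and is then followed by a randomized sleep of at most
          // initialTimeout * 2^k for k = 1..maxRetries+1. Not taken from the patch.
          public static void main(String[] args) {
            final long rpcTimeoutMs = 60_000;   // client RPC timeout per attempt
            final long initialTimeoutMs = 200;  // initialTimeout in the patch
            final int maxRetries = 5;           // maxRetries in the patch

            long minTotalMs = 0;
            long maxTotalMs = 0;
            for (int k = 1; k <= maxRetries + 1; k++) {
              minTotalMs += rpcTimeoutMs;       // the randomized sleep can be close to 0
              maxTotalMs += rpcTimeoutMs + initialTimeoutMs * (1L << k);
            }
            // Prints roughly 360.0 .. 385.2 seconds.
            System.out.printf("between %.1f and %.1f seconds%n",
                minTotalMs / 1000.0, maxTotalMs / 1000.0);
          }
        }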

        Raghu Angadi added a comment -

        I am planning to use this framework for some new RPCs I am adding. I just want to confirm that my understanding is correct: this patch adds a random exponential backoff timeout starting with 400 milliseconds, for 5 retries. Over all 5 retries, this adds a maximum of about 12 seconds. Since the client RPC timeout is 60 sec, such an RPC takes between 300-312 seconds over 6 attempts to fail. Is this expected? Because it is not really exponential backoff but essentially a constant timeout of around 60 sec for each retry.

        Hadoop QA added a comment -

        Integrated in Hadoop-Nightly #82 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/82/)

        Doug Cutting added a comment -

        I just committed this. Thanks, Hairong!

        Hadoop QA added a comment -

        +1 http://issues.apache.org/jira/secure/attachment/12356731/retry1.patch applied and successfully tested against trunk revision r534975.
        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/114/testReport/
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/114/console

        dhruba borthakur added a comment -

        +1 Code looks good.

        Hairong Kuang added a comment -

        Thanks, Tom, for reviewing it.

        Tom White added a comment -

        +1 (for the latest patch)

        > I think that's overkill. "Exponential backoff" is still the standard term

        I agree. Thanks for the pointer.

        Hairong Kuang added a comment -

        This patch incorporates the following review comments:
        1. Add a JUnit test case for the exponential backoff retry policy;
        2. Change the logging level in RetryInvocationHandler as described by Tom;
        3. Remove the configuration variables from hadoop-default.xml. I set the initial retry interval to 200 milliseconds and the maximum number of retries to 5.

        Hairong Kuang added a comment -

        AddBlock is not idempotent, so we cannot simply retry when getting a timeout.

        One solution is to make addBlock idempotent by adding an additional parameter, the block number, indicating which block of the file is requested. FSNamesystem keeps the addBlock history in FileUnderConstruction.

        Another solution is to provide a general framework to support retry in IPC. An IPC client assigns a unique id to each request, and a retry is sent with the same request id. The server maintains a table keeping track of the recent operation result history per client. When the server receives a request, it checks the table to see whether the request has already been served. If yes, it simply returns the cached result; otherwise, it serves the request. This solution works whether a request is idempotent or not, but it takes more effort to make it work.

        Either way, I do not plan to solve the addBlock problem in this jira.
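        For illustration only, a minimal sketch of the second idea: a server-side table keyed by client and request id, so that a retried non-idempotent request replays the cached result instead of being executed twice. All names here are hypothetical and are not part of the Hadoop IPC code:

        import java.util.LinkedHashMap;
        import java.util.Map;

        public class RetryCache {
          private static final int MAX_ENTRIES = 10_000;   // bound on remembered history

          // Most-recently-used map of (clientId:requestId) -> result.
          private final Map<String, Object> results =
              new LinkedHashMap<String, Object>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                  return size() > MAX_ENTRIES;   // keep only recent results
                }
              };

          // Hypothetical callback standing in for the real server-side operation.
          public interface Invocation {
            Object invoke();
          }

          public synchronized Object handle(String clientId, long requestId, Invocation call) {
            String key = clientId + ":" + requestId;
            if (results.containsKey(key)) {
              return results.get(key);           // already served: replay the result
            }
            Object result = call.invoke();       // first time: actually serve the request
            results.put(key, result);
            return result;
          }
        }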

        dhruba borthakur added a comment -

        Code looks good. One question: what do we do for addBlock()? Is there a way to clean up the DFSClient retry code for addBlock() and make it part of the IPC-retry mechanism, especially in the case where this RPC encounters SocketTimeouts?

        Also, it would be nice if we could remove these configuration variables from hadoop-default.xml so that the user does not know how to change the timeout/retry settings.

        Sameer Paranjpye added a comment -

        Should the number of attempts and sleep time be configurable?

        Maybe we should start with reasonable defaults in code and make these configurable if we discover situations where they don't work. Let's not introduce config variables unless they are required.

        Doug Cutting added a comment -

        > FuzzyExponentialBackoffRetry

        I think that's overkill. "Exponential backoff" is still the standard term, even when randomness is involved (e.g., http://en.wikipedia.org/wiki/Truncated_binary_exponential_backoff).

        Tom White added a comment -

        It is worth reviewing the logging, especially regarding the log level/retry count combination. In particular RetryInvocationHandler always logs exceptions at warn level, which is probably wrong. If the operation is to be retried then log at info, and only log at warn when the operation will not be retried again. (Also, it might be nice to change the last log message to say how many times the operation was retried before it finally failed.)

        Also, could you add a unit test for ExponentialBackoffRetry please?

        I notice that ExponentialBackoffRetry uses a delay that incorporates a random number. This is probably a good idea to avoid the problem you are trying to fix, but since the other policies are deterministic I would make this difference clear in the name: e.g. FuzzyExponentialBackoffRetry.
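        As a rough illustration of the difference being discussed (a deterministic policy versus one whose delay incorporates a random factor), a standalone sketch of an exponential backoff policy with jitter; the class and method names are placeholders, not the actual io.retry API:

        import java.util.Random;
        import java.util.concurrent.TimeUnit;

        public class BackoffWithJitterSketch {
          private final long baseSleepMillis;
          private final int maxRetries;
          private final Random random = new Random();

          public BackoffWithJitterSketch(long baseSleepMillis, int maxRetries) {
            this.baseSleepMillis = baseSleepMillis;
            this.maxRetries = maxRetries;
          }

          // Decide whether to retry, sleeping a randomized, exponentially growing delay.
          public boolean shouldRetry(int retriesSoFar) throws InterruptedException {
            if (retriesSoFar >= maxRetries) {
              return false;                      // give up after maxRetries
            }
            // Delay drawn from [0, baseSleep * 2^(retries+1)): grows exponentially on average.
            long bound = baseSleepMillis * (1L << (retriesSoFar + 1));
            long sleep = (long) (random.nextDouble() * bound);
            TimeUnit.MILLISECONDS.sleep(sleep);
            return true;
          }
        }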

        Hairong Kuang added a comment -

        Raghu is right. This patch incorporates his comment.

        Raghu Angadi added a comment -

        Regarding namenode.complete():

        Often this returns "false" to indicate that not all blocks have been reported and the client is expected to retry complete(). This patch seems to remove this part.

        Hairong Kuang added a comment -

        This patch mainly builds a retry mechanism for DFSClient. For all the proposed operations, it retries at most a configurable number of times using an exponential backoff algorithm when receiving a SocketTimeoutException, and at most a configurable number of times using a fixed interval when receiving an AlreadyBeingCreatedException.

        It also modifies the io.retry package a little bit:
        1. It allows RetryProxy to take a non-public interface. Thanks to Tom for making the change.
        2. It adds an ExponentialBackoff retry policy to RetryPolicies.
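        A hedged sketch of how such an exception-to-policy mapping can be expressed with the io.retry package; the RetryPolicies signatures shown follow the current API and may differ from the 0.13-era code, and the AlreadyBeingCreatedException mapping is only indicated in a comment to keep the example self-contained:

        import java.net.SocketTimeoutException;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.TimeUnit;

        import org.apache.hadoop.io.retry.RetryPolicies;
        import org.apache.hadoop.io.retry.RetryPolicy;

        public class ClientRetryPolicySketch {
          // Exponential backoff for RPC timeouts; everything else fails immediately.
          // A fixed-interval policy for AlreadyBeingCreatedException would be added
          // to the same map in the spirit of the patch.
          public static RetryPolicy buildPolicy() {
            Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy =
                new HashMap<Class<? extends Exception>, RetryPolicy>();
            exceptionToPolicy.put(SocketTimeoutException.class,
                RetryPolicies.exponentialBackoffRetry(5, 200, TimeUnit.MILLISECONDS));
            return RetryPolicies.retryByException(
                RetryPolicies.TRY_ONCE_THEN_FAIL, exceptionToPolicy);
          }
        }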

        Hairong Kuang added a comment -

        > I think the create() and cleanup() RPCs can be retried too.
        I believe that Dhruba meant create() and complete(). +1 on his suggestion.

        dhruba borthakur added a comment -

        I think the create() and cleanup() RPCs can be retried too. In fact, the DFSClient, in its current incarnation, does retry the create() RPC three times and the cleanup() RPC indefinitely. I believe that it is safe to retry them because:

        1. The first attempt did not even reach the namenode. The retried second attempt can reach the namenode and successfully complete.
        2. The first attempt was processed by the namenode successfully but the response did not reach the dfsclient. The dfsclient can retry and the retries will fail. The entire operation fails anyway.

        Either way, it is safe to retry the create() & cleanup() RPCs.

        Hairong Kuang added a comment -

        The annotation method proposed in HADOOP-601 to provide a general retry framework in RPC seems to be a simple solution, but since it is not implemented, for this JIRA I plan to implement the retry mechanism only for ClientProtocol, using the retry framework implemented in HADOOP-997. Here is what I plan to do:

        1. Add an exponential backoff policy to RetryPolicies.
        2. Create a retry proxy for the DFS client using the following method-to-RetryPolicy map (a sketch of this wiring follows after this comment):

        • TRY-ONCE-THEN-FAIL: create, addBlock, complete
        • EXPONENTIAL-BACKOFF: open, setReplication, abandonBlock, abandonFileInProgress, reportBadBlocks, exists, isDir, getListing, getHints, renewLease, getStats, getDatanodeReport, getBlockSize, getEditLogSize
        • I still have not decided which retry policy to use for
          (1) rename, delete, mkdirs because a retry following a successful operation at the server side will return false instead of true;
          (2) setSafeMode, refreshNodes, rollEditLog, rollFsImage, finalizeUpgrade, metaSave because I still need time to read the code for these methods.

        Any suggestion is welcome!
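        As a rough sketch of the wiring described in step 2 above: building a method-name-to-policy map and handing it to RetryProxy. The method list and values are abbreviated and illustrative, and the RetryProxy/RetryPolicies signatures follow the current io.retry API, which may differ from the 0.13-era version:

        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.TimeUnit;

        import org.apache.hadoop.io.retry.RetryPolicies;
        import org.apache.hadoop.io.retry.RetryPolicy;
        import org.apache.hadoop.io.retry.RetryProxy;

        public class MethodPolicyMapSketch {
          @SuppressWarnings("unchecked")
          public static <T> T wrapWithRetries(Class<T> protocol, T rawProxy) {
            RetryPolicy tryOnce = RetryPolicies.TRY_ONCE_THEN_FAIL;
            RetryPolicy backoff =
                RetryPolicies.exponentialBackoffRetry(5, 200, TimeUnit.MILLISECONDS);

            Map<String, RetryPolicy> methodToPolicy = new HashMap<String, RetryPolicy>();
            // Non-idempotent calls: never retried automatically.
            for (String m : new String[] {"create", "addBlock", "complete"}) {
              methodToPolicy.put(m, tryOnce);
            }
            // Idempotent calls (abbreviated list): retried with exponential backoff.
            for (String m : new String[] {"open", "exists", "getListing", "renewLease"}) {
              methodToPolicy.put(m, backoff);
            }
            return (T) RetryProxy.create(protocol, rawProxy, methodToPolicy);
          }
        }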

        Hairong Kuang added a comment -

        On second thought, I feel that simply adding a retry mechanism to IPC may not work, because not all the server operations are idempotent. One option is to add an additional parameter to each call indicating whether the call should be retried when it times out, but it seems that our current RPC framework cannot easily support this feature.

        Hairong Kuang added a comment -

        Tom, thank you so much. Owen also mentioned the retry package to me yesterday. I have started to take a look at it and am trying to figure out how to use it. The info that you provided is definitely helpful.

        Tom White added a comment -

        We should be able to use the org.apache.hadoop.io.retry package for this. See http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/package-summary.html. There are a number of policies available (http://lucene.apache.org/hadoop/api/org/apache/hadoop/io/retry/RetryPolicies.html), including one that sleeps by an amount proportional to the number of retries. Writing one with exponential backoff would be easy too.

        See also HADOOP-601.

        dhruba borthakur added a comment -

        I like the idea of having a small timeout at first and then having exponential backoff so as to not load the server. This is definitely a good thing for "server scalability". If you decide to do it in the ipc package, then all the retry loops in the DFSClient might need to go away.

        Hairong Kuang added a comment -

        The RPC timeout problem is not specific to DFS. Shall we fix it in the ipc package? I am thinking of using an exponential backoff algorithm to reduce the load on the name server.


          People

          • Assignee: Hairong Kuang
          • Reporter: Christian Kunz
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
