
Key: HADOOP-1263
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Hairong Kuang
Reporter: Christian Kunz
Votes: 0
Watchers: 0
Hadoop Common

retry logic when dfs exists or open fails temporarily, e.g. because of a timeout

Created: 17/Apr/07 05:27 AM   Updated: 08/Jul/09 04:42 PM
Component/s: None
Affects Version/s: 0.12.3
Fix Version/s: 0.13.0

Time Tracking:
Not Specified

File Attachments:
retry.patch (8 kB), 2007-05-02 12:43 AM, Hairong Kuang, licensed for inclusion in ASF works
retry1.patch (9 kB), 2007-05-03 07:14 PM, Hairong Kuang, licensed for inclusion in ASF works

Resolution Date: 07/May/07 07:43 PM


Description:
Sometimes, when many (e.g. 1000+) map tasks start at about the same time and require supporting files from the filecache, some of them fail because of RPC timeouts. With only the default of 10 handlers on the namenode, the probability is high that the whole job fails (see HADOOP-1182). A higher number of handlers helps considerably, but some map tasks still fail.

This could be avoided if RPC clients retried on a timeout before throwing an exception.

Examples of exceptions:

java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:473)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
at org.apache.hadoop.dfs.$Proxy1.exists(Unknown Source)
at org.apache.hadoop.dfs.DFSClient.exists(DFSClient.java:320)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
at org.apache.hadoop.filecache.DistributedCache.createMD5(DistributedCache.java:327)
at org.apache.hadoop.filecache.DistributedCache.ifExistsAndFresh(DistributedCache.java:253)
at org.apache.hadoop.filecache.DistributedCache.localizeCache(DistributedCache.java:169)
at org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:86)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:117)

java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:473)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
at org.apache.hadoop.dfs.$Proxy1.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:511)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:498)
at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:207)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.open(DistributedFileSystem.java:129)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.<init>(ChecksumFileSystem.java:110)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:330)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
at org.apache.hadoop.fs.ChecksumFileSystem.copyToLocalFile(ChecksumFileSystem.java:577)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:766)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:370)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:877)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:545)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:913)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1603)
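
Both traces end in a java.net.SocketTimeoutException thrown from the RPC client, so the retry proposed above could look roughly like the sketch below. This is only an illustration of the idea, not the content of the attached patches: the class name RetryOnTimeout, the parameters maxRetries and sleepMillis, and the fixed sleep between attempts are all assumptions made for the example.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: retries an operation on timeout.
public class RetryOnTimeout {

    /**
     * Runs op and returns its result. If op throws SocketTimeoutException,
     * sleeps for sleepMillis and tries again, up to maxRetries extra attempts.
     * Any other exception is propagated immediately.
     */
    public static <T> T call(Callable<T> op, int maxRetries, long sleepMillis)
            throws Exception {
        int retries = 0;
        while (true) {
            try {
                return op.call();
            } catch (SocketTimeoutException e) {
                if (retries >= maxRetries) {
                    throw e; // retries exhausted, surface the timeout
                }
                retries++;
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an RPC such as exists(): times out twice, then succeeds.
        final int[] calls = {0};
        Callable<Boolean> flakyExists = () -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("timed out waiting for rpc response");
            }
            return Boolean.TRUE;
        };
        Boolean exists = call(flakyExists, 5, 10L);
        System.out.println("exists=" + exists + " after " + calls[0] + " calls");
    }
}
```

Retrying only on SocketTimeoutException keeps genuine failures (for example a missing file) visible right away, while the bound on retries ensures a persistently overloaded namenode still produces an error instead of hanging the task.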



Change History:
Hairong Kuang made changes - 24/Apr/07 10:52 PM
Field Original Value New Value
Assignee Hairong Kuang [ hairong ]
Hairong Kuang made changes - 01/May/07 10:00 PM
Attachment retry.patch [ 12356579 ]
Hairong Kuang made changes - 01/May/07 10:11 PM
Attachment retry.patch [ 12356579 ]
Hairong Kuang made changes - 01/May/07 10:12 PM
Attachment retry.patch [ 12356582 ]
Hairong Kuang made changes - 02/May/07 12:23 AM
Attachment retry.patch [ 12356591 ]
Hairong Kuang made changes - 02/May/07 12:26 AM
Attachment retry.patch [ 12356582 ]
Hairong Kuang made changes - 02/May/07 12:43 AM
Attachment retry.patch [ 12356592 ]
Hairong Kuang made changes - 02/May/07 12:43 AM
Attachment retry.patch [ 12356591 ]
Hairong Kuang made changes - 03/May/07 07:14 PM
Attachment retry1.patch [ 12356731 ]
Hairong Kuang made changes - 03/May/07 10:37 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Fix Version/s 0.13.0 [ 12312348 ]
Doug Cutting made changes - 07/May/07 07:43 PM
Status Patch Available [ 10002 ] Resolved [ 5 ]
Resolution Fixed [ 1 ]
Doug Cutting made changes - 08/Jun/07 08:40 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:42 PM
Component/s dfs [ 12310710 ]