|
The problem could be - I repeat could be- from
I have a patch that (correctly) puts the task tracker back to retrying This patch recognises ConnectException and extends them into ConnectExceptions that include (host,port) info; all other IOExceptions are relayed unchanged. I havent included any tests but it does work for the tests we have that bring up both datanodes and task trackers without the expected namenode and datatracker.
This is indeed a regression caused by
Basic question: why does Client wrap one IOException in another?
Steve, is this a vanilla 0.18? I am wondering how this actually happened. NetUtils.connect() is only in trunk. Also , "org.apache.hadoop.ipc.Client.call" does not actually catch exception from getConnection() ... Raghu,
> why does Client wrap one IOException in another? I dont know the original reason; >is this a vanilla 0.18? I'm only work with SVN_HEAD; it's present there. If Hairong thinks it came in with > Also , "org.apache.hadoop.ipc.Client.call" does not actually catch exception from getConnection() ... Client.call doesnt catch the exception. The problem is that RPC.waitForProxy does, and it handles ConnectException and SocketTimeoutException by logging, sleeping, and trying again. This was not happening when the ConnectException was being downgraded, so the task tracker was failing if it came up before the job tracker, rather than waiting quietly for the tracker to come back up. As a result there is a race condition in cluster startup and the cluster is more brittle Here's where the exceptions get picked up in RPC.java public static VersionedProtocol waitForProxy(Class protocol, > Client.call doesnt catch the exception. ...
right. call() and waitForProxy() didn't change w.r.t. this. Now I see : the main difference is that setupIOstreams() used to throw the exception it got before I think this should be a blocker for 0.18.3. Yes, looks that throwing IOException from setupIOstreams is the easiest fix. Here comes the patch! I have tested it on my local machine and it worked.
+1 on the patch.
semi-related : I think adding hostip and port to "local exception" message as suggested by Steve in I do not think
The two patches in here do slightly different things, and need to be merged.
-my one left the class of an exception alone (and, for ConnectExceptions, inserted the host and port into the text, as with I think both are needed, so will apply Hairong's to my code and generate a combined patch. I can test this by deploying an orphan task tracker, but in that situation, once the code is fixed, the TaskTracker will spin forever. If a timeout on the retries could be provided, we could add a test that verified the tracker ran for 20-30s before timing out and relaying the exception. In production you'd set the timeout to a number of hours or forever, obviously. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394093/connectRetry.patch against trunk revision 718232. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3603/testReport/ This message is automatically generated. On the second thought, throwing IOException from setupIOstreams is not a complete solution. While one RPC is in the middle of setupIOstreams, there might be a different call that uses the same connection and is about to setupIOstreams. If the first setup gets a ConnectException, the second call will end up seeing a closed connection and ConnectException gets delivered to call in call.error. This means that ConnectException will be wrapped.
I am thinking to solve the problem using my initial solution: check root cause in waitForProxy. As for testing, I like Steve's idea. How about adding an API wait for proxy with a timeout to RPC. The current API waitForProxy could use a timeout with the max long value. > [...] second call will end up seeing a closed connection and ConnectException gets delivered to call in call.error. [...]
From the code it does not seem to be the case (addCall() in getConnection() will be false). I might be mistaken... But if what you say is correct, then it is a bug that new call tries to use a closed connection rather than creating a new one as it should. I meant that for the second call, the call gets added to the connection before the first call marks the connection as closed.
This patch checks the cause of the failure when setting up a RPC client tries to connect to a RPC server. It retries if it is caused by an unavailable or busy server.
It adds a new static method waitForProxy with a timeout mainly for the purpose of testing. A unit test is added to TestRPC to makes sure that client retries. A manual test is also conducted that starting a DataNode without starting NameNode causes DataNode to retry. Steve, could you please review and test the patch in your setup? I appreciate any of your feedback. I'll look at this today. Last night's test run hung on TestFileCreationClient which may indicate some problems there; more research needed.
I'm going to put a merged patch up, but although the RPC test is passing, the spinning appears to be creating deadlock in TestFileCreationClient; relevant bits of the thread dump to follow.
1. We're sleeping here holding Connection@0x2e4f3e0 [junit] "DataStreamer for file /wrwelkj/file9 block blk_-4298389317957709021_1010" id=133 idx=0x210 tid=25976 prio=5 alive, in native, sleeping, native_waiting, daemon 2. Which is blocking this and this [junit] – Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock] and this So: the sleep in setupIOStreams appears to be blocking the other operations. for some reason, <junit> isn't timing out or killing the process, which implies this is fairly serious. Here's a merged patch that retains the same exception types as before (so the calling code does not need to look inside nested exceptions), and which contains the test. The new TestRPC test is correctly detecting failure to connect.
Where we do have a problem is that on my machine (64-bit JRockit JVM on Ubuntu), I'm seeing TestFileCreationClient hang and it appears to be in these methods. Accordingly I'm not setting the patch available flag as it may cause trouble for Hudson. Hi Steve, I have a few comments on your new patch:
1. I do not think it is right to throw IOException in setupIOstreams as I commented yesterday; If setIOstreams do throw, then most of the time, ConnectionException will get thrown at getConnection in call() and will not deliver to call.error; So it is of no use to wrap the exception there. 2. We wrap local exception on purpose in I want to get waitForProxy work in 0.18.3. Can we agree on a minimum change to make it work? RPC is so fundamental in Hadoop that any minor change may cause unexpected problem. So I am thinking that the smaller the change the better. If you do not like the idea of checking the cause of failure in waitForProxy, I am OK with your idea of Client.call() wrapping ConnectException as ConnectException, SocketTimeoutException as SockettimeoutException, and other exceptions as IOException. As for the junit test failure, I can not reproduce it on my local machine. Can you check why the RPC server is not up so the first thread stuck in waitForProxy? Could you please tell me on which line of the test that the test hang? Looks that the hanging junit test TestFileCreationClient is caused by
OK, I'll update my code and rerun it, though my timetable for monday/tuesday is a bit patchy. I also have to handle the jetty6 migration, which, while welcome. may complicate my life for a day or so
Hi Steve, I already have a patch which uses most of the code in your previous patch. You have a very clean programming style.
I'm going to push out my updated lifecycle patches shortly. One test I have there brings up a tasktracker without the rest of the infrastructure (DFS, jobtracker); it is now hanging until the test times out, spinning while things get set up, waiting for a job tracker that never arrives.
[junit] Tue Nov 25 13:50:13 2008 What I propose here is to move TaskTracker to have a timeout on its waitForProxy() operation, so that if the TT comes up before the JT, there's a bit of leeway, but eventually the TT will conclude that it is an orphan and that it cannot start up Adding a timeout on TaskTracker's call for waitForProxy() lets us have a TaskTracker that eventually gives up when the JobTracker isnt there. The other places that waitForProxy() are called with no timeouts are in the connection routines of Datanode and SecondaryNameNode(). I think all of these should be designed to take a configuration parameter that tells them when to give up, and a default value of many minutes or more to deal with basic choreography issues in a cluster. Test clusters can be set up to fail sooner rather than later.
Thoughts? Yes, I really like the idea of having timeout in waitForProxy when TaskTracker, DataNode, and SecondaryNameNode connect to a server. I would propose to do this in the trunk in a separate jira since it is not a regression. How do you think?
Steve, thanks for testing this patch and putting a lot thoughts on this issue. I created HADOOP-4724 to handle TT, DN, and SN timeout problem.
Ant test-core passed:
BUILD SUCCESSFUL Total time: 122 minutes 21 seconds Ant test-patch passed too: [exec] +1 @author. The patch does not contain any @author tags. [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] +1 javadoc. The javadoc tool did not generate any warningmessages. [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 patch looks good.
I just committed this. Thank you Steve!
A timeout of '0' conventionally means it should not timeout. Here it returns immediately. I think we should have JavaDoc for waitForProxy() so that this is explicit. Hopefully these get fixed in HADOOP-4724.
I am adding incompatible change flag, because e.g. secondary name-node used to fail after 10 attempts. Now it is trying forever.
It is not incompatible w.r.t 0.17.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Termination Record: HOST morzine.hpl.hp.com:rootProcess:testOrphanTracker:action:taskTracker, type: abnormal, description: Service has halted (this termination was not expected)
java.io.IOException: Call to localhost/127.0.0.1:8012 failed on local exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.call(Client.java:699)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy7.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:453)
at org.apache.hadoop.mapred.TaskTracker.innerStart(TaskTracker.java:831)
at org.apache.hadoop.util.Service.start(Service.java:186)
at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.innerDeploy(HadoopServiceImpl.java:480)
at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.access$000(HadoopServiceImpl.java:47)
at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl$ServiceDeployerThread.execute(HadoopServiceImpl.java:630)
at org.smartfrog.sfcore.utils.SmartFrogThread.run(SmartFrogThread.java:279)
at org.smartfrog.sfcore.utils.WorkflowThread.run(WorkflowThread.java:117)
//and the nested exception
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:771)
at org.apache.hadoop.ipc.Client.call(Client.java:685)