Hadoop Common / HADOOP-4659

Root cause of connection failure is being lost to code that uses it for delaying startup

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.18.3
    • Fix Version/s: 0.18.3
    • Component/s: ipc
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed

      Description

      In ipc.Client, the root cause of a connection failure is being lost when the exception is wrapped, so the outside code that looks for that root cause isn't working as expected. The result is that you can't bring up a task tracker before the job tracker, and probably the same for a datanode before a namenode. The change that triggered this has not yet been located; I had thought it was HADOOP-3844 but I no longer believe this is the case.
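
      A minimal sketch of the failure mode (illustrative only, not the actual ipc.Client code): if the client wraps a ConnectException in a plain IOException, a caller that catches ConnectException to decide whether to retry never sees it.

      import java.io.IOException;
      import java.net.ConnectException;

      public class WrappingDemo {
        // Stand-in for the low-level connect; always refused in this demo.
        static void connect() throws ConnectException {
          throw new ConnectException("Connection refused");
        }

        // Wrapping the failure in a generic IOException hides its type from callers.
        static void call() throws IOException {
          try {
            connect();
          } catch (ConnectException e) {
            throw new IOException("Call to server failed on local exception: " + e);
          }
        }

        public static void main(String[] args) {
          try {
            call();
          } catch (ConnectException e) {
            // Retry logic would live here, but it is never reached once the type is lost.
            System.out.println("would retry: " + e);
          } catch (IOException e) {
            // The wrapped exception falls through to the generic handler instead.
            System.out.println("startup aborted: " + e);
          }
        }
      }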

      1. hadoop-4659.patch
        1 kB
        steve_l
      2. connectRetry.patch
        0.7 kB
        Hairong Kuang
      3. rpcConn.patch
        3 kB
        Hairong Kuang
      4. hadoop-4659.patch
        6 kB
        steve_l
      5. rpcConn1.patch
        6 kB
        Hairong Kuang

        Issue Links

          Activity

          steve_l added a comment -

          full stack trace.

          Termination Record: HOST morzine.hpl.hp.com:rootProcess:testOrphanTracker:action:taskTracker, type: abnormal, description: Service has halted (this termination was not expected)
          java.io.IOException: Call to localhost/127.0.0.1:8012 failed on local exception: java.net.ConnectException: Connection refused
          at org.apache.hadoop.ipc.Client.call(Client.java:699)
          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
          at org.apache.hadoop.mapred.$Proxy7.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
          at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
          at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
          at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:453)
          at org.apache.hadoop.mapred.TaskTracker.innerStart(TaskTracker.java:831)
          at org.apache.hadoop.util.Service.start(Service.java:186)
          at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.innerDeploy(HadoopServiceImpl.java:480)
          at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.access$000(HadoopServiceImpl.java:47)
          at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl$ServiceDeployerThread.execute(HadoopServiceImpl.java:630)
          at org.smartfrog.sfcore.utils.SmartFrogThread.run(SmartFrogThread.java:279)
          at org.smartfrog.sfcore.utils.WorkflowThread.run(WorkflowThread.java:117)

          //and the nested exception

          Caused by: java.net.ConnectException: Connection refused
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
          at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
          at org.apache.hadoop.ipc.Client.getConnection(Client.java:771)
          at org.apache.hadoop.ipc.Client.call(Client.java:685)

          steve_l added a comment -

          The problem could be - I repeat, could be - from HADOOP-2188, though I'm not sure. There have been too many changes to roll back, and it's easier to go forwards.

          I have a patch that (correctly) puts the task tracker back to retrying:
          [sf-startdaemon-debug] 08/11/14 15:06:43 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 5 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:43 [Thread-41] INFO datanode.DataNode : BlockReport of 0 blocks got processed in 1 msecs
          [sf-startdaemon-debug] 08/11/14 15:06:44 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 6 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:45 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 7 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:46 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 8 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:47 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 9 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:47 [TaskTracker] INFO ipc.RPC : Server at localhost/127.0.0.1:8012 not available yet, Zzzzz...
          [sf-startdaemon-debug] 08/11/14 15:06:49 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 0 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:50 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 1 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:51 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 2 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:52 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 3 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:53 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 4 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:53 [Thread-41] INFO datanode.DataNode : BlockReport of 0 blocks got processed in 1 msecs
          [sf-startdaemon-debug] 08/11/14 15:06:54 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 5 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:55 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 6 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:56 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 7 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:57 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 8 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:58 [TaskTracker] INFO ipc.Client : Retrying connect to server: localhost/127.0.0.1:8012. Already tried 9 time(s).
          [sf-startdaemon-debug] 08/11/14 15:06:58 [TaskTracker] INFO ipc.RPC : Server at localhost/127.0.0.1:8012 not available yet, Zzzzz...

          steve_l added a comment -

          This patch recognises ConnectExceptions and wraps them in new ConnectExceptions that include (host, port) info; all other IOExceptions are relayed unchanged. I haven't included any tests, but it does work for the tests we have that bring up datanodes and task trackers without the expected namenode and jobtracker.
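
          (Roughly the idea, as a sketch rather than the patch itself: rewrap a ConnectException as another ConnectException so the type survives, adding the remote address to the message and keeping the original as the cause. The helper name is made up.)

          import java.net.ConnectException;
          import java.net.InetSocketAddress;

          final class ConnectExceptions {
            // Hypothetical helper, not the actual patch code.
            static ConnectException withAddress(ConnectException e, InetSocketAddress addr) {
              ConnectException wrapped = new ConnectException(
                  "Call to " + addr + " failed on connection exception: " + e.getMessage());
              wrapped.initCause(e);   // preserve the original stack trace as the cause
              return wrapped;         // still a ConnectException, so retry logic keeps working
            }
          }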

          Hairong Kuang added a comment -

          This is indeed a regression caused by HADOOP-2188. Relaying the error is not a good idea since the stack trace is confusing to the caller of RPC. Would it be better for RPC.waitForProxy to check the root cause of the local exception and retry if the cause is a ConnectException?

          Raghu Angadi added a comment -

          Basic question: why does Client wrap one IOException in another?

          Steve, is this a vanilla 0.18? I am wondering how this actually happened. NetUtils.connect() is only in trunk. Also, "org.apache.hadoop.ipc.Client.call" does not actually catch the exception from getConnection()...

          steve_l added a comment -

          Raghu,

          > why does Client wrap one IOException in another?

          I don't know the original reason; HADOOP-3844 retained this feature and included the hostname/port at fault, which is handy for identifying configuration problems. The patch only adds these diagnostics to ConnectExceptions and passes the rest up.

          >is this a vanilla 0.18?

          I'm only working with SVN_HEAD; it's present there. If Hairong thinks it came in with HADOOP-2188, then it also exists in 0.18, but that will need a different patch.

          > Also , "org.apache.hadoop.ipc.Client.call" does not actually catch exception from getConnection() ...

          Client.call doesn't catch the exception. The problem is that RPC.waitForProxy does, and it handles ConnectException and SocketTimeoutException by logging, sleeping, and trying again. This was not happening when the ConnectException was being downgraded, so the task tracker was failing if it came up before the job tracker, rather than waiting quietly for the job tracker to come up. As a result there is a race condition in cluster startup and the cluster is more brittle.

          Here's where the exceptions get picked up in RPC.java:

          public static VersionedProtocol waitForProxy(Class protocol,
                                                       long clientVersion,
                                                       InetSocketAddress addr,
                                                       Configuration conf
                                                       ) throws IOException {
            while (true) {
              try {
                return getProxy(protocol, clientVersion, addr, conf);
              } catch (ConnectException se) {
                // namenode has not been started
                LOG.info("Server at " + addr + " not available yet, Zzzzz...");
              } catch (SocketTimeoutException te) {
                // namenode is busy
                LOG.info("Problem connecting to server: " + addr);
              }
              try {
                Thread.sleep(1000);
              } catch (InterruptedException ie) {
                // IGNORE
              }
            }
          }

          Raghu Angadi added a comment -

          > Client.call doesnt catch the exception. ...

          Right. call() and waitForProxy() didn't change w.r.t. this. Now I see: the main difference is that setupIOstreams() used to throw the exception it got before HADOOP-2188. I think it should still throw. The local exception was supposed to be rare (some other connection error after writing the initial RPC request). Throwing the exception in setupIOstreams() will get this behavior to match pre-HADOOP-2188.

          I think this should be a blocker for 0.18.3.

          Hairong Kuang added a comment -

          Yes, it looks like throwing IOException from setupIOstreams is the easiest fix. Here comes the patch! I have tested it on my local machine and it worked.

          Raghu Angadi added a comment -

          +1 on the patch.

          Semi-related: I think adding the host IP and port to the "local exception" message, as suggested by Steve in HADOOP-3844, would be nice to have in this jira.

          Hairong Kuang added a comment -

          I do not think HADOOP-3844 is related to this jira. HADOOP-3844 does not add hostname:port to the message; HADOOP-3844 tried to add the root cause of the exception to its message. With this patch, ConnectException does not get wrapped. So this patch does not degrade HADOOP-3844.

          steve_l added a comment -

          The two patches in here do slightly different things, and need to be merged.

          - Mine left the class of an exception alone (and, for ConnectExceptions, inserted the host and port into the text, as with HADOOP-3844).
          - Hairong's stopped some exceptions getting swallowed during setupIOstreams.

          I think both are needed, so will apply Hairong's to my code and generate a combined patch.

          I can test this by deploying an orphan task tracker, but in that situation, once the code is fixed, the TaskTracker will spin forever. If a timeout on the retries could be provided, we could add a test that verified the tracker ran for 20-30s before timing out and relaying the exception. In production you'd set the timeout to a number of hours or forever, obviously.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394093/connectRetry.patch
          against trunk revision 718232.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3603/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3603/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3603/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3603/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          On second thought, throwing IOException from setupIOstreams is not a complete solution. While one RPC is in the middle of setupIOstreams, there might be a different call that uses the same connection and is about to enter setupIOstreams. If the first setup gets a ConnectException, the second call will end up seeing a closed connection and the ConnectException gets delivered to the call via call.error. This means that the ConnectException will be wrapped.

          I am thinking of solving the problem using my initial solution: check the root cause in waitForProxy.

          As for testing, I like Steve's idea. How about adding a waitForProxy API with a timeout to RPC? The current waitForProxy could use a timeout of the max long value.
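
          (A sketch of how such a timed, cause-checking waitForProxy might look; everything beyond the existing getProxy call and exception types is assumed, and this is not the committed patch.)

          public static VersionedProtocol waitForProxy(Class protocol,
                                                       long clientVersion,
                                                       InetSocketAddress addr,
                                                       Configuration conf,
                                                       long timeoutMillis) throws IOException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            IOException lastFailure = null;
            while (System.currentTimeMillis() < deadline) {
              try {
                return getProxy(protocol, clientVersion, addr, conf);
              } catch (IOException ioe) {
                Throwable cause = ioe.getCause();
                boolean serverNotReady = ioe instanceof ConnectException
                    || cause instanceof ConnectException
                    || ioe instanceof SocketTimeoutException
                    || cause instanceof SocketTimeoutException;
                if (!serverNotReady) {
                  throw ioe;           // a real error: do not keep retrying
                }
                lastFailure = ioe;     // server not up or busy: wait and retry
              }
              try {
                Thread.sleep(1000);
              } catch (InterruptedException ie) {
                // ignore, matching the existing waitForProxy loop
              }
            }
            if (lastFailure != null) {
              throw lastFailure;
            }
            throw new IOException("Timed out waiting for " + addr);
          }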

          Raghu Angadi added a comment -

          > [...] second call will end up seeing a closed connection and ConnectException gets delivered to call in call.error. [...]

          From the code it does not seem to be the case (addCall() in getConnection() will return false). I might be mistaken...

          But if what you say is correct, then it is a bug that new call tries to use a closed connection rather than creating a new one as it should.

          Hairong Kuang added a comment -

          I meant that for the second call, the call gets added to the connection before the first call marks the connection as closed.

          Hairong Kuang added a comment -

          This patch checks the cause of the failure when an RPC client tries to connect to an RPC server. It retries if the failure is caused by an unavailable or busy server.

          It adds a new static method waitForProxy with a timeout, mainly for the purpose of testing. A unit test is added to TestRPC to make sure that the client retries. A manual test was also conducted: starting a DataNode without starting the NameNode causes the DataNode to retry.

          Steve, could you please review and test the patch in your setup? I would appreciate any feedback.

          steve_l added a comment -

          I'll look at this today. Last night's test run hung on TestFileCreationClient which may indicate some problems there; more research needed.

          steve_l added a comment -

          I'm going to put a merged patch up, but although the RPC test is passing, the spinning appears to be creating a deadlock in TestFileCreationClient; relevant bits of the thread dump follow.

          1. We're sleeping here holding Connection@0x2e4f3e0

          [junit] "DataStreamer for file /wrwelkj/file9 block blk_-4298389317957709021_1010" id=133 idx=0x210 tid=25976 prio=5 alive, in native, sleeping, native_waiting, daemon
          [junit] at java/lang/Thread.sleep(J)V(Native Method)
          [junit] at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:373)
          [junit] at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
          [junit] ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
          [junit] at org/apache/hadoop/ipc/Client$Connection.access$1700(Client.java:177)
          [junit] at org/apache/hadoop/ipc/Client.getConnection(Client.java:791)
          [junit] at org/apache/hadoop/ipc/Client.call(Client.java:697)
          [junit] at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
          [junit] at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)

          2. Which is blocking this
          [junit] – Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
          [junit] at jrockit/vm/Threads.sleep(I)V(Native Method)
          [junit] at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
          [junit] at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
          [junit] at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
          [junit] at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
          [junit] at org/apache/hadoop/ipc/Client.call(Client.java:697)
          [junit] at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
          [junit] at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
          [junit] at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)

          and this

          [junit] – Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
          [junit] at jrockit/vm/Threads.sleep(I)V(Native Method)
          [junit] at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
          [junit] at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
          [junit] at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
          [junit] at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
          [junit] at org/apache/hadoop/ipc/Client.call(Client.java:697)
          [junit] at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
          [junit] at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
          [junit] at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
          [junit] ^-- Holding lock: java/util/LinkedList@0x1eb5e20[fat lock]
          [junit] at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
          [junit] – end of trace
          [junit] "DataStreamer for file /wrwelkj/file5 block blk_7479178383257153500_1010" id=127

          and this
          idx=0x200 tid=25971 prio=5 alive, in native, blocked, daemon
          [junit] – Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x2e4f3e0[thin lock]
          [junit] at jrockit/vm/Threads.sleep(I)V(Native Method)
          [junit] at jrockit/vm/Locks.waitForThinRelease(Locks.java:1233)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1307)[optimized]
          [junit] at jrockit/vm/Locks.monitorEnter(Locks.java:2389)[optimized]
          [junit] at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:219)
          [junit] at org/apache/hadoop/ipc/Client$Connection.access$1600(Client.java:177)
          [junit] at org/apache/hadoop/ipc/Client.getConnection(Client.java:785)
          [junit] at org/apache/hadoop/ipc/Client.call(Client.java:697)
          [junit] at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:216)
          [junit] at $Proxy7.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:340)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:327)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:364)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:299)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:286)
          [junit] at org/apache/hadoop/hdfs/DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:141)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2469)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream.access$1700(DFSClient.java:1997)
          [junit] at org/apache/hadoop/hdfs/DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2160)
          [junit] ^-- Holding lock: java/util/LinkedList@0x1ea6858[fat lock]
          [junit] at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)

          So: the sleep in setupIOstreams appears to be blocking the other operations. For some reason, <junit> isn't timing out or killing the process, which implies this is fairly serious.
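
          (The locking pattern behind this, reduced to a toy example; the class and method names are illustrative, not the real Client internals. One thread sleeps inside a synchronized method of a shared connection object, so any other thread calling a synchronized method on the same object blocks for the whole retry interval.)

          class ToyConnection {
            synchronized void setupIOstreams() throws InterruptedException {
              // Retry loop that sleeps while still holding this object's monitor.
              Thread.sleep(10_000);
            }

            synchronized void addCall(String call) {
              // Cannot run until setupIOstreams releases the monitor.
              System.out.println("queued " + call);
            }
          }

          public class LockDemo {
            public static void main(String[] args) throws InterruptedException {
              ToyConnection conn = new ToyConnection();
              Thread connector = new Thread(() -> {
                try {
                  conn.setupIOstreams();
                } catch (InterruptedException ignored) {
                }
              });
              connector.start();
              Thread.sleep(100);                   // let the connector grab the monitor first
              conn.addCall("getProtocolVersion");  // blocks for ~10 seconds
            }
          }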

          steve_l added a comment -

          Here's a merged patch that retains the same exception types as before (so the calling code does not need to look inside nested exceptions), and which contains the test. The new TestRPC test is correctly detecting failure to connect.

          Where we do have a problem is that on my machine (64-bit JRockit JVM on Ubuntu), I'm seeing TestFileCreationClient hang and it appears to be in these methods. Accordingly I'm not setting the patch available flag as it may cause trouble for Hudson.

          Hairong Kuang added a comment -

          Hi Steve, I have a few comments on your new patch:
          1. I do not think it is right to throw IOException in setupIOstreams, as I commented yesterday. If setupIOstreams does throw, then most of the time the ConnectException will get thrown at getConnection in call() and will not be delivered to call.error, so it is of no use to wrap the exception there.
          2. We wrap the local exception on purpose in HADOOP-2811. The reason we did that is to give the RPC caller the right stack trace. An RPC may fail at the time of setting up a connection, sending the request, or receiving a response, or because of a failure caused by another RPC call. Those errors may occur in different threads. The stack trace will be confusing to the caller if we do not wrap them.

          I want to get waitForProxy working in 0.18.3. Can we agree on a minimum change to make it work? RPC is so fundamental in Hadoop that any minor change may cause unexpected problems, so I am thinking that the smaller the change the better. If you do not like the idea of checking the cause of the failure in waitForProxy, I am OK with your idea of Client.call() wrapping ConnectException as ConnectException, SocketTimeoutException as SocketTimeoutException, and other exceptions as IOException.

          As for the junit test failure, I cannot reproduce it on my local machine. Can you check why the RPC server is not up, leaving the first thread stuck in waitForProxy? Could you please tell me on which line of the test it hangs?

          Hairong Kuang added a comment -

          It looks like the hanging junit test TestFileCreationClient is caused by HADOOP-4703.

          steve_l added a comment -

          OK, I'll update my code and rerun it, though my timetable for Monday/Tuesday is a bit patchy. I also have to handle the jetty6 migration, which, while welcome, may complicate my life for a day or so.

          Hairong Kuang added a comment -

          Hi Steve, I already have a patch which uses most of the code in your previous patch. You have a very clean programming style. The change that you may not like is that all exceptions are still wrapped. Please let me know what you think. Since you are busy, I will be happy to do the testing etc.

          steve_l added a comment -

          I'm going to push out my updated lifecycle patches shortly. One test I have there brings up a tasktracker without the rest of the infrastructure (DFS, jobtracker); it is now hanging until the test times out, spinning while things get set up, waiting for a job tracker that never arrives.

          [junit] Tue Nov 25 13:50:13 2008
          [junit] BEA JRockit(R) R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64
          [junit] "Main Thread" id=1 idx=0x4 tid=4074 prio=5 alive, in native, sleeping, native_waiting
          [junit] at java/lang/Thread.sleep(J)V(Native Method)
          [junit] at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:364)
          [junit] at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:310)
          [junit] ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1184b18[thin lock]
          [junit] at org/apache/hadoop/ipc/Client$Connection.access$1800(Client.java:177)
          [junit] at org/apache/hadoop/ipc/Client.getConnection(Client.java:792)
          [junit] at org/apache/hadoop/ipc/Client.call(Client.java:688)
          [junit] at org/apache/hadoop/ipc/RPC$Invoker.invoke(RPC.java:215)
          [junit] at org/apache/hadoop/mapred/$Proxy0.getProtocolVersion(Ljava/lang/String;J)J(Unknown Source)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:347)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:334)
          [junit] at org/apache/hadoop/ipc/RPC.getProxy(RPC.java:371)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:308)
          [junit] at org/apache/hadoop/ipc/RPC.waitForProxy(RPC.java:285)
          [junit] at org/apache/hadoop/mapred/TaskTracker.initialize(TaskTracker.java:454)
          [junit] ^-- Holding lock: org/apache/hadoop/mapred/TaskTracker@0x34c4748[recursive]
          [junit] at org/apache/hadoop/mapred/TaskTracker.innerStart(TaskTracker.java:830)
          [junit] ^-- Holding lock: org/apache/hadoop/mapred/TaskTracker@0x34c4748[thin lock]
          [junit] at org/apache/hadoop/util/Service.start(Service.java:186)
          [junit] at org/apache/hadoop/util/Service.deploy(Service.java:654)
          [junit] at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:965)
          [junit] at org/apache/hadoop/mapred/TaskTracker.<init>(TaskTracker.java:948)

          What I propose here is to give TaskTracker a timeout on its waitForProxy() operation, so that if the TT comes up before the JT, there's a bit of leeway, but eventually the TT will conclude that it is an orphan and that it cannot start up.

          steve_l added a comment -

          Adding a timeout on TaskTracker's call to waitForProxy() lets us have a TaskTracker that eventually gives up when the JobTracker isn't there. The other places where waitForProxy() is called with no timeout are in the connection routines of DataNode and SecondaryNameNode. I think all of these should be designed to take a configuration parameter that tells them when to give up, with a default value of many minutes or more to deal with basic choreography issues in a cluster. Test clusters can be set up to fail sooner rather than later.
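
          (For illustration, a daemon's startup could read such a parameter before calling the timed waitForProxy; the property name, default value, and exact waitForProxy signature below are assumptions, not the actual patch.)

          // Hypothetical: key name, default, and the timed waitForProxy signature are assumptions.
          long giveUpAfterMs = conf.getLong("mapred.tasktracker.connect.timeout",
                                            60L * 60 * 1000);   // e.g. one hour by default
          InterTrackerProtocol jobClient = (InterTrackerProtocol)
              RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID,
                               jobTrackAddr, conf, giveUpAfterMs);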

          Thoughts?

          Hairong Kuang added a comment -

          Yes, I really like the idea of having a timeout in waitForProxy when TaskTracker, DataNode, and SecondaryNameNode connect to a server. I would propose doing this in trunk in a separate jira since it is not a regression. What do you think?

          Hairong Kuang added a comment -

          Steve, thanks for testing this patch and putting a lot of thought into this issue. I created HADOOP-4724 to handle the TT, DN, and SN timeout problem.

          Hairong Kuang added a comment -

          Ant test-core passed:
          BUILD SUCCESSFUL
          Total time: 122 minutes 21 seconds

          Ant test-patch passed too:
          [exec] +1 overall.

          [exec] +1 @author. The patch does not contain any @author tags.

          [exec] +1 tests included. The patch appears to include 3 new or modified tests.

           [exec] +1 javadoc. The javadoc tool did not generate any warning messages.

          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good.

          Hairong Kuang added a comment -

          I just committed this. Thank you Steve!

          Raghu Angadi added a comment -

          A timeout of '0' conventionally means it should not time out; here it returns immediately. I think we should have JavaDoc for waitForProxy() so that this is explicit. Hopefully these get fixed in HADOOP-4724.
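
          (Something along these lines, wording mine, would make the semantics explicit; the signature is taken from the discussion above and is an assumption.)

          /**
           * Get a proxy connection to the server, retrying until it responds
           * or the timeout expires.
           *
           * Note: unlike socket timeouts, a timeout of 0 does NOT mean "wait forever";
           * the method gives up immediately. Pass Long.MAX_VALUE to wait indefinitely.
           */
          public static VersionedProtocol waitForProxy(Class protocol, long clientVersion,
              InetSocketAddress addr, Configuration conf, long timeout) throws IOException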

          Konstantin Shvachko added a comment -

          I am adding the incompatible-change flag because, e.g., the secondary name-node used to fail after 10 attempts; now it retries forever.

          Raghu Angadi added a comment -

          It is not incompatible w.r.t. 0.17.


  People

    • Assignee: Steve Loughran
    • Reporter: Steve Loughran
    • Votes: 0
    • Watchers: 2
