|
No, the ping will only be sent if there are outstanding rpc calls, not when the connection is idle. It is still right to close the connection if it has been idle too long.
This patch hasn't been tested at scale, but it passes unit tests.
It changes: Owen, could describe the policy on Server with the patch? (I mean what and how it fixes...)
for e.g. :
It would be better if the server stopped reading the incoming queue, but I believe it is not critical for a first pass. The reason is that without the client side timeouts, the size of the queue is bounded by the number of client threads calling the server. So even if you have 2k servers running 4 maps each with 5 threads each, that is still only 40k pending calls, which won't be enough to blow out memory.
The approach looks good. The iterator typing adds noise to the patch.
I'm confused by the 'out.close()' call in sendPing(). Is that intended? Perhaps the ping interval should be configurable (undocumented) and a test case could be added that sets a short ping time, shuts down a server, and sees that the client fails promptly? > 7. Explicitly kill the active rpcs when a connection is broken.
Owen, I could not see where this is done in the patch. Could you point to the file that has this change? Also, client used to wait inside waitForWor() if there are no pending RPCs, but now it waits on read() from socket. Is that intended? Thanks for the catch, Doug. The close should have been a flush. And you are right that I probably should make a hidden switch to control the ping rate.
Raghu, the rpcs are killed at the top of Client.Connection.close. I'll think about the difference between waiting in the waitForWork versus the readint. smile Does the server side continue to have a timeout?
Seems like it should, badly behaved clients could cause a server brownout by sending RPCs, never reading the responses and so occupying all the handlers. > the rpcs are killed at the top of Client.Connection.close.
hmm.. client side does not help the server side. One big issue at the server when it is loaded is that we will have a lot of RPCs waiting and server patiently executes all of them only to realize that client side has closed the connection for many of them, this in turn slows down the other clients, they will close the connections. I think we need Sever to avoid processing RPCs from connections which are closed already... each handler could check if the connection has a error. > One big issue at the server when it is loaded is that we will have a lot of RPCs waiting and server...
with this patch client is not going to close the connection.. so the server problem may not as bad. That depends on how we handle the individual RPC timeouts What about the timeout for RPCs (set by users) as opposed to timeout for connection. There is no use of Server excuting RPC that are timedout on client side. Owen's proposal looks good. Here are a few things I plan to do at the client side:
1. Remove client.call.timeout from Configuration; 2. Connection thread waits for response with a timeout of 1 minute. 3. If it does not receive a response, it pings the server to detect server side failure. If failure is detected, mark all pending calls as done. 4. Remove ConnectionCuller thread. Instead Connection thread detects idle connection itself in waitForWork. 5. Application thread does not timeout checking for results. Instead it waits until a call is done; In addition to the changes I proposed above, it has the following changes:
1. remove the field inUse in Client.Connection. In stead use calls.size(). 2. Start the connection thread after I/O streams are set up. 3. Remove call.setResult, instead use setValuse and setException. This patch is posted for review. It passed junit tests. I will perform more extensive large scale tests. This patch additionally fixed a lot of race conditions in the IPC Client code.
This patch makes sure that the right stack trace is returned when a rpc has an error.
minor nits:
Patch 3 incorporated Doug's comments.
+1 This looks fine to me. Have you yet had a chance to test it on a large cluster?
I am looking at the patch too.. though still not done. Couple of queries :
receiveResponse() modifies calls without synchronizing on 'this', but other accesses to it do. As you mentioned there is more synchronization involved. It is harder check the correctness with 'isClosed' etc. For e.g. it took some time to see what happens sendParam() returns silently when isClosed is true. Raghu, thanks for the review comments.
> Why does the patch remove purging the connections on server? > receiveResponse() modifies calls without synchronizing on 'this', but other accesses to it do.
Calls is of type HashTable, which is a synchronized data structure. There is no need to hold the connection lock. > The purpose of this jira is not to throw away any calls.
True. But what should server do if some clients just don't read from the sockets? I think purging exists only to handle exceptional cases like (unintentionally) rogue clients. One actual case that happened is that one user accindentally started thousands of clients from one machine and these clients could not read. > Otherwise the applications will wait for the result for ever. In any case purging is usually not required and almost no calls are purged while writing in practice.. we won't notice its absence until something bad happens. +0 for removing it, from me. I don't think that should hold up the patch if you are ok with it. > As you mentioned there is more synchronization involved. It is harder check the correctness with 'isClosed' etc. For e.g. it took some time to see what happens sendParam() returns silently when isClosed is true.
Since this patch removes SocketTimeoutException, it exposes quite a lot incorrect synchronizations in the code. Previously applications receive a SocketTimeoutException when a call is lost but now applications get stuck for ever. It took me quite a lot of energy to debug and sort out the synchronization part. Thank you for taking time to check its correctness. > what should server do if some clients just don't read from the sockets? I think purging exists only to handle exceptional cases like (unintentionally) rogue clients. One actual case that happened is that one user accindentally started thousands of clients from one machine and these clients could not read. I think we should assume that clients uses IPC Client to talk to the IPC server, so no worry about their not reading from the sockets. In the case of 1000 clients per machine, if they all can send requests, why could not they read? Could you up load the latest patch that applies to trunk?
This patch incorporates Raghu's comments. It also fixed a few bugs found when running JUnit testcases.
+1.
The patch deletes "hadoop.job.history.location" from hadoop-default.xml. Not sure if this is intentional. The new patch removed the change Raghu mentioned.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378362/ipc-timeout4.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 9 new or modified tests. javadoc -1. The javadoc tool appears to have generated 1 warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs -1. The patch appears to introduce 1 new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2021/testReport/ This message is automatically generated. Fixed javadoc and findbugs warnings.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378398/ipc-timeout5.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 9 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests -1. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2024/testReport/ This message is automatically generated. This patch applies to the trunk. And one difference from the previous patch is that it keeps the purge code, which closes a connection if the connection has a pending response in the queue for more than 15 minutes.
+1. I looked only at the recent change.
We could set {{call.timestamp only inside doRespond() when it is queue to the responder. This will avoid two calls to getTime() for each RPC in the normal case... not that it will really be noticeable. This patch additionally has the following changes:
1. In Server, a call's time stamp is updated when it is received and when its reponse is sent to the Responder. 2. Client.Connection.in is accessed by only the Connection thread, so its lock is no longer to be acquired. This eliminates a possible deadlock in the code. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380967/ipc-timeout7.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 9 new or modified tests. javadoc -1. The javadoc tool appears to have generated 1 warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2332/testReport/ This message is automatically generated. +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381200/ipc-timeout8.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 9 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2352/testReport/ This message is automatically generated. This patch bumps up the RPC version number. It also fixes a bug occured during large scale testing.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381353/ipc-timeout9.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included +1. The patch appears to include 9 new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2375/testReport/ This message is automatically generated. GridMix ran successfully on 200-node cluster without any performance degradation.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12381353/ipc-timeout9.patch against trunk revision 653552. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2398/testReport/ This message is automatically generated. I just committed this.
The failure of TestTrash in the second to last hudson test seems not related to this issue. Multiple reruns of the unit test could not reproduce the failure. Integrated in Hadoop-trunk #483 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/483/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HADOOP-312). What you are proposing will effectively enable connection caching. Did I get you right? (BTW, for the task->tasktracker communication, the idle timeout is set to a very high number since the tasktracker server has to deal with only a very few number of clients).