|
> Limit the total number of connections.
Most operation systems provide some parameters to limit the total number of connections on the machine. It seems that it is unnecessary for the IPC server to concern about it. So I drop this requirement in this jira. Regarding to limiting the resource consumed by the incoming requests, we could either
Solution 1 is simple. Solution 2 provides more accurate memory management and parallel request serialization. Potentially we could do a buffer management to eliminate the frequent buffer allocation & deallocation discussed in HADOOP-2975. But solution 2 is more complicated and not favorable to large messages like block reports. Please comment! Thank you. Let's try the easy solution first and see if it suffices, limiting the call queue length and, when it's too long, refusing to accept new calls.
This patch does not start to read an IPC request when the call queue is full.
This will cause us to busy-wait, no?
On second thought, it won't cause busy-wait, but rather, since the key is removed from the selector, readAndProcess() won't be called again until more data arrives for that connection. Instead, the connection should be re-registered when the callQueue size is back within range. So perhaps we need a queue of connections that have unread input that we put things into when the call queue is full, then pop them off later when the queue empties. Of course, that would defeat the purpose of this patch...
Also, shouldn't we avoid accepting new connections when the queue is full too? It should cause busy wait. The key is removed from selectedKeys, but not canceled. So next select() will return immediately because the socket has read data. The way we handle this while writing is to set interstedOps to 0 (canceling key is pain).
> Also, shouldn't we avoid accepting new connections when the queue is full too? > The way we handle this while writing is to set interstedOps to 0 (canceling key is pain).
Just cancelling is ok in this case I guess. While writing it could be done by multiple threads and that was a problem ( Ok, to avoid busy waiting, the listener checks if the call queue is full or not before it calls select. It waits if the queue is full. How does this sound?
I think it is ok to accept new connections when the queue is full. Otherwise it is going to fail the applications. Our goal is to slow down clients not to fail clients. Waiting while the call queue is full before calling select sounds good. That will block both new connections and new input on existing connections.
> Our goal is to slow down clients not to fail clients. Yes, but at some point we need to backup load onto the clients. We can't queue everything at the server. So if, when the call queue is full we start timing out clients that are connecting, then we can increase the server sockets backlog and/or increase/remove the connect timeout on the client. > Ok, to avoid busy waiting, the listener checks if the call queue is full or not before it calls select. It waits if the queue is full. How does this sound?
+1. This is much simpler than fiddling with the keys. One minor improvement we could have now or in future is to make the selector thread wait for a low water mark (say 90% of max call queue), when it decides to wait for call queue to be drained. this would reduce number of selector loops when the queue is hovering at max. > I think it is ok to accept new connections when the queue is full. Not selecting as you suggested will stop accepting new connections, which I think is good. It will fail clients only in extreme cases (ie queue stays full for many minutes).. in such cases, I guess Remember hadoop-2188 removes the client timeout.
On a second thought, I'd like to implement the call queue as a bounded buffer. Adding an entry to a full queue causes a thread to block and removing an entry from an empty queue also blocks the thread. So no more busy waiting and cleaner interface. This patch implements callQueue as a blocking queue.
Patch looks good. Handlers now mostly ignore 'running' and might wait forever. May be handlers could use poll() rather than take() if we want to keep the functionality. I am not sure how important it is.
> Remember hadoop-2188 removes the client timeout. It does help the application greatly, but not kernel/OS. I guess most TCP stacks are designed to handle this case well.. especially if the connections are not accepted by the application. > Remember hadoop-2188 removes the client timeout.
It only removes the read timeout, but not the connect timeout. Should we now also remove the connect timeout? > Handlers now mostly ignore 'running' and might wait forever.
This won't happen because the stop method intterupts handlers and listener. But yes, I will add a try-catch to handlle InterruptedException so handlers and the listener stop only when running is false. > Should we now also remove the connect timeout? > Or we could have another thread that accepts connections. So it won't block when the call queue is full.
Yes, we could, and that would need to block when there are too many connections. I think it might be simpler to only block on the call queue and remove the connect timeout. The difference is just whether clients making calls when the queue is full get blocked in connect() or in write() – either way, such clients will be blocked until the call queue is smaller. So I don't think we do much better with a separate thread. This patch removes the client-side connection timeout as well.
Note that removing connect timeout does not mean connect will wait as long as server is up. It just makes the timeout to whatever TCP has (around 5 min?). So if the server is very busy and does not accept (i.e. ignores syns), client will still fail (which I think is not that bad).
One thing I am not sure is if there is a possibility where connect() will hang forever.. which would be strange to the user.. there is no equivalent of ping as in Edit : typos. > Note that removing connect timeout does not mean connect will wait as long as server is up.
Yes, I agree. It also depends on the number of the pending connections that TCP/IP allows. If we do not want connect to fail when the server is up, then we need a separate thread to accept connections that do not block when the call queue is full. But I'd like to do it in a different jira. > One thing I am not sure is if there is a possibility where connect() will hang forever.. > But I'd like to do it in a different jira.
From my side, I don't think we need another thread. +1 for the policy in this jira. This patch makes sure that an InterruptedException gets caught by the listener thread, which therefore gets stopped.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12378204/callQueue3.patch against trunk revision 619744. @author +1. The patch does not contain any @author tags. tests included -1. The patch doesn't appear to include any new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests -1. The patch failed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1996/testReport/ This message is automatically generated. I just committed this. Thanks Hairong!
Integrated in Hadoop-trunk #434 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/434/
> I reverted this patch.
Why? What broke? This patch includes the following changes:
1. Add an acceptor thread responsible for accepting connections; 2. Add a pipe, connecting the acceptor and the listener threads. 3. Whenenver a new connection is accepted, the acceptor notifies the listener thread through the pipe. 4. The listener thread sleeps when the call queue is full. I'm still curious about why the previous version was nixed. I see no discussion here explaining it. Was there something on the list that I missed?
I am sorry, Doug. I did not follow the issue very closely. The discussion is at
I am familiar with the discussion in
The new patch keeps the old IPC server behavior while throttling the clients when the call queue is full. Yes, there is a limit on file descriptors. This patch simply says that it depends on O.S. to decide when to refuse a connection. Having the accepted connection queue in the server helps if we want to limit the total number of connections that the server wants to accept.
Setting the backlog queue length is simple, but different operating systems may its limit on the max length it allows. So it is not portable. And this approach is not equivalent to the old behavior. > The new patch keeps the old IPC server behavior [ ... ]
Which behaviour is that? Isn't the old behaviour buggy? > This patch simply says that it depends on O.S. to decide when to refuse a connection. No, this patch keeps accepting connections until it runs out of file handles, then starts logging exceptions as it tries to accept more but fails because it has run out of file handles. File handle exhaustion may also cause requests in progress that do i/o to fail too. Wouldn't it be more reliable to limit the number of connections? In the interests of simplifying configuration, it seems reasonable to limit this to the call queue length as an upper bound, as was done in the prior version. The backlog isn't a panacea, its simply a way to efficiently queue more clients. On the operating systems where folks run Hadoop, it should be effective at this, so we shouldn't ignore it, should we? > Which behaviour is that?
Server continues to accept connections even when its call queue is full. It does not impose a limitation on the number of connections. > this patch keeps accepting connections until it runs out of file handles. > it seems reasonable to limit this to the call queue length as an upper bound, In addition to Doug's comment, though this patch indents accept all the incoming connections, looks like it is limited by the buffer size for the pipe (4k?).. once this buffer is full, Acceptor thread will spin, right?
Apart from accept policy, since we handle queue differently with the patch, should queue size be larger (at least in the order of number of client NameNode expects).. right now it is 100*handlers. I think 100 is too low. How would the system behave when 12k clients are expected and queue size is 4k? > It does not impose a limitation on the number of connections.
Yes, it does, the limit is just defined elsewhere, e.g., in the number of file handles, or in the buffer size (according to Raghu's analysis). The number of accepted connections will always be limited somewhere. There's an existing mechanism to queue connections: the socket backlog. Shouldn't we use that rather than implement our own queue of accepted connections? Then, with queuing handled by the OS, the application can be simpler. We don't need to accept connections that we're going to ignore, but rather can wait until we're ready to service them before we accept them. We then won't have to worry about running out of file handles, blocking on buffers, etc. Perhaps the socket backlog won't work for this, but I don't see that anyone has tested that yet. I'll write some simple tests & attach them to this issue. > e.g., in the number of file handles, or in the buffer size (according to Raghu's analysis).
This particular one is easy to fix. There is no need write a byte for every connection accepted, as long as there is at least one byte in the pipe, its enough. still... > The number of accepted connections will always be limited somewhere. Here's a simple test program that opens a server socket with a backlog of 100,000, then starts spawning threads that try to connect to it until an exception is thrown. No connections are ever accepted. On my linux laptop it fails after 17,500 connections are queued, when it runs out of threads. My file handle limit and backlog were both set to 100,000 for this.
Doug, what do you conclude from the above? I modified the test to connect in a loop rather than in new threads. What I saw is that 128 or so connected immediately.. Then new connections are tricking slowly (one every a few seconds). These connections are in 'SYN_RECV' on sever side and 'ESTABLISHED' on client side. So the kernel is probably accepting extra connections beyond 128 very slowly. This is on Linux. Not sure if this consistent with your impression. Looks like kernel keeps 40 or so connections in SYN_RECV state and slowly moves them to ESTABLISHED state. I will modify the threaded version to connect using a thread pool of 100 threads and wait for 10min once an exception occurs. We had a discussion with Sameer and Sanjay. I will write up a small proposal that will be an extension of earlier (reverted) patch in the next comment. On windows, client gets connection refused after 200 connects.Everything happens immediately. May be there are some systems level options to make connection handling more reasonable on windows.
attached TestBacklogWithPool.java is modified version of TestBacklog.java. This shows connection delays and that connect() actually fails after 189 seconds.
If we can ignore Windows (may be windows server does not have this problem), we can have couple of minor additions to earlier patch. Ignoring Windows problem, proposal for this jira and for near future improvements: For this jira :
Useful changes, may be in near future :
Here's a slightly modified version. I can get around the thread limit by specifying '-Xss48k' and '-Xms128M'. I also introduced a 5ms pause in the loop, to give the threads more of a fighting chance. It still fails around 16-24k connections, but now with "Connection refused" for the most recent connection attempts. So I changed it to only make 15k connections, then sleep. Now the connect attempts timeout after around four minutes.
To me this indicates that the backlog is effective up to around 16k connections, but that connections queued this way will timeout after four minutes. I also tried Windows and also found that it failed after a few hundred connections and could find no way to tweak OS settings to improve this. +1 for Raghu's proposals for this issue. Adding a connect retry loop on the client causes us to not rely on the backlog or any other connection, instead pushing bursts back to the clients, which is more scalable than trying to handle them on the server. Retrying requests can exacerbate server load, but retrying connects should not.
If I could add one more to this jira, I would increase the queue size to at least 15k or so (or multiply handlers by 250-300).
Should I file another jira for handling file descriptor limits? Regd windows: we should add in FAQ discouraging users from using standard windows for large clusters (especially as NameNode or JobTracker etc.). We should recommend Windows server. We don't need to try hard to write a good server when the OS deliberately tries to avoid it.
I don't think 16k matters at all.. even if we start 100k client or 2k clients, most of them will timeout around the same time, because beyond certain limit, kernel just ignores the SYN packets. This is what happens on my Linux box :
Increasing file descriptor limit and backlog are one-time administrative operations on the OS. So the fix here is a documentation fix: we should describe these parameters and recommend increasing them for large clusters. Things should run fine on 100 node clusters w/o modifying these, no? And, with connect retry, even a large cluster shouldn't require an increased socket backlog. But it will probably require increasing the file descriptor limit. Fortunately that's a pretty well-known parameter already.
Yes. Handling fd limit the server is necessary only to handle some out of control applications creating many clients by mistake. Now, it will essentially bring down the NameNode. Of course, even when we handle the limit, NameNode will still be severely affected. In that sense it is not so urgent. Wow, so many discussions on this jira while I was on vacation.
Ok, after some tests and trys, here comes the patch. It has the following changes:
1. Client connection has a 20 second timeout; 2. Client retries up to 45 times when SocketTimeout occurs. It backs off 1 second before it retries. So a client keeps retying for at least 15 minutes before it declares connnection failures; 3. Server blocks when call queue is full; 4. DoAccept accepts up to 10 connections. Looks good to me. Couple of things :
This patch incorporated Raghu's comments.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12380651/throttleClient2.patch against trunk revision 645773. @author +1. The patch does not contain any @author tags. tests included -1. The patch doesn't appear to include any new or modified tests. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new javac compiler warnings. release audit +1. The applied patch does not generate any new release audit warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2297/testReport/ This message is automatically generated. Integrated in Hadoop-trunk #469 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/469/
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Currently each IPC request has a header indicating the length of the packet. IPC server reads the length, then allocates a buffer with a size of the length, reads the request into the buffer, marshals it, then put the marshaled request into the call queue. If we want to limit the amount of memory used in the call queue, we can keep a counter tracking the total size of the unmarshaled requests. Whenever the server reads a request from the socket into the buffer, increment the counter. When a handler takes the request out of the queue, decrement the counter. It stops reading from the socket if the incoming request will make the counter to exceed the max size limit.