Hadoop Common / HADOOP-2910

Throttle IPC Client/Server during bursts of requests or server slowdown

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.18.0
    • Component/s: ipc
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I propose the following to avoid an IPC server being swamped by too many requests and connections (a rough sketch of these ideas follows the list):
      1. Limit the call queue length, or limit the amount of memory used by the call queue. This can be done by including the size of each request in its header and storing requests in the call queue in their raw (not yet unmarshaled) form.
      2. If the call queue or its memory budget is full, stop reading requests from sockets. Requests then stay in the server's system buffer or at the client side, which eventually throttles the client.
      3. Limit the total number of connections. Do not accept new connections once the connection limit is exceeded. (Note: this solution is unfair to new connections.)
      4. If an out-of-memory exception occurs while reading a request, close the current connection.
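
      A rough sketch of how these pieces could fit together (hypothetical names, not the actual ipc.Server code; the numbered comments refer to the items above):

      import java.io.DataInputStream;
      import java.io.IOException;
      import java.net.Socket;

      class ThrottledReadSketch {
        /** Returns the request bytes, or null to signal "stop reading for now". */
        static byte[] readOneRequest(Socket socket, boolean callQueueFull) throws IOException {
          if (callQueueFull) {
            return null;              // item 2: leave data in the socket buffers, throttling the client
          }
          DataInputStream in = new DataInputStream(socket.getInputStream());
          int len = in.readInt();     // item 1: request size taken from the header
          try {
            byte[] data = new byte[len];
            in.readFully(data);
            return data;              // would be queued, still in raw form, on the call queue
          } catch (OutOfMemoryError e) {
            socket.close();           // item 4: close the connection we cannot afford
            throw e;
          }
        }
      }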

      Attachments

      1. callQueue.patch
        0.8 kB
        Hairong Kuang
      2. callQueue1.patch
        3 kB
        Hairong Kuang
      3. callQueue2.patch
        3 kB
        Hairong Kuang
      4. callQueue3.patch
        4 kB
        Hairong Kuang
      5. throttleClient.patch
        17 kB
        Hairong Kuang
      6. TestBacklog.java
        0.7 kB
        Doug Cutting
      7. TestBacklogWithPool.java
        2 kB
        Raghu Angadi
      8. TestBacklog.java
        0.9 kB
        Doug Cutting
      9. throttleClient1.patch
        10 kB
        Hairong Kuang
      10. throttleClient2.patch
        10 kB
        Hairong Kuang


          Activity

           Hudson added a comment -

           Integrated in Hadoop-trunk #469 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/469/ )

          Hairong Kuang added a comment -

          I just committed this.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12380651/throttleClient2.patch
          against trunk revision 645773.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2297/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2297/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2297/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2297/console

          This message is automatically generated.

          Raghu Angadi added a comment -

          +1.

          Hairong Kuang added a comment -

          This patch incorporated Raghu's comments.

          Raghu Angadi added a comment -

          Looks good to me. Couple of things :

          1. socket is closed only when max retries are reached. Should it be closed on every error? Looks like this problem existed before.
           2. Could you leave a comment about the numbers 20000 and 45 (maybe saying it amounts to 15 min of waiting to connect to a server).
          3. (minor): 15 min is pretty long, though it could be higher. I guess if it turns out to be not long enough in some cases, it could be increased.
          Hairong Kuang added a comment -

           Ok, after some tests and tries, here comes the patch. It has the following changes (the retry policy is sketched below):
           1. Client connection has a 20 second timeout;
           2. Client retries up to 45 times when a SocketTimeout occurs. It backs off 1 second before each retry, so a client keeps retrying for at least 15 minutes before it declares a connection failure;
           3. Server blocks when the call queue is full;
           4. DoAccept accepts up to 10 connections.
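
           A minimal sketch of that retry policy (a hypothetical helper, not the exact ipc.Client code; the 20 s / 45 / 1 s numbers are the ones listed above):

           import java.io.IOException;
           import java.net.InetSocketAddress;
           import java.net.Socket;
           import java.net.SocketTimeoutException;

           class ConnectWithRetry {
             static Socket connect(InetSocketAddress server) throws IOException, InterruptedException {
               final int maxRetries = 45;
               final int connectTimeoutMs = 20 * 1000;
               for (int attempt = 0; ; attempt++) {
                 Socket socket = new Socket();
                 try {
                   socket.connect(server, connectTimeoutMs);   // blocks for at most 20 seconds
                   return socket;
                 } catch (SocketTimeoutException e) {
                   socket.close();
                   if (attempt >= maxRetries) {
                     throw e;                                  // give up after roughly 15 minutes
                   }
                   Thread.sleep(1000);                         // back off 1 second before retrying
                 }
               }
             }
           }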

          Raghu Angadi added a comment -
           • Why do we want exponential backoff? The short timeout in the proposal was intentional. For example, if the server is too busy for 2 minutes, it would be better if the client connects in 2 min 15 sec rather than 4 minutes. I don't think we really need to throttle SYN packets.
           • Do we want the overall timeout to be just 10 min? Note that the original patch (callQueue3.patch) removed the timeout for connect() (maybe with the intention of waiting as long as necessary).
          Hairong Kuang added a comment -

           Wow, so many discussions on this jira while I was on vacation. I am OK with Raghu's proposal. I plan to use an exponential backoff algorithm when a connect operation times out. A client retries the connection for 10 minutes.

          Raghu Angadi added a comment -


           Yes. Handling the fd limit on the server is necessary only to handle out-of-control applications creating many clients by mistake. Right now, it will essentially bring down the NameNode. Of course, even when we handle the limit, the NameNode will still be severely affected. In that sense it is not so urgent.

          Doug Cutting added a comment -

          Increasing file descriptor limit and backlog are one-time administrative operations on the OS. So the fix here is a documentation fix: we should describe these parameters and recommend increasing them for large clusters. Things should run fine on 100 node clusters w/o modifying these, no? And, with connect retry, even a large cluster shouldn't require an increased socket backlog. But it will probably require increasing the file descriptor limit. Fortunately that's a pretty well-known parameter already.

           Raghu Angadi added a comment - edited

          If I could add one more to this jira, I would increase the queue size to at least 15k or so (or multiply handlers by 250-300).

          Should I file another jira for handling file descriptor limits?

           Regarding Windows: we should add a FAQ entry discouraging users from using standard Windows for large clusters (especially as the NameNode or JobTracker, etc.). We should recommend Windows Server. We don't need to try hard to write a good server when the OS deliberately tries to avoid it.

           > To me this indicates that the backlog is effective up to around 16k connections, but that connections queued this way will time out after four minutes.

           I don't think 16k matters at all... even if we start 100k clients or 2k clients, most of them will time out around the same time, because beyond a certain limit the kernel just ignores the SYN packets. This is what happens on my Linux box:

           1. It accepts 131 connections immediately.
           2. It delays responding to 40 or so connections.
           3. Then it accepts 2 more connections every 9-10 seconds.
           4. Most clients will time out around 3-4 minutes because their SYNs are lost.
          Doug Cutting added a comment -

          +1 for Raghu's proposals for this issue. Adding a connect retry loop on the client causes us to not rely on the backlog or any other connection, instead pushing bursts back to the clients, which is more scalable than trying to handle them on the server. Retrying requests can exacerbate server load, but retrying connects should not.

          Doug Cutting added a comment -

           Here's a slightly modified version. I can get around the thread limit by specifying '-Xss48k' and '-Xms128M'. I also introduced a 5ms pause in the loop, to give the threads more of a fighting chance. It still fails around 16-24k connections, but now with "Connection refused" for the most recent connection attempts. So I changed it to only make 15k connections, then sleep. Now the connect attempts time out after around four minutes.

          To me this indicates that the backlog is effective up to around 16k connections, but that connections queued this way will timeout after four minutes.

           I also tried Windows and found that it failed after a few hundred connections, and could find no way to tweak OS settings to improve this.

          Raghu Angadi added a comment -

          Ignoring Windows problem, proposal for this jira and for near future improvements:

          For this jira :

          1. callQueue3.patch
          2. Increase default backlog to a large value. There is no advantage to a smaller value.
           3. Make the client connect timeout short (15 sec?). The client retries many times (maybe for a couple of hours) in case of timeout. A short timeout is better than waiting for the 189 sec TCP timeout mentioned above, since it caps the connection latency at 15 secs when the server is temporarily busy.
           4. Make Server.doAccept() accept more than one connection at a time, something like 10 or 100 each time (not 100% required).

          Useful changes, may be in near future :

          1. Limit the total number of accepted connections, so that Server does not run out of file descriptors. This limit could be something like 90% of fd limit of the process (if we get hold of such a limit).
           2. Make the max queue size proportional to the number of clients rather than the number of handlers. If the server does not read the RPC, the kernel ends up paying for the memory instead of the process. Even with 100k requests, it probably does not take much more than 100MB. I suggest something like 20-30k.
           3. With a larger queue size, there could be a worst case that takes a lot of memory (e.g. block reports from 4k datanodes), so the Server could have a memory limit as well.
             • This memory calculation need not be very accurate. It could be proportional to the size of the RPC request on the wire (say 1.5 times the bytes read from the socket for an RPC, plus the length of the write data).
          Raghu Angadi added a comment -

           The attached TestBacklogWithPool.java is a modified version of TestBacklog.java. It shows the connection delays and that connect() actually fails after 189 seconds.

           If we can ignore Windows (maybe Windows Server does not have this problem), we can make a couple of minor additions to the earlier patch.

          Raghu Angadi added a comment -

           On Windows, the client gets connection refused after 200 connects. Everything happens immediately. Maybe there are some system-level options to make connection handling more reasonable on Windows.

          Raghu Angadi added a comment -

          Doug, what do you conclude from the above?

           I modified the test to connect in a loop rather than in new threads. What I saw is that 128 or so connected immediately. Then new connections trickle in slowly (one every few seconds). These connections are in 'SYN_RECV' on the server side and 'ESTABLISHED' on the client side. So the kernel is probably accepting extra connections beyond 128 very slowly. This is on Linux. Not sure if this is consistent with your impression. It looks like the kernel keeps 40 or so connections in the SYN_RECV state and slowly moves them to the ESTABLISHED state.

          I will modify the threaded version to connect using a thread pool of 100 threads and wait for 10min once an exception occurs.

          We had a discussion with Sameer and Sanjay. I will write up a small proposal that will be an extension of earlier (reverted) patch in the next comment.

          Doug Cutting added a comment -

          Here's a simple test program that opens a server socket with a backlog of 100,000, then starts spawning threads that try to connect to it until an exception is thrown. No connections are ever accepted. On my linux laptop it fails after 17,500 connections are queued, when it runs out of threads. My file handle limit and backlog were both set to 100,000 for this.
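
           The attached TestBacklog.java is authoritative; as a rough idea, such a test might look like the following sketch (assumed structure and names):

           import java.net.InetSocketAddress;
           import java.net.ServerSocket;
           import java.net.Socket;

           public class BacklogSketch {
             public static void main(String[] args) throws Exception {
               ServerSocket server = new ServerSocket(0, 100000);   // backlog of 100,000, never accept()ed
               final InetSocketAddress addr =
                   new InetSocketAddress("localhost", server.getLocalPort());
               for (int i = 0; ; i++) {                             // spawn connectors until something fails
                 final int id = i;
                 new Thread(() -> {
                   try {
                     Socket s = new Socket();
                     s.connect(addr);                               // sits in the listen backlog
                     Thread.sleep(Long.MAX_VALUE);                  // hold the connection open
                   } catch (Exception e) {
                     System.err.println("failed at connection " + id + ": " + e);
                     System.exit(1);
                   }
                 }).start();
               }
             }
           }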

          Raghu Angadi added a comment -

          > e.g., in the number of file handles, or in the buffer size (according to Raghu's analysis).
           This particular one is easy to fix. There is no need to write a byte for every connection accepted; as long as there is at least one byte in the pipe, it's enough. Still...

           > The number of accepted connections will always be limited somewhere.
           This is true.

          Doug Cutting added a comment -

          > It does not impose a limitation on the number of connections.

          Yes, it does, the limit is just defined elsewhere, e.g., in the number of file handles, or in the buffer size (according to Raghu's analysis). The number of accepted connections will always be limited somewhere.

          There's an existing mechanism to queue connections: the socket backlog. Shouldn't we use that rather than implement our own queue of accepted connections? Then, with queuing handled by the OS, the application can be simpler. We don't need to accept connections that we're going to ignore, but rather can wait until we're ready to service them before we accept them. We then won't have to worry about running out of file handles, blocking on buffers, etc.

          Perhaps the socket backlog won't work for this, but I don't see that anyone has tested that yet. I'll write some simple tests & attach them to this issue.

          Raghu Angadi added a comment -

           In addition to Doug's comment: though this patch intends to accept all the incoming connections, it looks like it is limited by the buffer size for the pipe (4k?). Once this buffer is full, the Acceptor thread will spin, right?

           Apart from the accept policy, since we handle the queue differently with the patch, should the queue size be larger (at least on the order of the number of clients the NameNode expects)? Right now it is 100*handlers. I think 100 is too low. How would the system behave when 12k clients are expected and the queue size is 4k?

          Hairong Kuang added a comment -

          > Which behaviour is that?
          Server continues to accept connections even when its call queue is full. It does not impose a limitation on the number of connections.

          > this patch keeps accepting connections until it runs out of file handles.
          The number of connections that a server can accept might be far less than the number of file handles. I think TCP/IP will reject a connection before the acceptor calls accept.

          > it seems reasonable to limit this to the call queue length as an upper bound,
           The call queue length is proportional to the number of handlers but not to the number of clients in the system. Also, one connection may have more than one request in the call queue.

          Doug Cutting added a comment -

          > The new patch keeps the old IPC server behavior [ ... ]

          Which behaviour is that? Isn't the old behaviour buggy?

          > This patch simply says that it depends on O.S. to decide when to refuse a connection.

           No, this patch keeps accepting connections until it runs out of file handles, then starts logging exceptions as it tries to accept more but fails because it has run out of file handles. File handle exhaustion may also cause in-progress requests that do i/o to fail.

          Wouldn't it be more reliable to limit the number of connections? In the interests of simplifying configuration, it seems reasonable to limit this to the call queue length as an upper bound, as was done in the prior version.

           The backlog isn't a panacea, it's simply a way to efficiently queue more clients. On the operating systems where folks run Hadoop, it should be effective at this, so we shouldn't ignore it, should we?

          Hairong Kuang added a comment -

           The new patch keeps the old IPC server behavior while throttling the clients when the call queue is full. Yes, there is a limit on file descriptors. This patch simply says that it depends on the O.S. to decide when to refuse a connection. Having an accepted-connection queue in the server helps if we want to limit the total number of connections that the server accepts.

           Setting the backlog queue length is simple, but different operating systems may have their own limits on the max length allowed, so it is not portable. And this approach is not equivalent to the old behavior.

          Doug Cutting added a comment -

          I am familiar with the discussion in HADOOP-3109, and am still unsure why this new approach is preferred. The number of connections is limited by the file handle limit, so at some point we need to stop accepting new connections. The listen queue is a good place to do this without outright refusing the connections. It too is limited in size, but its default value can easily be increased. Wouldn't that be a simpler and more scalable approach than adding a new thread and queue in the namenode?

          Hairong Kuang added a comment -

          I am sorry, Doug. I did not follow the issue very closely. The discussion is at HADOOP-3109. With the old patch, we were concerned that an IPC server stops accepting new connections when the call queue is full. This changes the server behavior. The new patch allows the server to continue accepting new connections when the call queue becomes full.

          Doug Cutting added a comment -

          I'm still curious about why the previous version was nixed. I see no discussion here explaining it. Was there something on the list that I missed?

          Hairong Kuang added a comment -

           This patch includes the following changes (sketched below):
           1. Add an acceptor thread responsible for accepting connections;
           2. Add a pipe connecting the acceptor and the listener threads;
           3. Whenever a new connection is accepted, the acceptor notifies the listener thread through the pipe;
           4. The listener thread sleeps when the call queue is full.
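
           A minimal sketch of the acceptor/pipe arrangement described above (assumed structure, not the reverted patch itself; the listener would register the pipe's source end with its selector and drain the queue when woken up):

           import java.nio.ByteBuffer;
           import java.nio.channels.Pipe;
           import java.nio.channels.ServerSocketChannel;
           import java.nio.channels.SocketChannel;
           import java.util.Queue;
           import java.util.concurrent.ConcurrentLinkedQueue;

           class AcceptorSketch implements Runnable {
             private final ServerSocketChannel acceptChannel;
             private final Pipe.SinkChannel sink;          // write end of the pipe, owned by the acceptor
             final Queue<SocketChannel> accepted = new ConcurrentLinkedQueue<>();

             AcceptorSketch(ServerSocketChannel acceptChannel, Pipe pipe) {
               this.acceptChannel = acceptChannel;
               this.sink = pipe.sink();
             }

             public void run() {
               ByteBuffer one = ByteBuffer.allocate(1);
               try {
                 while (true) {
                   SocketChannel conn = acceptChannel.accept();   // blocking accept
                   accepted.add(conn);
                   one.clear();
                   sink.write(one);                               // nudge the listener through the pipe
                 }
               } catch (Exception e) {
                 // shutting down
               }
             }
           }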

          Doug Cutting added a comment -

          > I reverted this patch.

          Why? What broke?

          Hairong Kuang added a comment -

          I reverted this patch.

           Hudson added a comment -

           Integrated in Hadoop-trunk #434 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/434/ )

          dhruba borthakur added a comment -

          I just committed this. Thanks Hairong!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12378204/callQueue3.patch
          against trunk revision 619744.

          @author +1. The patch does not contain any @author tags.

          tests included -1. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new javac compiler warnings.

          release audit +1. The applied patch does not generate any new release audit warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests -1. The patch failed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1996/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1996/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1996/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1996/console

          This message is automatically generated.

          Hairong Kuang added a comment -

          This patch makes sure that an InterruptedException gets caught by the listener thread, which therefore gets stopped.

          Raghu Angadi added a comment -

          > But I'd like to do it in a different jira.

          From my side, I don't think we need another thread. +1 for the policy in this jira.

          Hairong Kuang added a comment -

          > Note that removing connect timeout does not mean connect will wait as long as server is up.
           Yes, I agree. It also depends on the number of pending connections that TCP/IP allows. If we do not want connect to fail while the server is up, then we need a separate thread that accepts connections and does not block when the call queue is full. But I'd like to do it in a different jira.

           > One thing I am not sure is if there is a possibility where connect() will hang forever..
           I do not think so. Connect is more like send, so the client can detect whether the server is down or not. HADOOP-2188 handles the case where receive is not able to detect that the server machine got powered down unexpectedly.

           Raghu Angadi added a comment - edited

           Note that removing the connect timeout does not mean connect will wait as long as the server is up. It just makes the timeout whatever TCP has (around 5 min?). So if the server is very busy and does not accept (i.e. ignores SYNs), the client will still fail (which I think is not that bad).

           One thing I am not sure about is whether there is a possibility where connect() will hang forever... which would be strange to the user... there is no equivalent of ping as in HADOOP-2188.

          Edit : typos.

          Hairong Kuang added a comment -

          This patch removes the client-side connection timeout as well.

          Doug Cutting added a comment -

          > Or we could have another thread that accepts connections. So it won't block when the call queue is full.

          Yes, we could, and that would need to block when there are too many connections. I think it might be simpler to only block on the call queue and remove the connect timeout. The difference is just whether clients making calls when the queue is full get blocked in connect() or in write() – either way, such clients will be blocked until the call queue is smaller. So I don't think we do much better with a separate thread.

          Hairong Kuang added a comment -

          > Handlers now mostly ignore 'running' and might wait forever.
           This won't happen because the stop method interrupts the handlers and the listener. But yes, I will add a try-catch to handle InterruptedException so that the handlers and the listener stop only when running is false (see the sketch below).
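
           A sketch of the intended handler loop (assumed shape, not the exact Server.Handler code): take() blocks on an empty queue, stop() interrupts the thread, and the InterruptedException is swallowed so the loop exits only once running has become false.

           import java.util.concurrent.BlockingQueue;

           class HandlerLoopSketch implements Runnable {
             private final BlockingQueue<Object> callQueue;
             private volatile boolean running = true;

             HandlerLoopSketch(BlockingQueue<Object> callQueue) {
               this.callQueue = callQueue;
             }

             void stop(Thread handlerThread) {
               running = false;
               handlerThread.interrupt();          // wake a handler blocked in take()
             }

             public void run() {
               while (running) {
                 try {
                   Object call = callQueue.take(); // blocks until a call is queued
                   // ... process the call ...
                 } catch (InterruptedException e) {
                   // interrupted by stop(); loop around and re-check 'running'
                 }
               }
             }
           }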

          > Should we now also remove the connect timeout?
           I thought about it too. Yes, I think we can do it this way. Or we could have another thread that accepts connections, so it won't block when the call queue is full.

          Doug Cutting added a comment -

          > Remember hadoop-2188 removes the client timeout.

          It only removes the read timeout, but not the connect timeout. Should we now also remove the connect timeout?

          Raghu Angadi added a comment -

           Patch looks good. Handlers now mostly ignore 'running' and might wait forever. Maybe handlers could use poll() rather than take() if we want to keep that functionality. I am not sure how important it is.

           > Remember hadoop-2188 removes the client timeout.

           It does help the application greatly, but not the kernel/OS. I guess most TCP stacks are designed to handle this case well, especially if the connections are not accepted by the application.

          Hairong Kuang added a comment -

          This patch implements callQueue as a blocking queue.

          Hairong Kuang added a comment -

          Remember hadoop-2188 removes the client timeout.

           On second thought, I'd like to implement the call queue as a bounded buffer. Adding an entry to a full queue causes the thread to block, and removing an entry from an empty queue also blocks the thread. So no more busy waiting, and a cleaner interface.
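
           A minimal illustration of the bounded-buffer behaviour (the capacity shown here is only an assumption): an ArrayBlockingQueue makes put() block when the queue is full and take() block when it is empty, so neither the listener nor the handlers busy-wait.

           import java.util.concurrent.ArrayBlockingQueue;
           import java.util.concurrent.BlockingQueue;

           class BoundedCallQueueDemo {
             public static void main(String[] args) throws InterruptedException {
               int handlerCount = 10;
               BlockingQueue<String> callQueue =
                   new ArrayBlockingQueue<>(100 * handlerCount);  // bounded capacity
               callQueue.put("call-1");           // listener side: blocks while the queue is full
               String call = callQueue.take();    // handler side: blocks while the queue is empty
               System.out.println("served " + call);
             }
           }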

          Raghu Angadi added a comment -

          > Ok, to avoid busy waiting, the listener checks if the call queue is full or not before it calls select. It waits if the queue is full. How does this sound?

          +1. This is much simpler than fiddling with the keys.

           One minor improvement we could make now or in the future is to have the selector thread wait for a low water mark (say 90% of the max call queue) when it decides to wait for the call queue to be drained. This would reduce the number of selector loops when the queue is hovering at the max.

          > I think it is ok to accept new connections when the queue is full.

           Not selecting, as you suggested, will stop accepting new connections, which I think is good. It will fail clients only in extreme cases (i.e. the queue stays full for many minutes)... in such cases, I guess

          Doug Cutting added a comment -

          Waiting while the call queue is full before calling select sounds good. That will block both new connections and new input on existing connections.

          > Our goal is to slow down clients not to fail clients.

           Yes, but at some point we need to back up load onto the clients. We can't queue everything at the server. So if, when the call queue is full, we start timing out clients that are connecting, then we can increase the server socket's backlog and/or increase or remove the connect timeout on the client.

          Hairong Kuang added a comment -

          Ok, to avoid busy waiting, the listener checks if the call queue is full or not before it calls select. It waits if the queue is full. How does this sound?

           I think it is ok to accept new connections when the queue is full. Otherwise it is going to fail the applications. Our goal is to slow down clients, not to fail them.
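
           A sketch of that listener-side check (hypothetical field names, not the actual Server.Listener code): before calling select(), the listener waits until the call queue has room, so bursts back up in the kernel buffers and at the clients instead of being read.

           import java.nio.channels.Selector;
           import java.util.Queue;

           class ListenerWaitSketch {
             private final Selector selector;
             private final Queue<Object> callQueue;
             private final int maxQueueSize;

             ListenerWaitSketch(Selector selector, Queue<Object> callQueue, int maxQueueSize) {
               this.selector = selector;
               this.callQueue = callQueue;
               this.maxQueueSize = maxQueueSize;
             }

             void listenOnce() throws Exception {
               synchronized (callQueue) {
                 while (callQueue.size() >= maxQueueSize) {  // queue full: don't read more requests
                   callQueue.wait(1000);                     // handlers notify (or we re-check) as calls drain
                 }
               }
               selector.select();                            // now accept and read as usual
               // ... handle selected keys and enqueue new calls ...
             }
           }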

          Raghu Angadi added a comment -

           > The way we handle this while writing is to set interestOps to 0 (canceling the key is a pain).
           Just cancelling is ok in this case, I guess. While writing, it could be done by multiple threads, and that was a problem (HADOOP-2789).

          Raghu Angadi added a comment -

           It should cause a busy-wait. The key is removed from selectedKeys, but not canceled. So the next select() will return immediately because the socket has data to read. The way we handle this while writing is to set interestOps to 0 (canceling the key is a pain).

          > Also, shouldn't we avoid accepting new connections when the queue is full too?
          I think so.
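
           For reference, the interestOps trick mentioned above looks roughly like this (a sketch, not the HADOOP-2789 code):

           import java.nio.channels.SelectionKey;

           class InterestOpsSketch {
             static void pauseReads(SelectionKey key) {
               key.interestOps(0);                        // selector stops reporting this channel
             }

             static void resumeReads(SelectionKey key) {
               key.interestOps(SelectionKey.OP_READ);     // start reading from it again
               key.selector().wakeup();                   // in case the selector is blocked in select()
             }
           }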

          Doug Cutting added a comment -

          On second thought, it won't cause busy-wait, but rather, since the key is removed from the selector, readAndProcess() won't be called again until more data arrives for that connection. Instead, the connection should be re-registered when the callQueue size is back within range. So perhaps we need a queue of connections that have unread input that we put things into when the call queue is full, then pop them off later when the queue empties. Of course, that would defeat the purpose of this patch...

          Also, shouldn't we avoid accepting new connections when the queue is full too?

          Doug Cutting added a comment -

          This will cause us to busy-wait, no?

          Tsz Wo Nicholas Sze added a comment -
           • It is better to use >= instead of == in isCallQueueFull().
           • We should also update the "if (callQueue.size() >= maxQueueSize) {..." in processData().
          Hairong Kuang added a comment -

          This patch does not start to read an IPC request when the call queue is full.

          Doug Cutting added a comment -

          Let's try the easy solution first and see if it suffices, limiting the call queue length and, when it's too long, refusing to accept new calls.

          Hairong Kuang added a comment -

          > Limit the total number of connections.
           Most operating systems provide parameters to limit the total number of connections on the machine. It seems unnecessary for the IPC server to be concerned about it, so I am dropping this requirement from this jira.

           Regarding limiting the resources consumed by incoming requests, we could do either of the following:
           Solution 1. Limit the call queue length. The IPC server stops reading from any connection if the call queue is full.
           Solution 2. Limit the amount of memory used by the call queue. This requires the following changes:

           • The call queue contains raw (not yet unmarshaled) RPC requests;
           • Requests get unmarshaled after being taken out of the call queue and before being served;
           • Keep a count of the total size of the partial and complete raw requests;
           • Stop reading from a connection if the incoming request size + total size > max buffer size.

           Solution 1 is simple. Solution 2 provides more accurate memory management and parallel request unmarshaling. Potentially we could do buffer management to eliminate the frequent buffer allocation & deallocation discussed in HADOOP-2975. But solution 2 is more complicated and not favorable to large messages like block reports.

          Please comment! Thank you.

          Hairong Kuang added a comment -

          > Limit call queue length or limit the amount of memory used in the call queue.

           Currently each IPC request has a header indicating the length of the packet. The IPC server reads the length, allocates a buffer of that length, reads the request into the buffer, unmarshals it, then puts the unmarshaled request into the call queue. If we want to limit the amount of memory used in the call queue, we can keep a counter tracking the total size of the queued requests. Whenever the server reads a request from the socket into the buffer, increment the counter; when a handler takes the request out of the queue, decrement the counter. The server stops reading from the socket if the incoming request would make the counter exceed the max size limit.
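
           A sketch of that counter (hypothetical names; the actual patch may differ): the length from the request header is reserved before the request is read and released when a handler dequeues the call, and reading stops while the next request would push the total past the limit.

           import java.util.concurrent.atomic.AtomicLong;

           class CallQueueBytes {
             private final AtomicLong queuedBytes = new AtomicLong(0);
             private final long maxBytes;

             CallQueueBytes(long maxBytes) {
               this.maxBytes = maxBytes;
             }

             /** Called by the listener before reading a request of 'len' bytes off a socket. */
             boolean tryReserve(int len) {
               while (true) {
                 long current = queuedBytes.get();
                 if (current + len > maxBytes) {
                   return false;                               // stop reading from this socket for now
                 }
                 if (queuedBytes.compareAndSet(current, current + len)) {
                   return true;                                // go ahead and read the request
                 }
               }
             }

             /** Called by a handler after it takes the corresponding call off the queue. */
             void release(int len) {
               queuedBytes.addAndGet(-len);
             }
           }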


            People

            • Assignee: Hairong Kuang
            • Reporter: Hairong Kuang
            • Votes: 0
            • Watchers: 3
