Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.18.0
    • Component/s: ipc
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Hide
      Removed property ipc.client.maxidletime from the default configuration. The allowed idle time is twice ipc.client.connection.maxidletime.
      Show
      Removed property ipc.client.maxidletime from the default configuration. The allowed idle time is twice ipc.client.connection.maxidletime.

      Description

      IPC server determines if a connection is idle or not by checking if the connection does not have any IO activity for a predefined max idle time. An idle connection will be closed even if the connection still has outstanding requests or replies. This causes RPC failures when a server becomes slow or if a request takes a long time to be served. In jira, I'd like to propose the following changes to IPC idle management:
      1. Add data structures to the IPC server that keep track of outstanding requests.
      2. IPC server does not close a connection that has outstanding requests/replies even when it has no IO activities for a while.
      3. The default client-side max idle time should be in several minutes not 1 second.
      4. The server-side max idle time should be greater than the client-side max idle time, for example, twice of the client-side max idle time. So server mainly deals with clients that are crashed without closing
      its connections.

      1. idleConn.patch
        3 kB
        Hairong Kuang

        Issue Links

          Activity

          Hide
          Ankur added a comment -

          > 1. Add data structures to the IPC server ...
          I think one very simple way of doing this would be to add a 'pendingRequests' field to the Connection class that keeps the count of pending requests. However if one needs to see the details of pending requests for a connection then making 'pendingRequests' as a list that contains all the unprocessed calls might be more suitable but slightly redundant as the unprocessed calls are already queued in the server 'callQueue'.

          > 2. IPC server does not close a connection ...
          > 3. The default client-side ...
          > 4. The server-side max ...
          This can be taken care of in Connection.timedOut() by checking the connection's responseQueue/pendingRequests count. However we need to be able to differentiate between server slowdown, client slowdown and client crashes. Setting a higher idle time for the client and even higher idle time for the server seems to be the right choice.

          Show
          Ankur added a comment - > 1. Add data structures to the IPC server ... I think one very simple way of doing this would be to add a 'pendingRequests' field to the Connection class that keeps the count of pending requests. However if one needs to see the details of pending requests for a connection then making 'pendingRequests' as a list that contains all the unprocessed calls might be more suitable but slightly redundant as the unprocessed calls are already queued in the server 'callQueue'. > 2. IPC server does not close a connection ... > 3. The default client-side ... > 4. The server-side max ... This can be taken care of in Connection.timedOut() by checking the connection's responseQueue/pendingRequests count. However we need to be able to differentiate between server slowdown, client slowdown and client crashes. Setting a higher idle time for the client and even higher idle time for the server seems to be the right choice.
          Hide
          Hairong Kuang added a comment -

          Ankur, you read my mind. Here is a patch that implements the idea.

          Show
          Hairong Kuang added a comment - Ankur, you read my mind. Here is a patch that implements the idea.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383264/idleConn.patch
          against trunk revision 662569.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12383264/idleConn.patch against trunk revision 662569. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2540/console This message is automatically generated.
          Hide
          Ankur added a comment -

          Looks good, just one point

          • Connection.timedOut() will not be able to differentiate between client crash and client slowdown in case of pending RPC requests. Should we just wait for max idle time and time out the connection if there has been no activity instead of checking pending RPC request ?
          Show
          Ankur added a comment - Looks good, just one point Connection.timedOut() will not be able to differentiate between client crash and client slowdown in case of pending RPC requests. Should we just wait for max idle time and time out the connection if there has been no activity instead of checking pending RPC request ?
          Hide
          Ankur added a comment -

          On second thoughts a crashed client would cause an IOException in Responder thread on a response write attempt. This should happen in ResponderThread.processResponse() at channel.write(call.response) (line 587) if the client has crashed, right ? In this case the connection is closed properly in the finally block.

          So I don't see any issue with this patch

          Show
          Ankur added a comment - On second thoughts a crashed client would cause an IOException in Responder thread on a response write attempt. This should happen in ResponderThread.processResponse() at channel.write(call.response) (line 587) if the client has crashed, right ? In this case the connection is closed properly in the finally block. So I don't see any issue with this patch
          Hide
          Hairong Kuang added a comment -

          Ankur, thanks for reviewing this patch. Yes, this patch mainly handles those crashed clients that have no outstanding requests. Crashed clients with outstanding requests are detected by the responder when writing replies. Slow clients with outstanding requests are handled by Server.Responder.doPurge.

          Show
          Hairong Kuang added a comment - Ankur, thanks for reviewing this patch. Yes, this patch mainly handles those crashed clients that have no outstanding requests. Crashed clients with outstanding requests are detected by the responder when writing replies. Slow clients with outstanding requests are handled by Server.Responder.doPurge.
          Hide
          Hairong Kuang added a comment -

          I just committed this.

          Show
          Hairong Kuang added a comment - I just committed this.

            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development