Thrift / THRIFT-888

async client should also have nonblocking connect

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4, 0.5
    • Fix Version/s: 0.5
    • Component/s: Java - Library
    • Labels:
      None

      Description

      Otherwise its purpose of avoiding thread pooling around blocking I/O is defeated. This was discussed briefly in the comments of https://issues.apache.org/jira/browse/THRIFT-862 and will most likely obviate https://issues.apache.org/jira/browse/THRIFT-843
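
      For context, here is a minimal sketch of how the Java async client is wired together with TAsyncClientManager and TNonblockingSocket. MyService and its ping method are hypothetical stand-ins for any generated service, and callback signatures vary slightly between Thrift releases, so treat this as an illustration rather than exact library code:

      import org.apache.thrift.async.AsyncMethodCallback;
      import org.apache.thrift.async.TAsyncClientManager;
      import org.apache.thrift.protocol.TBinaryProtocol;
      import org.apache.thrift.transport.TNonblockingSocket;

      public class AsyncPingSketch {
        public static void main(String[] args) throws Exception {
          // One manager (and one select thread) can drive many async clients.
          TAsyncClientManager clientManager = new TAsyncClientManager();
          TNonblockingSocket transport = new TNonblockingSocket("localhost", 9090);

          // MyService is a hypothetical generated Thrift service.
          MyService.AsyncClient client =
              new MyService.AsyncClient(new TBinaryProtocol.Factory(), clientManager, transport);

          // The call returns immediately; the callback runs on the manager's select thread.
          // (The exact AsyncMethodCallback signature differs across Thrift versions.)
          client.ping(new AsyncMethodCallback<MyService.AsyncClient.ping_call>() {
            public void onComplete(MyService.AsyncClient.ping_call response) {
              System.out.println("ping completed");
            }

            public void onError(Exception e) {
              e.printStackTrace();
            }
          });
        }
      }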

        Activity

        Eric Jensen added a comment -

        Final patch attached for integration. Reviewed this heavily with Ning and he approves. The core change is that it provides lazy non-blocking connect behavior from TNonblockingSocket when it interacts with TAsyncClientManager, by adding a new connect state to the method call (the generic NIO pattern behind this is sketched after the list below). Also included:

        1. Bumped SELECT_TIME down so that we can have finer-grained timeouts.

        2. Catch runtime exceptions within the entire manager select thread so that exceptions generated from a callback don't kill it.

        3. Refactored the async test a bit. Initial attempts at this had transient failures in the JankyRunnable portion of TestTAsyncClientManager, where the caller would either never be called back for the connect or would get a SocketException: Connection reset by peer on finishConnect. When I profiled the test I found that the single TNonblockingServer$SelectThread was busy doing reads and writes all the way up until the timeout. Even though we make these calls in parallel, we dispatch them in serial in the server, and that appeared to be the bottleneck: it was possible that we started a whole bunch of calls but didn't finish any of them until roughly the same time. If each took 0.5ms in the worst case, that could be 500*100*0.5ms = 25s, so the one-second wait was just too aggressive for 500 threads with 100 calls per thread. I increased that timeout and reduced numThreads to 50 (to fix the connection-reset problems, which I can't explain), and I no longer see any failures after running the test 100 times on Snow Leopard or Linux.
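
        For reference, the "lazy non-blocking connect" described above boils down to the standard java.nio pattern sketched here: initiate connect() without blocking, register for OP_CONNECT, and complete the handshake with finishConnect() on the select thread. This is a generic illustration (host, port, and class names are made up), not the actual TAsyncClientManager code; the catch around the dispatch also shows the shape of item 2 above.

        import java.net.InetSocketAddress;
        import java.nio.channels.SelectionKey;
        import java.nio.channels.Selector;
        import java.nio.channels.SocketChannel;

        public class NonblockingConnectSketch {
          public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            SocketChannel channel = SocketChannel.open();
            channel.configureBlocking(false);

            // connect() only initiates the TCP handshake and returns immediately;
            // if it didn't finish on the spot, wait for OP_CONNECT instead of blocking.
            boolean connected = channel.connect(new InetSocketAddress("localhost", 9090));
            channel.register(selector, connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT);

            while (true) {
              selector.select(5); // a small select timeout allows finer-grained call timeouts
              for (SelectionKey key : selector.selectedKeys()) {
                try {
                  if (key.isConnectable() && channel.finishConnect()) {
                    // handshake complete: move the call from "connecting" to "writing"
                    key.interestOps(SelectionKey.OP_WRITE);
                  } else if (key.isWritable()) {
                    // write the framed request here, then switch interest to OP_READ
                  }
                } catch (RuntimeException e) {
                  // logging and dropping the key keeps one bad connection (or a
                  // throwing callback) from killing the whole select thread
                  e.printStackTrace();
                  key.cancel();
                }
              }
              selector.selectedKeys().clear();
            }
          }
        }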

        This test is really dirty, mostly because of copied-and-pasted code but also because it asserts within callbacks, which throws assertion failures that the selector thread subsequently catches. We should store off the throwables and assert outside of the callbacks. It also appears that part of why the test is so slow, and why I had to fiddle so much with thread counts and timeouts, is that we do synchronization within the callbacks to wake up the calling thread. This means the selector itself is constantly having to notify a bunch of threads, and I caught that as something taking time in the stack traces (since it blocks all other selector progress). A cleaner, faster test would probably be to have a truly async client as the test driver, enqueuing the next iteration into a ConcurrentLinkedQueue on the callback rather than doing something synchronized.
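
        A rough sketch of the "store off the throwables and assert outside of the callbacks" idea; the class and method names here are illustrative, not part of the actual TestTAsyncClientManager:

        import java.util.Queue;
        import java.util.concurrent.ConcurrentLinkedQueue;

        // Collects assertion failures raised inside callbacks so the select
        // thread never sees them; the test thread checks the queue afterwards.
        public class CallbackAssertions {
          private final Queue<Throwable> failures = new ConcurrentLinkedQueue<Throwable>();

          // Run inside a callback, on the select thread.
          public void check(Runnable assertion) {
            try {
              assertion.run(); // may throw AssertionError
            } catch (Throwable t) {
              failures.add(t); // record it instead of letting it escape
            }
          }

          // Run on the test thread once all calls are known to have completed.
          public void rethrowFirstFailure() {
            Throwable first = failures.peek();
            if (first != null) {
              throw new AssertionError("callback assertion failed: " + first);
            }
          }
        }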

        That said, I think it is sufficient for now and definitely covers the new non-blocking connect functionality. I also fixed a bunch of things about how it logged errors and switched from plain object wait/notify to latches, since there was a race condition where the callback could notify before the caller waits.
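
        A minimal sketch of the lost-wakeup race the latches fix, assuming the old test used plain Object wait()/notify(): a CountDownLatch remembers that countDown() already happened, so await() cannot miss the signal even when the callback finishes first.

        import java.util.concurrent.CountDownLatch;

        public class LatchVsNotifySketch {
          public static void main(String[] args) throws InterruptedException {
            CountDownLatch done = new CountDownLatch(1);

            // Simulate a callback that completes *before* the caller starts waiting.
            done.countDown();

            // Safe: the latch already reached zero, so this returns immediately.
            // With Object.wait()/notify(), a notify() issued before the caller's
            // wait() is simply lost and the caller hangs forever.
            done.await();
            System.out.println("no lost wakeup");
          }
        }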

        Bryan Duxbury added a comment -

        I think this generally looks pretty good. A few comments:

        The indentation in TAsyncClientManager$SelectThread.run is a little funky.

        If we're going to put a finally block in TNonblockingServer's run method, we should probably at least log whatever gets thrown - I hate silent catches.

        If you have concerns about the code quality in the test, I think you should just go ahead and fix it. Realistically, if you leave it the way it is, it's not going to get changed by anyone else in the foreseeable future.
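
        As a small illustration of the logging point above, a sketch of a run loop that logs rather than swallows; the slf4j logger and class name are assumptions, not the actual TNonblockingServer code:

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        public class LoggedSelectLoop implements Runnable {
          private static final Logger LOGGER = LoggerFactory.getLogger(LoggedSelectLoop.class);

          public void run() {
            try {
              // the select()/dispatch loop would live here
            } catch (RuntimeException e) {
              // log rather than swallow, so a dying select thread leaves a trace
              LOGGER.error("select loop exiting due to uncaught exception", e);
            } finally {
              LOGGER.info("select loop terminated");
            }
          }
        }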

        Eric Jensen added a comment -

        Updated to address Bryan's comments.

        Bryan Duxbury added a comment -

        I just committed this. Thanks for the patch!

        Eric Jensen added a comment -

        I fixed the indentation and logging in the attached patch. Ning is going to rewrite the test in a separate patch.


          People

          • Assignee: Eric Jensen
          • Reporter: Eric Jensen
          • Votes: 0
          • Watchers: 1
