Uploaded image for project: 'Thrift'
  1. Thrift
  2. THRIFT-862

Async client issues / improvements

    XMLWordPrintableJSON

Details

    • Task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.4
    • 0.6
    • Java - Library
    • None

    Description

      From Thrift-845:

      Hey Eric,

      You bring up some good points. We have indirect communication between the client manager and client going through the method call object, which we can refactor into direct communication. I'd like to keep the state machine separate from the async client manager, though - the client manager only coordinates transitioning the method call on ready. Responses inline for the other comments:

      1. Timeouts per-transition aren't really what I want as a consumer of this api. It looks like there are about five states, which as a consumer I need to know since the effective timeout on the whole call is then 5x the one I set. I'd rather be guaranteed I will be called back one way or another within the timeout i set, counting time i spend in the selector's queue, etc. Seems like that would be easy to implement.

      • The naming is misleading - we can have idle socket timeout or an overall method call timeout (more like a deadline). I coded for the former case, so we can modify to the latter or rename. Modifying is trivial - we just need to track the method call start time, as opposed to the last action time. The effective deadline is arbitrarily long, since we're updating the last action timestamp on each socket read/write, which doesn't necessarily correspond to a transition between method states.

      2. You should not be allowed to set a timeout lower than TAsyncClientManager.SELECT_TIME, and the value there should be much lower, maybe 5ms? Because its per-transition as above, the true minimum timeout for an entire call is 1s if each of its states took 200ms. We've actually run clients with lower than 200ms timeouts.

      • Agreed, though we won't be able to get to 5ms resolution. The reason is because the JVM makes no guarantees on actual sleep time. Maybe a better solution would be to use Selector.selectNow(), which returns immediately with all ready keys, making our timeout granularity as small as possible.

      3. TAsyncMethodCall.onError does not do anything to remove itself from TAsyncClientManager.timeoutWatchSet, so unless it shows up in selectedKeys() it will stay there until its timeout. Not a big deal, but could make that set prohibitively large if there are many errors within a timeout interval.

      • TAsyncClientManager removes the method after a transition if the method's client.hasError(). Again, somewhat confusing since it's relying on the method calling the client's onError.

      4. We call TAsyncMethodCall.transition before adding it to timeoutWatchSet, so lastTransitionTime is initialized before it is inspected. But, this is a somewhat dangerous pattern to allow an uninitialized primitive to ever be accessible. You should probably initialize it to the current time or a sentinel on instantiation.

      • Current time is good - if we end up using the deadline definition of timeout, we don't even have to update the value.

      5. it might be safer for lastTransitionTime = System.currentTimeMillis() to be at the top of transition(SelectionKey key) so you can guarantee the timestamp is updated even for errant transitions...don't think it matter's functionally in the current code, but it could in refactorings. you don't want to inadvertently obscure an error with a timeout.

      • Good point. I wanted the timestamp to reflect the last socket activity - putting at the beginning of transition could make it inaccurate on long socket reads/writes. Maybe we can use the finally block?

      6. TNonblockingServerSocket is the only non-test, non-tutorial thrift source file that uses System.err.println instead of LOGGER.warn

      • We can easily fix.

      7. You might want to catch Exception instead of Throwable? I don't think you intend or want to catch things like OutOfMemoryError, etc. The docs say "An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch"

      • Also easily fixed, though I want to make sure we didn't do this for a reason earlier. Bryan, why Throwable vs Exception?

      8. TAsyncClientManager.call should really throw an exception if its SelectThread is not running (because its run method did not catch an unexpected Throwable) instead of inserting it into a queue that will grow unbounded and cause an OOM. You probably want a catch and throw block in run that cleans up by calling onError on all pending calls and sets its done flag. You probably also want to do this on ClosedSelectorException instead of just catching and logging it in an infinite loop. It's odd that you catch it on selectedKeys() but not select().

      • That catch got folded into the catch(Throwable)

      9. timeoutIdleMethods would be a bit faster if it only called currentTimeMillis() once instead of per-check

      • Good point. I was going for accuracy but even for large timeout watch sets the iteration time should be negligible.

      10. Probably an unnecessary optimization, but instead of relying on Object.hashCode for removing timed out methods, you could identify them by their start timestamp from currentTimeMillis() + a few bits of a static atomic sequence counter, put them in a TreeSet, and shortcut the iteration for timed out methods when you hit the first one that's not too old.

      • Definitely, this was the original plan. We'll get it in on next iteration.

      Attachments

        1. thrift-862-v3.diff
          28 kB
          Ning Liang
        2. thrift-862-v2.diff
          28 kB
          Ning Liang
        3. thrift-862.diff
          27 kB
          Ning Liang

        Activity

          People

            ning Ning Liang
            ning Ning Liang
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: