While debugging the bug in the first implementation of
THRIFT-5214, I started to look into the rabbit hole of I/O timeouts in the go library (mainly the socket timeout on TSocket), and start to realize that the current way we handle it is not great.
From client's perspective, how a request goes is:
client prepare the request TStruct -> send the request over the wire -> start reading the response over the wire -> handle the response or I/O error
The problem here, is that server will only send the first byte of response out after it finished processing the request. So if the client incorrectly configured a socket timeout on the TSocket used by this request to a value that's too low, the reading will just report i/o timeout when the deadline reaches, and the client will just fail the whole request.
This will cause problems:
- Client can't set socket timeout to something too low. In fact, if they use a client pool (so for most requests there's no overhead on connecting), they need the socket timeout to be at least the latency SLA of the server they talk to, otherwise there's a serious risk that the client will fail a lot of requests prematurely.
- On the other hand, setting the socket timeout to something too high is also a bad practice. In case that server is overloaded and cannot handle requests in a timely fashion, client should have a way to fail faster, instead of waiting for a very long timeout to expire.
If the client set a deadline on the context object with the call, their expectation usually would be that the request will fail after the deadline passed (a small overhead is usually acceptable). But the socket timeout is usually some fixed value configured to the program, while the actual deadline on the context is more variable (e.g. this could be a server talking to upstream server, it has a fixed totally deadline for the whole endpoint, but the steps before that might take variable time so the deadline left here can vary a lot).
This leads to the point that we would need a way to keep socket timeout and the deadline set in the context object in-sync.
There are a few different options. The first obvious one, is to pass the context object all the way to TSocket.Read, so that function can extract the deadline out of the context object (if any), and set that as the socket timeout instead of the pre-configured, fixed one.
But that is also easy to be ruled out, because that would change the function signature of TSocket.Read, making TSocket no longer implement io.Reader interface.
The next option, I think, is to handle it on TProtocol level. The idea is that we pass the context object all the way into TProtocol.ReadMessageBegin (and maybe other read functions of TProcotol?). This way, it can handle the read errors, check if it's a timeout error, and if it is and the context deadline haven't passed, it can just keep retrying again. This way, we can set the fixed socket timeout to a low number when we know we will always have a real deadline in the context attached, and just let TProtocol implementation to keep retrying instead. If the user don't attach a deadline to the context object, then they still need to set a large number on the socket timeout.
A slightly different option, is to add SetReadTimeout function to TTransport interface. So all the TTransport implementations wrapping TSocket should just pass that along. If the terminal one is actually not a TSocket (for example, it's TMemoryBuffer), then this function just does nothing. This way, the TProtocol.Read* functions can just extract deadline out of the context object, and call their TTransport.SetReadTimeout with that deadline.
Either way, this is going to be a breaking change that we need to add context object to TProtocol.Read* function signatures. As a result, I believe some compiler change might also be required to pass it all the way to TProtocol calls. But even if compiler changes are needed, that part should be minimal (just make sure we are passing context object correctly), and most of the changes would be on TProtocol implementations.