I've recently found a couple of different failure scenarios with NettyTransceiver that result in:
- Client threads blocking for long periods of time (uninterruptibly at that) while holding the stateLock write lock
- RPCs (either sync or async) never returning because a failure in sending the RPC was not propagated back up to the caller
The patch I'm going to submit will probably be a lot easier to understand, but I'll try to explain the main problems I found. There is a single type of underlying connectivity issue that seems to trigger both of these problems in NettyTransceiver: a failure at the network layer causes all packets to be dropped somewhere between the RPC client and server. You might think this would be a rare scenario, but it has happened several times in our production environment and usually occurs after the RPC server machine becomes unresponsive and needs to be physically rebooted. The only way I've been able to reproduce this scenario for testing purposes has been to set up an iptables rule on the RPC server that simply drops all incoming packets from the client. For example, if the client's IP is 10.0.0.1 I would use the following iptables rule on the server to reproduce the failure:
After looking through a lot of stack traces, I think I've identified two main problems:
Problem 1: NettyTransceiver calls ChannelFuture#awaitUninterruptibly(long) in two places: getChannel() and disconnect(boolean,boolean,Throwable). Under the dropped-packet scenario I outlined above, the client thread ends up blocking uninterruptibly for the entire connection timeout while holding the stateLock write lock, so every other thread that needs stateLock is stalled for the same duration. The stack trace for this situation looks like this:
At a minimum it should be possible to interrupt these connection attempts.
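To illustrate what "interruptible" means here, below is a rough standalone sketch, not the patch itself. It uses plain java.util.concurrent instead of Netty so that it runs on its own: a CountDownLatch that never counts down stands in for a connect attempt whose packets are silently dropped, and the class and method names are made up for illustration. In NettyTransceiver the analogous change would presumably be to call ChannelFuture#await(long), which throws InterruptedException, instead of awaitUninterruptibly(long).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the transceiver; not Avro or Netty code.
class InterruptibleConnectSketch {
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();

    // Never counted down: simulates a connect whose packets are all dropped.
    private final CountDownLatch connected = new CountDownLatch(1);

    // Waits for the connection while holding the write lock, but both the lock
    // acquisition and the wait respond to Thread.interrupt().
    boolean connect(long timeoutMillis) throws InterruptedException {
        stateLock.writeLock().lockInterruptibly();
        try {
            return connected.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            stateLock.writeLock().unlock(); // released even when interrupted
        }
    }

    boolean isWriteLocked() {
        return stateLock.isWriteLocked();
    }

    // Starts a connect with a 60 s timeout, interrupts it, and reports whether
    // the worker stopped promptly and released the write lock.
    static boolean demo() {
        InterruptibleConnectSketch sketch = new InterruptibleConnectSketch();
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            try {
                sketch.connect(60_000);
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });
        worker.start();
        worker.interrupt();
        try {
            worker.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawInterrupt.get() && !worker.isAlive() && !sketch.isWriteLocked();
    }

    public static void main(String[] args) {
        System.out.println("interrupted promptly: " + demo());
    }
}
```

With an uninterruptible wait in connect(), the worker in this sketch would sit on the write lock for the full 60 seconds regardless of the interrupt.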
Problem 2: When an error occurs while writing to the Netty channel, the error is not passed back up the stack or callback chain (whether it's a sync or async RPC). The client can therefore end up waiting indefinitely for an RPC that will never return, because the request was never sent to the server in the first place. This scenario might yield a stack trace like the following:
It's difficult to provide a unit test for these issues because a connection-refused error alone will not trigger them. The only way I've been able to reliably reproduce them is by setting the iptables rule I mentioned above. Hopefully a code review will be sufficient, but if necessary I can try to find a way to create a unit test.
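For Problem 2, the general shape of a fix can be sketched without Netty as well. This is only an illustration under assumptions, not the patch: a CompletableFuture stands in for the ChannelFuture returned by the channel write, and the Callback interface below is a hypothetical stand-in with handleResult/handleError methods. The point is that a listener on the write future forwards any failure to the caller's callback, so the caller stops waiting instead of hanging.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not Avro or Netty code.
class WriteFailurePropagationSketch {

    // Stand-in for an RPC callback with a result path and an error path.
    interface Callback<T> {
        void handleResult(T result);
        void handleError(Throwable error);
    }

    // writeFuture stands in for the future returned by the channel write.
    // If the write fails, the failure is handed to the caller's callback.
    static <T> void send(CompletableFuture<Void> writeFuture, Callback<T> callback) {
        writeFuture.whenComplete((ignored, cause) -> {
            if (cause != null) {
                callback.handleError(cause);
            }
        });
    }

    // Simulates a failed write and reports what the waiting caller observes.
    static String demo() {
        CompletableFuture<String> rpcResult = new CompletableFuture<>();
        Callback<String> callback = new Callback<String>() {
            @Override public void handleResult(String result) {
                rpcResult.complete(result);
            }
            @Override public void handleError(Throwable error) {
                rpcResult.completeExceptionally(error);
            }
        };

        CompletableFuture<Void> write = new CompletableFuture<>();
        send(write, callback);
        write.completeExceptionally(new IOException("write failed"));

        try {
            rpcResult.get(1, TimeUnit.SECONDS);
            return "result";
        } catch (ExecutionException e) {
            return "error: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            return "hung"; // what the caller sees if the failure is swallowed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Removing the body of send() reproduces the symptom described above in miniature: the caller's wait ends only because of its own timeout, never because it was told about the failed write.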