Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
jcs-1.3, jcs-2.0-beta-1
None
All
Description
The classes org.apache.jcs.auxiliary.remote.RemoteCache and org.apache.jcs.auxiliary.remote.server.RemoteCacheServerFactory both try to set a timeout on RMI connections between the remote cache server and client machines. Both use the following code to install a timeout-enabled socket factory, which the RMI subsystem subsequently uses:
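RMISocketFactory.setSocketFactory( new RMISocketFactory() {
public Socket createSocket( String host, int port ) throws IOException
{ Socket socket = new Socket( host, port ); socket.setSoTimeout( timeoutMillis ); socket.setSoLinger( false, 0 ); return socket; }
public ServerSocket createServerSocket( int port ) throws IOException
{ return new ServerSocket( port ); }});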
The socket factory code above applies a "read timeout" to RMI sockets: if a connection is already established and subsequently stalls, or the peer machine goes offline, the timeout breaks the connection as intended. The code does not apply a "connect timeout", however. If an attempt is made to establish a new connection to a machine which is offline, the connection attempt has no time limit at the Java level (the default connect timeout is infinite), so the thread opening the connection can stall in the JVM indefinitely.
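To illustrate what the read timeout does and does not cover, here is a minimal sketch. It is not JCS code: the class name ReadTimeoutDemo and the loopback listener are made up for the demonstration. It connects to a listener that never sends anything and shows that setSoTimeout bounds the blocked read. The connect itself, done through the plain Socket constructor, has no Java-level limit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of JCS.
final class ReadTimeoutDemo {

    /** Returns true if a read on an idle connection is cut short by SO_TIMEOUT. */
    static boolean readTimesOut(int timeoutMillis) {
        // Loopback listener that never sends data. The OS completes the TCP
        // handshake from the listen backlog, so accept() is not needed here.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             // This constructor connects immediately, with no connect timeout.
             Socket socket = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            socket.setSoTimeout(timeoutMillis); // bounds blocking reads only
            try {
                socket.getInputStream().read(); // blocks: the peer stays silent
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Because the timeout is set only after the constructor has already connected, it cannot help when the connect itself is what hangs.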
This is not a bug in JCS code. It was a limitation of JDK 1.3: AFAIK you could not set a connect timeout on a socket.
As of JDK 1.4, there is a socket.connect(address, timeout) method, so this issue can be fixed.
Here's the required replacement code:
RMISocketFactory.setSocketFactory( new RMISocketFactory() {
    public Socket createSocket( String host, int port ) throws IOException {
        Socket socket = new Socket();
        socket.setSoTimeout( timeoutMillis );
        socket.setSoLinger( false, 0 );
        socket.connect( new InetSocketAddress( host, port ), timeoutMillis );
        return socket;
    }
    public ServerSocket createServerSocket( int port ) throws IOException {
        return new ServerSocket( port );
    }
});
This was an issue for us recently. We fixed it by installing the RMISocketFactory above in the JVM before initializing JCS. We have tested and confirmed that this code works well with JCS: it times out reads the same as before, and it now times out new connection attempts too.
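For anyone wanting to apply the same workaround, here is a rough sketch of the factory as a named class with an install helper. The names TimeoutSocketFactory and installIfAbsent are hypothetical and not part of JCS. Closing the socket on a failed connect is an addition of this sketch and is not in the snippet above. The helper exists because RMISocketFactory.setSocketFactory can only succeed once per JVM, so it has to run before anything else (such as JCS) installs a factory.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Hypothetical class name; a sketch of the workaround, not JCS code.
final class TimeoutSocketFactory extends RMISocketFactory {
    private final int timeoutMillis;

    TimeoutSocketFactory(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(); // unconnected, so connect() can be bounded
        try {
            socket.setSoTimeout(timeoutMillis);   // read timeout
            socket.setSoLinger(false, 0);
            socket.connect(new InetSocketAddress(host, port), timeoutMillis); // connect timeout
            return socket;
        } catch (IOException e) {
            socket.close(); // addition in this sketch: release the socket on failure
            throw e;
        }
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }

    /** Installs the factory unless one is already set. Returns true if installed. */
    static synchronized boolean installIfAbsent(int timeoutMillis) {
        if (RMISocketFactory.getSocketFactory() != null) {
            return false; // some other code got there first
        }
        try {
            RMISocketFactory.setSocketFactory(new TimeoutSocketFactory(timeoutMillis));
            return true;
        } catch (IOException alreadySet) {
            return false; // setSocketFactory refuses a second installation
        }
    }
}
```

Calling TimeoutSocketFactory.installIfAbsent(timeoutMillis) early in application startup, before JCS is configured, mirrors what we did. A false return value means another factory is already in place and the timeouts may not apply.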
How about including this in the next version of JCS?
By the way, I read some JCS mailing list archives from the last time socket timeouts were discussed. Not sure if this will be helpful to anyone, but we found that if an RMI client and server were running on the same subnet, timeouts were not required: each machine detected immediately that the other was offline. Across subnets through our router, however, timeouts became important, as attempts to connect to an offline machine resulted in JCS threads hanging whilst trying to connect.
We are not sure, but we suspect that this is related to our firewall blocking the ICMP "host unreachable" packets between subnets, causing different behaviour depending on the network setup. The replacement code above allows our machines in both subnets to recover when a machine is offline. Previously they just stalled.
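If anyone wants to check which behaviour their own network shows, something like the following probe may help. It is only a sketch: the class ConnectProbe and its Outcome values are invented for illustration. It separates a connection attempt that the network rejects quickly from one that gets no answer until the connect timeout fires. Exactly which exception a given platform raises can vary, so treat the classification as approximate.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical diagnostic helper, not part of JCS.
final class ConnectProbe {

    enum Outcome { CONNECTED, REJECTED, TIMED_OUT, OTHER_FAILURE }

    static Outcome probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Outcome.CONNECTED;
        } catch (SocketTimeoutException e) {
            // Nothing came back within the limit, e.g. packets silently dropped.
            return Outcome.TIMED_OUT;
        } catch (ConnectException | NoRouteToHostException e) {
            // The network answered with a refusal or an unreachable notice.
            return Outcome.REJECTED;
        } catch (IOException e) {
            return Outcome.OTHER_FAILURE;
        }
    }

    public static void main(String[] args) {
        // Usage: java ConnectProbe <host> <port> <timeoutMillis>
        long start = System.nanoTime();
        Outcome outcome = probe(args[0], Integer.parseInt(args[1]), Integer.parseInt(args[2]));
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(outcome + " after " + elapsedMillis + " ms");
    }
}
```

A quick REJECTED would match what we saw within one subnet, while TIMED_OUT after the full timeout would match the cross-subnet case where nothing comes back.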