I am running mod_jk/1.2.26 on a front-end server talking to a group of Tomcats running on Solaris boxes at the back end. The other day one of the Solaris boxes froze, leaving the network connection to the switch up. mod_jk failed to notice that the Tomcats on this box were down and kept sending requests to them. We can reproduce this consistently: if we halt the box, mod_jk does not notice that the Tomcat is down. On one occasion we waited more than an hour, and requests were still being sent to the dead Tomcat. mod_jk has lots of options for setting timeouts, but none of them seems to cover this use case. On Linux, we can work around this by setting the mod_jk socket_timeout; however, this setting is not supported on Solaris. It would be nice to offer a workaround on this platform so that failure detection is useful.
Please provide your configuration.
workers.properties:

# default worker list
worker.list=word,jkstatus

# worker template
worker.template.port=9009
worker.template.type=ajp13
worker.template.lbfactor=1
worker.template.socket_keepalive=0
worker.template.connect_timeout=5000
worker.template.prepost_timeout=2000
worker.template.reply_timeout=40000
worker.template.connection_pool_size=1
worker.template.connection_pool_timeout=60

# workers definition
worker.tomcat1.reference=worker.template
worker.tomcat1.host=tomtom1.online.local

# load balancer definition
worker.word.type=lb
worker.word.max_reply_timeouts=3
worker.word.balance_workers=escappdev1

# status definition
worker.jkstatus.type=status
Sorry for the long silence. If you are still observing this, could you please update to 1.2.32? There have been many improvements. If the problem persists, please provide a JK log file. If you can reproduce the problem on a test system, a debug-level log file would be ideal; otherwise a log at info level should still be helpful. In addition, please run "netstat -an" on both the Apache and Tomcat servers once the problem occurs and provide the output. Thanks!
Closing this as WORKSFORME since I don't expect this issue to be possible with the releases from the last few years. Properly using cping/cpong and connect_timeout should prevent the behavior described in the original post.
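For anyone landing here with the same symptom, the following is a minimal sketch of what a CPing/CPong setup in workers.properties might look like. It assumes mod_jk 1.2.27 or later, where the ping_mode, ping_timeout and socket_connect_timeout directives are available. The timeout values are illustrative rather than recommendations, and this has not been verified against the reporter's Solaris setup.

```
# worker template (illustrative values only; tune for your environment)
worker.template.type=ajp13
worker.template.port=9009

# Upper bound in milliseconds on the TCP connect itself, so that a
# halted back-end host does not block the connect attempt indefinitely.
worker.template.socket_connect_timeout=5000

# Enable all CPing/CPong probe modes: after connect, before each
# request, and periodically on idle connections.
worker.template.ping_mode=A

# How long to wait for the CPong reply, in milliseconds.
worker.template.ping_timeout=5000
```

The connect_timeout and prepost_timeout entries in the reporter's template are the older per-probe settings for CPing after connect and before each request, so on a current release that configuration should already exercise the same mechanism.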