Bug 44290

Summary: mod_jk/1.2.26: retry is not useful for an important use case
Product: Tomcat Connectors
Reporter: Jergen Dutch <jergendutch>
Component: Common
Assignee: Tomcat Developers Mailing List <dev>
Status: NEEDINFO ---    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: other   

Description Jergen Dutch 2008-01-24 05:49:52 UTC
I am running mod_jk/1.2.26 on a front-end server talking to a group of Tomcats
running on Solaris boxes at the back end.

The other day one of the Solaris boxes froze, leaving the network connection to
the switch up. mod_jk failed to notice that the tomcats on this box were down
and kept sending requests.

We can reproduce this consistently: if we halt the box, mod_jk does not notice
that the tomcat is down. On one occasion we waited more than one hour, and still
requests were being sent to the dead tomcat.

mod_jk has lots of options for setting timeouts, but none of them seem to deal
with this use case.

On Linux, we can work around this by setting the mod_jk socket_timeout; however,
this setting is not supported on Solaris. It would be nice to offer a workaround
on this platform to make failure monitoring useful.
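
For illustration: when a host freezes while its link stays up, nothing sends a RST or FIN, so a connection attempt only fails once some timeout expires. Below is a minimal sketch of an external probe with an explicit connect timeout. The helper name backend_reachable is hypothetical and not part of mod_jk, and the probe only checks that a TCP connection can be opened, not that Tomcat answers AJP requests.

```python
import socket


def backend_reachable(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout_s.

    Hypothetical helper for illustration only; it does not speak AJP.
    """
    try:
        # create_connection applies timeout_s to the connect attempt itself,
        # so a silent (frozen) host produces a timeout instead of hanging.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures, and timeouts.
        return False
```

Such a probe could be run from the front-end server against each back-end host and AJP port, for example backend_reachable("tomtom1.online.local", 9009), to confirm from outside mod_jk whether the box still accepts connections.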
Comment 1 Rainer Jung 2008-01-24 06:05:23 UTC
Please provide your configuration.
Comment 2 Jergen Dutch 2008-01-24 06:27:48 UTC
workers.properties:

# default worker list
worker.list=word,jkstatus

# worker template
worker.template.port=9009
worker.template.type=ajp13
worker.template.lbfactor=1
worker.template.socket_keepalive=0
worker.template.connect_timeout=5000
worker.template.prepost_timeout=2000
worker.template.reply_timeout=40000
worker.template.connection_pool_size=1
worker.template.connection_pool_timeout=60

# workers definition
worker.tomcat1.reference=worker.template
worker.tomcat1.host=tomtom1.online.local

# load balancer definition
worker.word.type=lb
worker.word.max_reply_timeouts=3
worker.word.balance_workers=escappdev1

# status definition
worker.jkstatus.type=status
Comment 3 Rainer Jung 2011-10-25 18:26:00 UTC
Sorry for the long silence.

If you are still observing this, could you please update to 1.2.32? There have been many improvements since 1.2.26. If the problem persists, please provide a JK log file. If you can reproduce the problem on a test system, a debug log file would be ideal; otherwise a log at info level should still be helpful.

In addition: please issue "netstat -an" on the Apache and Tomcat servers once the problem happens and provide the output.

Thanks!
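
One possible way to collect the requested netstat data is sketched below. It runs "netstat -an" and keeps only the lines that mention the AJP port. The function names are hypothetical, the default port 9009 is taken from the posted workers.properties, and the assumption that Solaris prints addresses as host.port while Linux prints host:port should be checked against the actual output.

```python
import subprocess


def filter_connections(netstat_output: str, port: int) -> list[str]:
    """Keep only netstat lines where some address field ends with the given port."""
    # Assumption: Linux prints host:port, Solaris prints host.port.
    suffixes = (f":{port}", f".{port}")
    matches = []
    for line in netstat_output.splitlines():
        if any(field.endswith(suffixes) for field in line.split()):
            matches.append(line)
    return matches


def capture_ajp_connections(port: int = 9009) -> list[str]:
    """Run 'netstat -an' on the local machine and return lines for the AJP port."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return filter_connections(result.stdout, port)
```

Running capture_ajp_connections() on both the Apache and the Tomcat machine while the problem is occurring would give a compact list of connection states to attach to the report. The unfiltered output is still the safer thing to provide if in doubt.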