Bug 44290 - mod_jk/1.2.26: retry is not useful for an important use case
Status: RESOLVED WORKSFORME
Alias: None
Product: Tomcat Connectors
Classification: Unclassified
Component: Common
Version: unspecified
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
Reported: 2008-01-24 05:49 UTC by Jergen Dutch
Modified: 2014-12-22 18:20 UTC



Description Jergen Dutch 2008-01-24 05:49:52 UTC
I am running mod_jk/1.2.26 on a front-end server talking to a group of tomcats
running on Solaris boxes at the back end.

The other day one of the Solaris boxes froze, leaving the network connection to
the switch up. mod_jk failed to notice that the tomcats on this box were down
and kept sending requests.

We can reproduce this consistently: if we halt the box, mod_jk does not notice
that the tomcat is down. On one occasion we waited more than one hour, and still
requests were being sent to the dead tomcat.

mod_jk has lots of options for setting timeouts, but none of them seem to deal
with this use case.

On Linux, we can work around this by setting the mod_jk socket_timeout; however,
this setting is not supported on Solaris. It would be nice to offer a workaround
on this platform to make failure monitoring useful.
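
For illustration, the Linux workaround mentioned above might look like the following workers.properties line. This is a minimal sketch: the value of 10 is an assumed example, not taken from the affected setup, and socket_timeout is given in seconds.

# assumed example value; socket_timeout is in seconds
worker.template.socket_timeout=10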
Comment 1 Rainer Jung 2008-01-24 06:05:23 UTC
Please provide your configuration.
Comment 2 Jergen Dutch 2008-01-24 06:27:48 UTC
workers.properties:

# default worker list
worker.list=word,jkstatus

# worker template
worker.template.port=9009
worker.template.type=ajp13
worker.template.lbfactor=1
worker.template.socket_keepalive=0
worker.template.connect_timeout=5000
worker.template.prepost_timeout=2000
worker.template.reply_timeout=40000
worker.template.connection_pool_size=1
worker.template.connection_pool_timeout=60

# workers definition
worker.tomcat1.reference=worker.template
worker.tomcat1.host=tomtom1.online.local

# load balancer definition
worker.word.type=lb
worker.word.max_reply_timeouts=3
worker.word.balance_workers=escappdev1

# status definition
worker.jkstatus.type=status
Comment 3 Rainer Jung 2011-10-25 18:26:00 UTC
Sorry for the long silence.

If you are still observing this, could you please update to 1.2.32? There have been lots of improvements. If the problem persists, please provide a JK log file. If you can reproduce the problem on a test system, a debug log file would be nice; otherwise the info log level should still be helpful.

In addition: please issue "netstat -an" on the Apache and Tomcat servers once the problem happens and provide the output.

Thanks!
Comment 4 Rainer Jung 2014-12-22 18:20:27 UTC
Closing this as WORKSFORME since I don't expect this issue to be possible with the releases from the last few years. Properly using cping/cpong and connect_timeout should prevent the behavior described in the original post.
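
As a rough sketch of what such a setup could look like, the worker template might carry settings along these lines. All values are assumed examples and do not come from this report. ping_mode, ping_timeout, connection_ping_interval and socket_connect_timeout are available from 1.2.27 onwards, so they would not work with the 1.2.26 install described in the original post.

# assumed example values, not taken from the reporter's configuration
# A = cping/cpong probing after connect, before each request, and at intervals
worker.template.ping_mode=A
# milliseconds to wait for the cpong reply
worker.template.ping_timeout=10000
# seconds between probes on idle connections
worker.template.connection_ping_interval=100
# milliseconds to wait for the TCP connect itself to complete
worker.template.socket_connect_timeout=5000

The intent is that a frozen host, which still accepts packets at the switch but never answers, fails either the connect timeout or the cping/cpong probe, so the worker can be put into error state rather than continuing to receive requests.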