44290 – mod_jk/1.2.26: retry is not useful for an important use case

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Tomcat Connectors
Classification:	Unclassified
Component:	Common
Version:	unspecified
Hardware:	Other other

Importance:	P2 normal
Target Milestone:	---
Assignee:	Tomcat Developers Mailing List

Reported:	2008-01-24 05:49 UTC by Jergen Dutch
Modified:	2014-12-22 18:20 UTC
CC List:	0 users

Description Jergen Dutch 2008-01-24 05:49:52 UTC

I am running mod_jk/1.2.26 on a front-end server talking to a group of Tomcats
running on Solaris boxes at the back end.

The other day one of the Solaris boxes froze, leaving the network connection to
the switch up. mod_jk failed to notice that the Tomcats on this box were down
and kept sending requests to them.

We can reproduce this consistently: if we halt the box, mod_jk does not notice
that the Tomcat is down. On one occasion we waited more than an hour, and
requests were still being sent to the dead Tomcat.

mod_jk has lots of options for setting timeouts, but none of them seems to deal
with this use case.

On Linux, we can work around this by setting the mod_jk socket_timeout; however,
this setting is not supported on Solaris. It would be nice to offer a workaround
on this platform to make failure monitoring useful.
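
For illustration, the Linux workaround mentioned above might look roughly like the fragment below. It is only a sketch: it assumes a worker template named worker.template, and the value 10 is an arbitrary example, not a setting taken from this report. socket_timeout is specified in seconds.

```properties
# Sketch only: the value is illustrative and should be tuned per site.
# socket_timeout is given in seconds and applies to the socket between
# mod_jk and the backend.
worker.template.socket_timeout=10
```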

Comment 1 Rainer Jung 2008-01-24 06:05:23 UTC

Please provide your configuration.

Comment 2 Jergen Dutch 2008-01-24 06:27:48 UTC

workers.properties:

# default worker list
worker.list=word,jkstatus

# worker template
worker.template.port=9009
worker.template.type=ajp13
worker.template.lbfactor=1
worker.template.socket_keepalive=0
worker.template.connect_timeout=5000
worker.template.prepost_timeout=2000
worker.template.reply_timeout=40000
worker.template.connection_pool_size=1
worker.template.connection_pool_timeout=60

# workers definition
worker.tomcat1.reference=worker.template
worker.tomcat1.host=tomtom1.online.local

# load balancer definition
worker.word.type=lb
worker.word.max_reply_timeouts=3
worker.word.balance_workers=escappdev1

# status definition
worker.jkstatus.type=status

Comment 3 Rainer Jung 2011-10-25 18:26:00 UTC

Sorry for the long silence.

If you are still observing this, could you please update to 1.2.32? There have been lots of improvements. If the problem persists, please provide a JK log file. If you can reproduce the problem on a test system, a debug log file would be nice; otherwise the info log level should still be helpful.

In addition, please run "netstat -an" on the Apache and Tomcat servers once the problem happens and provide the output.

Thanks!

Comment 4 Rainer Jung 2014-12-22 18:20:27 UTC

Closing this as WORKSFORME since I don't expect this issue to be possible with the releases from the last few years. Properly using cping/cpong and connect_timeout should prevent the behavior described in the original post.
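
As a rough illustration of the cping/cpong setup referred to above, a worker template might be extended along the following lines. This is a sketch, not a verified fix for the reported setup: all values are illustrative assumptions, and ping_mode, ping_timeout, connection_ping_interval and socket_connect_timeout are only available in mod_jk 1.2.27 and later, so they would not apply to the 1.2.26 installation from the original report.

```properties
# Sketch only: values are illustrative and not taken from this report.
worker.template.type=ajp13
worker.template.port=9009

# Probe the backend with CPing/CPong: after connect (C), before each
# request (P) and periodically on idle connections (I). "A" enables all.
worker.template.ping_mode=A

# Milliseconds to wait for the CPong reply to a CPing probe.
worker.template.ping_timeout=5000

# Seconds a connection may stay idle before it is probed (interval mode).
worker.template.connection_ping_interval=60

# Milliseconds to wait for the TCP connect to the backend to complete.
worker.template.socket_connect_timeout=5000
```

With probing enabled, a backend that accepts no traffic should fail the CPing/CPong check within ping_timeout instead of holding requests until reply_timeout expires. The per-phase timeouts connect_timeout and prepost_timeout used in the configuration from Comment 2 can still be set individually if different limits are wanted for those two probes.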