Bug 37896 - FastAsyncSocketSender blocks all threads on socket error
Status: RESOLVED FIXED
Product: Tomcat 5
Classification: Unclassified
Component: Catalina:Cluster
Version: 5.5.12
Hardware: Other
OS: Other
Importance: P2 normal
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
 
Reported: 2005-12-14 03:54 UTC by Ted Leung
Modified: 2005-12-14 23:57 UTC

Description Ted Leung 2005-12-14 03:54:44 UTC
If one server fails "badly" (I believe resulting in a socket timeout error), the
FastAsyncSocketSender is locked by a thread, which causes a backlog on all
subsequent HTTP threads and eventually makes the entire machine run out of sockets.

Details below:

Default cluster settings:
<Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster" />

We have multiple web machines (6 of them). Something really bad happened at our
data center (not sure what; a cable fault, some dweeb tripped on our ethernet,
we don't quite know yet) causing one of our web servers to die.

The rest of the machines then backlogged trying to replicate to the dead
machine, which caused all the web servers to fill up their max threads, causing a
site outage.

We took stack traces at the point in time where we had to restart the Tomcat
process; what I believe to be the relevant stack traces are included below.

You can see one of the HTTP threads (143) is trying to replicate synchronously
(which I found odd while using fastasync, but okay). I believe this thread is stuck on
a 2 minute socket timeout and currently holds a lock on the FastAsyncSocketSender.

Notice the Cluster-MembershipReceiver thread is waiting for the FastAsyncSocketSender
object and currently holds a lock on the ReplicationTransmitter.

Notice HTTP thread (147) is waiting on the ReplicationTransmitter. As a result I
have about 298 other HTTP threads all waiting on the ReplicationTransmitter. I had
300 threads configured.

Now I realise that after a while the socket will time out and it will all work
itself out, but our site was stuck in this mode for over 10 minutes, so I think
this is a bug on the basis that one machine dying (albeit badly) shouldn't
cause all the other machines to backlog at all.
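
To make the chain of blocked monitors easier to follow, here is a minimal, self-contained sketch. This is not Tomcat code; the lock and thread names are simplified stand-ins for the objects in the traces below: one thread holds the sender lock while stuck in a blocking socket write, the membership thread holds the transmitter lock while waiting for the sender lock, and every other request thread queues up behind the transmitter lock.

// Illustrative sketch only: simplified stand-ins for the FastAsyncSocketSender
// and ReplicationTransmitter monitors, showing why one blocked write backs up
// all request threads.
public class LockChainDemo {

    static final Object senderLock = new Object();       // ~ FastAsyncSocketSender monitor
    static final Object transmitterLock = new Object();  // ~ ReplicationTransmitter monitor

    public static void main(String[] args) {
        // Thread A (~ http-80-Processor143): holds the sender lock while the
        // socket write blocks (simulated with a long sleep).
        Thread writer = new Thread(() -> {
            synchronized (senderLock) {
                System.out.println("writer: holding sender lock, write is blocked...");
                sleep(10_000); // stands in for a socket write that only returns on timeout
            }
        }, "writer");

        // Thread B (~ Cluster-MembershipReceiver): takes the transmitter lock,
        // then needs the sender lock to disconnect the dead member.
        Thread membership = new Thread(() -> {
            synchronized (transmitterLock) {
                System.out.println("membership: holding transmitter lock, waiting for sender lock");
                synchronized (senderLock) {
                    System.out.println("membership: finally disconnected the dead member");
                }
            }
        }, "membership");

        // Threads C..N (~ http-80-Processor147 and friends): each request needs
        // the transmitter lock and piles up behind thread B.
        Runnable request = () -> {
            synchronized (transmitterLock) {
                System.out.println(Thread.currentThread().getName() + ": got transmitter lock");
            }
        };

        writer.start();
        sleep(100);
        membership.start();
        sleep(100);
        for (int i = 0; i < 5; i++) {
            new Thread(request, "request-" + i).start();
        }
        // Nothing below the writer makes progress until its "write" returns.
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

Note this is not a true deadlock: everything clears once the blocked write finally fails, but for the duration of the timeout every request thread that needs to replicate is queued behind a single stuck socket.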

------------

"http-80-Processor143" daemon prio=1 tid=0x084ad748 nid=0x6953 runnable
[0x7e7bf000..0x7e7bf63c]
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
	at org.apache.catalina.cluster.tcp.DataSender.writeData(DataSender.java:830)
	at org.apache.catalina.cluster.tcp.DataSender.pushMessage(DataSender.java:772)
	at org.apache.catalina.cluster.tcp.DataSender.sendMessage(DataSender.java:598)
	- locked <0x4e7864f8> (a org.apache.catalina.cluster.tcp.FastAsyncSocketSender)
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageData(ReplicationTransmitter.java:868)
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageClusterDomain(ReplicationTransmitter.java:460)
	at
org.apache.catalina.cluster.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:1017)
	at
org.apache.catalina.cluster.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:333)
	at
org.apache.catalina.cluster.tcp.ReplicationValve.invoke(ReplicationValve.java:271)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
	at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
	at
org.apache.catalina.valves.FastCommonAccessLogValve.invoke(FastCommonAccessLogValve.java:495)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:868)
	at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:663)
	at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
	at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
	at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
	at java.lang.Thread.run(Thread.java:595)



"Cluster-MembershipReceiver" daemon prio=1 tid=0x78804ad8 nid=0x661c waiting for
monitor entry [0x786ff000..0x786ff73c]
	at org.apache.catalina.cluster.tcp.DataSender.disconnect(DataSender.java:560)
	- waiting to lock <0x4e7864f8> (a
org.apache.catalina.cluster.tcp.FastAsyncSocketSender)
	at
org.apache.catalina.cluster.tcp.FastAsyncSocketSender.disconnect(FastAsyncSocketSender.java:295)
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.remove(ReplicationTransmitter.java:689)
	- locked <0x4e7a4e68> (a org.apache.catalina.cluster.tcp.ReplicationTransmitter)
	at
org.apache.catalina.cluster.tcp.SimpleTcpCluster.memberDisappeared(SimpleTcpCluster.java:1124)
	at
org.apache.catalina.cluster.mcast.McastService.memberDisappeared(McastService.java:455)
	at
org.apache.catalina.cluster.mcast.McastServiceImpl.receive(McastServiceImpl.java:221)
	at
org.apache.catalina.cluster.mcast.McastServiceImpl$ReceiverThread.run(McastServiceImpl.java:253)



"http-80-Processor147" daemon prio=1 tid=0x084b1208 nid=0x6957 waiting for
monitor entry [0x7e8bf000..0x7e8bf83c]
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.addStats(ReplicationTransmitter.java:702)
	- waiting to lock <0x4e7a4e68> (a
org.apache.catalina.cluster.tcp.ReplicationTransmitter)
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageData(ReplicationTransmitter.java:870)
	at
org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageClusterDomain(ReplicationTransmitter.java:460)
	at
org.apache.catalina.cluster.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:1017)
	at
org.apache.catalina.cluster.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:333)
	at
org.apache.catalina.cluster.tcp.ReplicationValve.invoke(ReplicationValve.java:271)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
	at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
	at
org.apache.catalina.valves.FastCommonAccessLogValve.invoke(FastCommonAccessLogValve.java:495)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:868)
	at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:663)
	at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
	at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
	at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
	at java.lang.Thread.run(Thread.java:595)
Comment 1 Peter Rossbach 2005-12-14 08:32:12 UTC
Please update to the current svn head.
I have fixed a DataSender refactoring bug; it meant that the Tomcat 5.5.12 cluster
sends all messages thread-synchronized over only one sender socket.
Another bug is that the membership message format has changed (see Bug 37808).

Can you please test with the newest cluster code?

I now reference your report in the changelog.
I am not sure that your scenario doesn't show another problem!

Is it true that all your nodes are inside the same cluster replication domain?

Sorry for the trouble, and thanks for analysing and reporting the bug.
Peter
Comment 2 Ted Leung 2005-12-14 23:32:22 UTC
I'll try to get a copy of the new code. I'm not so sure I can test it, though, as
the circumstances required to reproduce this problem are quite particular and I can't
really take down our live site to test it. I'll at least read through it to see
if the blocking scenario can take place.

As for the nodes, yes, I think so, if I understand your question properly: we only
have one node name, one domain name, and one context, which is distributable.
Comment 3 Ted Leung 2005-12-15 00:25:14 UTC
Okay, I had a look at the new code and I'm pretty sure the problem will still
manifest itself.

The problem isn't with the code per se; I think it's a problem with the design
of the replication. Since the replication is in-line with the HTTP request
(it seems that new-session requests are inline even in async mode), if something
bad happens to the replication socket that causes it to hang, it will block
that thread (presumably until SO_TIMEOUT, which is usually 2 minutes).

If a subsequent request comes in for a new session, it will try to do the same
thing, but it needs to acquire a lock on the DataSender object via sendMessage();
therefore, this thread will now block until the previous thread times out. As a
result a lot of threads may back up while waiting for the original socket to time out.

Note that this could happen even in pooled mode: if all 300 threads had their
own 300 replicator sockets, and each sat there waiting to time out for 2 minutes,
then there would be no more threads available (assuming I'm maxed out at 300 threads).

The only thing I can really think of to resolve this is to have the replication
on a separate thread from the HTTP requests.

Of course maybe I'm just misreading the code and everything I'm saying might be
wrong...
Comment 4 Rainer Jung 2005-12-15 01:23:58 UTC
Fast async has been designed to be used asynchronously. With such a
configuration, no TCP communication for replication is done during the
request-response lifecycle. Instead, all replication messages are only put into
a local queue. Separate threads pick up these messages and send them. The
queue is locked while taking out the messages, but the lock is freed
before actually trying to send a message.
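
As a rough illustration of the design described above (and of the separate-thread approach Ted suggests in comment 3), here is a minimal, hypothetical sketch. It is not the Tomcat implementation; the class and method names are made up.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a queue-based async sender: request threads only
// enqueue, a dedicated sender thread does the (potentially blocking) I/O.
public class AsyncQueueSenderSketch {

    private final Deque<byte[]> queue = new ArrayDeque<>();

    /** Called from the request thread: cheap, never touches the network. */
    public void enqueue(byte[] message) {
        synchronized (queue) {
            queue.addLast(message);
            queue.notify(); // wake the sender thread
        }
    }

    /** Runs on a dedicated sender thread, outside the request lifecycle. */
    public void senderLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] message;
            synchronized (queue) {
                while (queue.isEmpty()) {
                    queue.wait();
                }
                message = queue.removeFirst();
            } // lock released here, before the slow network write
            transmit(message);
        }
    }

    /** Stand-in for the actual socket write; may block or time out. */
    private void transmit(byte[] message) {
        // A hung socket here delays only this sender thread, not the
        // request threads, which keep enqueueing (the queue just grows).
        System.out.println("sending " + message.length + " bytes");
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncQueueSenderSketch sender = new AsyncQueueSenderSketch();
        Thread t = new Thread(() -> {
            try {
                sender.senderLoop();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "replication-sender");
        t.setDaemon(true);
        t.start();
        sender.enqueue("session-delta-1".getBytes());
        sender.enqueue("session-delta-2".getBytes());
        Thread.sleep(200); // let the sender thread drain the queue
    }
}

The key property is that the queue's monitor is held only long enough to add or remove an entry; the actual send happens with no lock held, so a blocked socket cannot back up the request threads the way the traces above show.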
Comment 5 Peter Rossbach 2005-12-15 08:57:49 UTC
Hmm,

if you don't want to update completely to 5.5.15 (svn head), you can
compile the cluster module and copy the resulting catalina-cluster.jar into the
5.5.12 release.

The default cluster configuration normally uses fastasyncqueue; the problem was
that the subclass did not implement the right method. That is now fixed and everything
is working well.

Peter
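
Peter's remark about the subclass not implementing the right method can be pictured roughly like this. The sketch below is purely illustrative (hypothetical class and method names, not the actual Tomcat source): if the asynchronous subclass fails to override the base class's synchronous send path, every caller silently falls through to the blocking, monitor-holding implementation.

// Purely illustrative: how a missing override can turn an "async" sender
// back into a synchronous, lock-holding one.
class BaseSender {
    // Base implementation: synchronous, holds the object's monitor while
    // writing to the socket (stands in for the synchronous send path).
    public synchronized void sendMessage(byte[] data) {
        blockingSocketWrite(data);
    }

    protected void blockingSocketWrite(byte[] data) {
        System.out.println("blocking write of " + data.length + " bytes");
    }
}

class AsyncSender extends BaseSender {
    private final java.util.ArrayDeque<byte[]> queue = new java.util.ArrayDeque<>();

    // Intended behaviour: enqueue and return immediately.
    // If this override is accidentally missing (e.g. because a refactoring
    // changed the base class method it was supposed to override), callers get
    // BaseSender.sendMessage() instead and every send becomes synchronous again.
    @Override
    public void sendMessage(byte[] data) {
        synchronized (queue) {
            queue.addLast(data);
        }
        System.out.println("queued " + data.length + " bytes for the sender thread");
    }
}

public class MissingOverrideSketch {
    public static void main(String[] args) {
        BaseSender sender = new AsyncSender();
        sender.sendMessage("session-delta".getBytes()); // async path, as intended
    }
}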