Bug 36540 - pooled cluster replication does not seem to ensure synchronized replication in tomcat 5.5.11
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 5
Classification: Unclassified
Component: Catalina:Cluster (show other bugs)
Version: 5.5.11
Hardware: Other / Other
Importance: P2 enhancement
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Duplicates: 36542
Depends on:
Blocks:
 
Reported: 2005-09-07 12:40 UTC by Christoph Bachhuber-Haller
Modified: 2007-08-14 01:23 UTC (History)
0 users



Attachments
the test case and config files (13.85 KB, application/octet-stream)
2005-09-07 12:42 UTC, Christoph Bachhuber-Haller

Description Christoph Bachhuber-Haller 2005-09-07 12:40:44 UTC
If I understand it correctly, pooled replication should ensure that the session
is available on all cluster nodes as soon as the request finishes. It does not
seem to do that in the following test case:

The attached simple webapp (clustertest.war) just prints the content of the
session entry "host" and then stores its hostname in that entry. 
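The page's per-request logic can be sketched in plain Java (a minimal sketch: the names ClusterTestSketch and handleRequest are illustrative, and a plain Map stands in for the HttpSession):

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterTestSketch {
    // What the clustertest.war page does per request: read the "host"
    // session attribute, then overwrite it with the local hostname.
    static String handleRequest(Map<String, Object> session, String localHost) {
        String previous = (String) session.get("host"); // value left by the last node
        session.put("host", localHost);                 // record who served this request
        return previous;                                // this is what the page prints
    }

    public static void main(String[] args) {
        Map<String, Object> session = new HashMap<>();
        // With perfect replication and strict round-robin balancing:
        System.out.println(handleRequest(session, "tomcat1")); // null
        System.out.println(handleRequest(session, "tomcat2")); // tomcat1
        System.out.println(handleRequest(session, "tomcat1")); // tomcat2
    }
}
```

If replication lags behind the request, the second node reads a stale session and the printed values stop alternating, which is exactly the reported symptom.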

This webapp is installed on two Tomcat 5.5.11 instances, tomcat1 and tomcat2,
and mod_jk is used as a load balancer with sticky sessions disabled (config
files are attached).

A JMeter 2.1 Test Plan is used to make requests. It repeatedly makes requests to
the webapp, waiting 20ms between the requests.

Expected behaviour:
The first request yields null, and subsequent ones return tomcat1, then tomcat2,
tomcat1, tomcat2 and so on. 

Actual behaviour:
Several requests return null, and the rest do not alternate exactly.

In this test case, the behaviour stabilized when waiting about 50 ms between
requests, but with other webapps that have larger sessions, it needed far more
time. This is quite serious, since correctness is not ensured when sticky
sessions are not used.
Comment 1 Christoph Bachhuber-Haller 2005-09-07 12:42:15 UTC
Created attachment 16329 [details]
the test case and config files
Comment 2 Remy Maucherat 2005-09-07 12:48:54 UTC
How about not filing a bug? It should be evident that sticky sessions are
mandatory, or you're going to run into problems.
Comment 3 Remy Maucherat 2005-09-07 13:21:42 UTC
*** Bug 36542 has been marked as a duplicate of this bug. ***
Comment 4 Christoph Bachhuber-Haller 2005-09-09 10:48:34 UTC
I did some further testing, and my test case works as expected with Tomcat
5.5.9 without the cluster fix pack (bug 34389), so you have a regression there,
which should be fixed. In any case, this bug isn't INVALID; it's either a bug
or a WONTFIX, so I am reopening it. You should really fix this, since many load
balancers don't support Tomcat's sticky sessions.
Comment 5 Peter Rossbach 2005-09-18 10:36:37 UTC
As Remy told you, sticky sessions are mandatory for
Tomcat clustering to work. Clustering is a fallback mechanism.

Peter
Comment 6 Filip Hanik 2005-09-19 23:57:48 UTC
I don't agree with that at all. Sticky sessions are a great benefit, but pooled
synchronized clustering should still work.

at least it was in the good ol' days :)

Peter, do you know what changed in the "cluster fix pack" that made the synch
not work properly anymore?

Filip
Comment 7 Remy Maucherat 2005-09-20 00:11:37 UTC
(In reply to comment #6)
> I don't agree with that at all. Sticky sessions are a great benefit, but pooled
> synchronized clustering should still work.
> 
> at least it was in the good ol' days :)

No, it never was: the request might be complete from an HTTP standpoint while
the servlet is still running (i.e., the client might send the next request
while the replication hasn't been done yet). Besides, the spec requires that
all concurrent requests belonging to the same session be processed by the same
host.

Non-sticky sessions are broken in many cases, period. I'll let you close this
as INVALID again.
Comment 8 Filip Hanik 2005-09-20 00:26:05 UTC
>No, it never was: the request might be complete from a HTTP standpoint while the
>servlet is still running (= the client might send the next request, while the
>replication hasn't been done yet).

Yes, this scenario has never been supported; that is correct.
But single-thread client synchronization has always worked.

Christoph, if the scenario that you have is a single client thread per session,
then let us know, otherwise we will close this bug.
Comment 9 Remy Maucherat 2005-09-20 00:35:42 UTC
(In reply to comment #8)
> Yes, this scenario has never been supported; that is correct.
> But single-thread client synchronization has always worked.

Yes, many webapps would work very well with this mode, while some others would not.

For this situation, taking several ms to replicate doesn't seem particularly
broken to me, although I suppose the 5.5.10 changelog is quite long.
Comment 10 Christoph Bachhuber-Haller 2005-09-20 19:36:47 UTC
(In reply to comment #9)
> Christoph, if the scenario that you have is a single client thread per session,
> then let us know, otherwise we will close this bug.

My test case above is actually quite simple. A single JMeter thread runs
consecutive requests against a cluster of two Tomcat servers. A jk load
balancer without sticky sessions (for testing only) does the distribution. The
JSP page just stores the hostname of the Tomcat server in the session. No fancy
multi-threading or anything. TC 5.5.9 handles the situation correctly and
replicates the session before finishing the HTTP request; 5.5.11 does not. So
pooled replication is definitely broken.

Please either fix it or document that it's broken. If you choose not to fix it,
the pooled, synchronous, and asynchronous replication modes are basically
useless and should be removed.

Thanks for pointing out that the Servlet spec (SRV.7.7.2) states that "Within
an application marked as distributable, all requests that are part of a session
must be handled by one Java Virtual Machine (JVM) at a time." But in my opinion
that means that when a request finishes, the session state should already be
replicated on all other cluster members. A session holding a single 7-character
String seems to need about 20 ms to replicate in my setup. Busy servers with
large sessions may get into some hundreds of ms, and this is a problem for me.

Bye,
Christoph
Comment 11 Filip Hanik 2005-09-21 13:42:21 UTC
It's been a while since I ran my test cases, but I will do it again.
Comment 12 Peter Rossbach 2005-09-23 17:37:40 UTC
The difference between 5.5.9 and 5.5.11 is that waitForAck now defaults to
false for all sender modes.
Can you check with this config:
            <Sender
                className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
                replicationMode="pooled"
                waitForAck="true"
                doTransmitterProcessingStats="true"
                doProcessingStats="true"
                doWaitAckStats="true"
                ackTimeout="15000"/>

thanks
Peter
Comment 13 Filip Hanik 2005-09-30 23:56:45 UTC
I would argue that ackTimeout=15000 should be an indicator that waitForAck=true;
do you really need two flags to say the same thing?
To simplify the implementation, I would use the following logic and remove the
"waitForAck" flag altogether.

ackTimeout > 0 - wait for ack true, and time out set
ackTimeout = 0 - wait for ack false
ackTimeout = -1 - wait for ack true, no timeout
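The proposed collapse of the two settings into one could be sketched like this (AckPolicy is a hypothetical helper illustrating the rule, not a Tomcat class):

```java
// Derive both "wait for ack" and the timeout from a single ackTimeout value,
// following the three cases listed above.
public class AckPolicy {
    final boolean waitForAck;
    final long timeoutMillis; // 0 means "wait with no timeout" when waitForAck is true

    AckPolicy(long ackTimeout) {
        if (ackTimeout > 0) {          // wait for ack, with a bounded timeout
            waitForAck = true;
            timeoutMillis = ackTimeout;
        } else if (ackTimeout == 0) {  // fire-and-forget: no ack expected
            waitForAck = false;
            timeoutMillis = 0;
        } else {                       // -1: wait for ack without any timeout
            waitForAck = true;
            timeoutMillis = 0;
        }
    }
}
```

One value then answers both questions, so a config like ackTimeout="15000" can never contradict a separate waitForAck="false".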
Comment 14 Christoph Bachhuber-Haller 2005-10-04 14:13:33 UTC
(In reply to comment #12)
> The difference between 5.5.9 and 5.5.11 is that waitForAck now defaults to
> false for all sender modes.
> Can you check with this config:
<snip>

The waitForAck setting was indeed my problem. Thank you very much for pointing it out.

But now I'm a bit confused about what the difference between pooled and
fastasyncqueue cluster replication really is. The docs state "synchronous
replication guarantees the session to be replicated before the request returns."
But obviously waitForAck is the controlling setting. Wouldn't it be best to just
ignore the waitForAck setting and have it fixed to false for fastasyncqueue and
true for pooled?

The docs also state that "Asynchronous replication should be used if you have
sticky sessions until fail over", which implies that pooled should be used
otherwise. But as this bug report has shown, sticky sessions are mandatory for
correct clustering. Maybe you should change the docs accordingly (this should
go into another bug report IMO, but my bug 36542 was resolved as a duplicate).

Thanks, 
Christoph

Comment 15 Rainer Jung 2005-10-04 21:01:50 UTC
synchronous: send each session change to the other cluster members before
returning the response to the client.

asynchronous: same as synchronous, but use multiple sender connections (use any
one that is not currently busy).

fastasync: put the session-change message into a local queue and then return
the response to the client. A separate thread waits for messages arriving in
the queue and sends them to the other cluster members.

waitForAck: when sending a message, wait for an ACK-type answer from the other
cluster members before proceeding (makes sending the messages more reliable).

If one needs exact synchronization: synchronous or pooled mode with waitForAck.
The application gets into trouble when replication gets stuck.

If one can live with some latency between changes on the primary node and their
replication to the other nodes, and wants the cluster to affect application
performance and stability as little as possible: use session stickiness in the
load balancer combined with fastasync and no waitForAck.

synchronous/pooled without waitForAck: lower latency for replication, although
synchronization is not exact.

fastasync with waitForAck: decouples replication from the request/response
cycle while still ensuring that replication is checked for success.

You are right, we should make the docs more precise. The features are still
very new, and as usual documentation takes a while.
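The fastasync idea described above can be sketched in a few lines of Java (a minimal sketch with illustrative names, not Tomcat's actual classes): the request thread only enqueues the session-change message, and a background sender thread drains the queue.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class FastAsyncSketch {
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> delivered = new CopyOnWriteArrayList<>();

    // Called on the request thread: returns immediately after enqueueing,
    // so the response can go back to the client without waiting for the send.
    void replicate(String sessionChange) {
        queue.add(sessionChange);
    }

    // The background sender loop: takes messages off the queue one by one.
    Thread startSender() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    delivered.add(queue.take()); // stands in for the network send
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The decoupling is also the trade-off: the client may see its response before the change has reached the other nodes, which is why this mode pairs with session stickiness.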
Comment 16 Yoav Shapira 2005-12-02 16:55:50 UTC
Please (anyone involved in this issue) submit the doc enhancements you'd like to
see.  I'll be glad to quickly look at them and commit them for the next release.
Comment 17 Mark Thomas 2006-10-05 14:56:21 UTC
Just marking as an enhancement.
Comment 18 Peter Rossbach 2007-08-14 01:23:52 UTC
Discussion showed this is a user-list question, not a bug.