[QPID-4201] Destination cluster de-sync when federation link used for a longer time - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.18
Fix Version/s: 0.20
Component/s: C++ Clustering
Labels: None

Description

(see also https://bugzilla.redhat.com/show_bug.cgi?id=836141)

Description of problem:
Using queue state replication from a broker (possibly clustered; this does not matter) to a cluster of brokers causes the destination cluster to de-sync after a long time:

2012-06-28 08:28:30 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)

Version-Release number of selected component (if applicable):
Every version checked

How reproducible:
Depends on how long the test runs; roughly 10% for the default scenario

Steps to Reproduce:
(Ideally, if possible, rebuild qpid with cpp/src/qpid/SessionState.cpp changed to: static const uint32_t SPONTANEOUS_REQUEST_INTERVAL = 64. This speeds up the reproducer very significantly.)

1) Have a source broker (or cluster, this does not matter) and a dest.cluster, with queue state replication of just one queue from the source to dest.cluster.
2) On the federation route, set --ack to a low number (to speed up replication; I used --ack 5).
3) Randomly produce and consume messages on the replicated queue on the src.broker, alternating enqueues and dequeues as much as possible. I don't know why, but more alternation also speeds up the reproducer.
4) Now, be patient. Each time the backup cluster has sent SPONTANEOUS_REQUEST_INTERVAL (by default 64k) synchronization messages, which requires roughly 100 times as many messages to be enqueued and dequeued on the replicated queue, there is some probability of hitting the bug. It was hit once on the first attempt (after 2^16 = 64k such synchronization messages) and once after 720896 messages (in the 11th "round" / "trial").
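
The round arithmetic in step 4 can be checked with a short sketch. The helper names are hypothetical, and the factor of 100 is only the rough estimate given in step 4, not a measured constant.

```python
DEFAULT_INTERVAL = 2 ** 16  # 65536, the "64k" default quoted in step 4
PATCHED_INTERVAL = 64       # value suggested above for a rebuilt qpid
OPS_PER_SYNC = 100          # "around 100 times": rough estimate from the report


def completed_rounds(sync_messages, interval=DEFAULT_INTERVAL):
    """Return (full rounds, leftover messages) for a count of sync messages."""
    return divmod(sync_messages, interval)


def approx_queue_ops(sync_messages, ops_per_sync=OPS_PER_SYNC):
    """Very rough number of enqueues/dequeues needed on the replicated queue."""
    return sync_messages * ops_per_sync


if __name__ == "__main__":
    print(completed_rounds(720896))            # second observed failure
    print(approx_queue_ops(DEFAULT_INTERVAL))  # queue operations per default round
    print(approx_queue_ops(PATCHED_INTERVAL))  # queue operations per patched round
```

Under these assumptions, the failure after 720896 messages falls exactly on the boundary of the 11th round, and the patched interval shrinks one round from millions of queue operations to a few thousand.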

Actual results:
All brokers in dst.cluster - except the one that has the fed.link established - shut down with log:

2012-06-27 15:39:46 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: @QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)
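
To compare occurrences of this error, a minimal sketch that extracts the two counters from the log line may help. The function name is hypothetical, and reading "(N+0)" as a command id plus an offset is an assumption; the regular expression only matches the message text quoted in this report.

```python
import re

# Matches the SessionState error text quoted in this report.
# Assumption: "(N+0)" is a command id followed by an offset.
ERROR_RE = re.compile(
    r"confirmed < \((\d+)\+(\d+)\) but only sent < \((\d+)\+(\d+)\)"
)


def parse_desync(line):
    """Return (confirmed, sent, gap) for a matching log line, else None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    confirmed = int(match.group(1))
    sent = int(match.group(3))
    return confirmed, sent, confirmed - sent


# The two log lines quoted in this report, verbatim.
LOG_LINES = [
    "2012-06-28 08:28:30 critical Error delivering frames: local error did not "
    "occur on all cluster members : invalid-argument: "
    "@QPID.77153a41-7531-47f6-bf55-b30ffed69922: confirmed < (4799+0) but only "
    "sent < (4797+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)",
    "2012-06-27 15:39:46 critical Error delivering frames: local error did not "
    "occur on all cluster members : invalid-argument: "
    "@QPID.314e73e8-8bc3-4f5a-b77d-6bdd4ee17e39: confirmed < (720895+0) but only "
    "sent < (720893+0) (qpid/SessionState.cpp:154) (qpid/cluster/ErrorCheck.cpp:89)",
]

if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_desync(line))
```

In both quoted log lines the confirmed counter is ahead of the sent counter by the same amount.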

Expected results:
No such cluster de-sync

Additional info:

Interesting fact: I was able to reproduce it only when using queue state replication. Although the bug is in the federation link session, using a fed.link without queue state replication did not lead to the bug.

The difference comes from the beginning of the session communication. According to some traces, the following AMQP commands sent from dst.cluster to the source are not replayed by (and not even multicast to) the other dst.cluster brokers, which have the session / connection as shadow rather than local. So these messages are not replayed:

2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 0: {MessageSubscribeBody: queue=replication-queue; destination=replication-exchange; accept-mode=0; acquire-mode=0; resume-id resume-ttl=0; arguments={qpid.sync_frequency:F4:int32(100)}; }
2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 1: {MessageFlowBody: destination=replication-exchange; unit=0; value=4294967295; }
2012-06-27 07:12:09 trace @QPID.2d7fe3c3-b0de-4f36-a028-23ffaed6e9a5: sent cmd 2: {MessageFlowBody: destination=replication-exchange; unit=1; value=4294967295; }
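
Why commands that are not replayed on shadow brokers could produce a "local error did not occur on all cluster members" failure can be illustrated with a toy model. This is a sketch under stated assumptions, not the qpid implementation: the class, the counters, and the check are hypothetical stand-ins for the "confirmed ... but only sent ..." test reported from SessionState.cpp, and the sketch does not explain why the failure appears only after many synchronization messages.

```python
class ToySession:
    """Hypothetical stand-in for one broker's view of the federation session."""

    def __init__(self, name):
        self.name = name
        self.sent = 0  # commands this broker believes the session has sent

    def record_sent(self, count=1):
        self.sent += count

    def confirm(self, confirmed):
        """The peer confirms every command below `confirmed`."""
        if confirmed > self.sent:
            # Mirrors the wording of the error in this report.
            raise ValueError(
                f"{self.name}: confirmed < ({confirmed}+0) "
                f"but only sent < ({self.sent}+0)"
            )


def brokers_failing(sessions, confirmed):
    """Apply the same confirmation to every broker; return the names that fail."""
    failed = []
    for session in sessions:
        try:
            session.confirm(confirmed)
        except ValueError:
            failed.append(session.name)
    return failed


local = ToySession("local")
shadow_a = ToySession("shadow-a")
shadow_b = ToySession("shadow-b")

# The broker holding the link records the three startup commands (cmd 0, 1, 2).
local.record_sent(3)
# Assumption: the shadow brokers never see them, so their counters stay at 0.

failed = brokers_failing([local, shadow_a, shadow_b], confirmed=3)

if __name__ == "__main__":
    print(failed)
```

In this toy model only the shadow brokers raise the error, which matches the observation that every dst.cluster broker except the one holding the fed.link shuts down.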

Comment 1 Pavel Moravec

People

Assignee: Alan Conway

Reporter: Alan Conway

Votes: 0

Watchers: 1

Dates

Created: 09/Aug/12 14:48

Updated: 06/Sep/16 16:35

Resolved: 17/Jan/13 16:32