QPID-5942: qpid HA cluster may end up in joining state after HA primary is killed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28
    • Fix Version/s: 0.30
    • Component/s: C++ Clustering
    • Labels: None

    Description

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=1117823

      Description of problem:

      qpid HA cluster may end up in joining state after the HA primary is killed.

      Test scenario:
      We have a 3-node qpid HA cluster, with all three nodes operational.
      A sender is then run, sending to a queue (purely transactional, with durable messages and a durable queue address); a rough sketch of such a sender is given after the status output below.
      While it is sending, the primary broker is killed multiple times.
      After the N-th kill of the primary, the cluster is no longer functional, as all qpid brokers end up in the joining state:

      [root@dhcp-lab-216 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining
      [root@dhcp-x-216 ~]# clustat
      Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
      Member Status: Quorate

      Member Name                    ID   Status
      ------ ----                    ---- ------
      192.168.6.60                   1    Online, Local, rgmanager
      192.168.6.61                   2    Online, rgmanager
      192.168.6.62                   3    Online, rgmanager

      Service Name                   Owner (Last)          State
      ------- ----                   ----- ------          -----
      service:qpidd_1                192.168.6.60          started
      service:qpidd_2                192.168.6.61          started
      service:qpidd_3                192.168.6.62          started
      service:qpidd_primary          (192.168.6.62)        stopped

      [root@dhcp-x-165 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining

      [root@dhcp-x-218 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining
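
      For illustration only, a minimal Python sender along the lines of the test scenario might look like the sketch below. It uses the python-qpid (qpid.messaging) client from the installed packages; the broker addresses, queue name and message count are assumptions, not taken from the actual test.

      from qpid.messaging import Connection, Message

      # Assumed failover list matching the three cluster nodes above.
      BROKERS = ["192.168.6.60:5672", "192.168.6.61:5672", "192.168.6.62:5672"]
      # Hypothetical durable queue address.
      ADDRESS = "test_queue; {create: always, node: {durable: True}}"

      conn = Connection.establish(BROKERS[0], reconnect=True, reconnect_urls=BROKERS)
      try:
          session = conn.session(transactional=True)
          sender = session.sender(ADDRESS)
          for i in range(100000):
              sender.send(Message("msg-%d" % i, durable=True))
              session.commit()          # one durable message per transaction
      finally:
          conn.close()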

      I believe the key to hitting the issue is to kill the newly promoted primary soon after it starts appearing in the starting/started state in clustat.

      My current understanding is that with a 3-node cluster, any failure applied to a single node at a time should be handled by HA. This is what the testing scenario does:
      A B C (nodes)
      pri bck bck
      kill
      bck pri bck
      kill
      bck bck pri
      kill
      ...
      pri bck bck
      kill
      bck bck bck

      It looks to me that there is a short window during the promotion of a new primary in which killing that freshly promoted primary causes the promotion procedure to get stuck with all brokers in joining.

      I haven't seen such behavior in the past; either we are now more sensitive to this case (after the -STOP case fixes), or turning durability on sharply raises the probability.
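
      To make the timing above concrete, a failure-injection driver could look roughly like the following sketch. The node list, the ssh-based kill and the exact set of qpid-ha status strings are assumptions; the real test harness may differ.

      import subprocess
      import time

      NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]   # assumed cluster nodes

      def ha_status(node):
          """Ask the broker on 'node' for its own HA status string."""
          out = subprocess.check_output(["qpid-ha", "status", "-b", "%s:5672" % node])
          return out.strip().decode()

      def find_new_primary():
          """Treat any broker that is no longer a plain backup as the newly promoted primary."""
          for node in NODES:
              try:
                  if ha_status(node) not in ("joining", "catchup", "ready"):
                      return node
              except subprocess.CalledProcessError:
                  pass                    # broker on this node is down, skip it
          return None

      while True:
          primary = find_new_primary()
          if primary:
              # Kill the primary shortly after it has been promoted.
              subprocess.call(["ssh", primary, "kill -9 $(pidof qpidd)"])
          time.sleep(1)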

      Version-Release number of selected component (if applicable):

      1. rpm -qa | grep qpid | sort
        perl-qpid-0.22-13.el6.i686
        perl-qpid-debuginfo-0.22-13.el6.i686
        python-qpid-0.22-15.el6.noarch
        python-qpid-proton-doc-0.5-9.el6.noarch
        python-qpid-qmf-0.22-33.el6.i686
        qpid-cpp-client-0.22-42.el6.i686
        qpid-cpp-client-devel-0.22-42.el6.i686
        qpid-cpp-client-devel-docs-0.22-42.el6.noarch
        qpid-cpp-client-rdma-0.22-42.el6.i686
        qpid-cpp-debuginfo-0.22-42.el6.i686
        qpid-cpp-server-0.22-42.el6.i686
        qpid-cpp-server-devel-0.22-42.el6.i686
        qpid-cpp-server-ha-0.22-42.el6.i686
        qpid-cpp-server-linearstore-0.22-42.el6.i686
        qpid-cpp-server-rdma-0.22-42.el6.i686
        qpid-cpp-server-xml-0.22-42.el6.i686
        qpid-java-client-0.22-6.el6.noarch
        qpid-java-common-0.22-6.el6.noarch
        qpid-java-example-0.22-6.el6.noarch
        qpid-jca-0.22-2.el6.noarch
        qpid-jca-xarecovery-0.22-2.el6.noarch
        qpid-jca-zip-0.22-2.el6.noarch
        qpid-proton-c-0.7-2.el6.i686
        qpid-proton-c-devel-0.7-2.el6.i686
        qpid-proton-c-devel-doc-0.5-9.el6.noarch
        qpid-proton-debuginfo-0.7-2.el6.i686
        qpid-qmf-0.22-33.el6.i686
        qpid-qmf-debuginfo-0.22-33.el6.i686
        qpid-qmf-devel-0.22-33.el6.i686
        qpid-snmpd-1.0.0-16.el6.i686
        qpid-snmpd-debuginfo-1.0.0-16.el6.i686
        qpid-tests-0.22-15.el6.noarch
        qpid-tools-0.22-13.el6.noarch
        ruby-qpid-qmf-0.22-33.el6.i686

      How reproducible:
      rarely; timing is the key

      Steps to Reproduce:
      1. have a configured 3-node cluster
      2. start the whole cluster up
      3. run a transactional sender to a durable queue address with durable messages and reconnect enabled
      4. repeatedly kill the primary broker once it is promoted

      Actual results:
      After a few kills the cluster ends up non-functional, with all brokers in joining. A single, isolated failure injected into a broker that is just being promoted is thus enough to bring qpid HA down.

      Expected results:
      Qpid HA should tolerate a single failure at a time.

      Additional info:
      Details on failure insertion:

      • the failure action is kill -9 `pidof qpidd`
      • let T1 be the time from failure insertion until the new primary is ready to serve
      • the failure insertion period T2 > T1, i.e. no cumulative failures are inserted while HA is still working through the promotion of a new primary (a small sketch of this gating follows the list)
        -> this fact (in my view) proves that there is a real issue
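
      As a small sketch of that constraint (again with assumed node addresses and an assumed 'active' status string marking a primary that is ready to serve), the driver can gate each kill on the previous recovery having completed, which keeps T2 strictly larger than T1:

      import subprocess
      import time

      NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]   # assumed cluster nodes
      MARGIN = 5.0                                               # extra seconds, so T2 = T1 + MARGIN > T1

      def primary_ready():
          """True once some broker reports 'active', i.e. the new primary is ready to serve."""
          for node in NODES:
              try:
                  out = subprocess.check_output(["qpid-ha", "status", "-b", "%s:5672" % node])
                  if out.strip().decode() == "active":
                      return True
              except subprocess.CalledProcessError:
                  pass
          return False

      def wait_before_next_kill():
          while not primary_ready():      # waiting out T1
              time.sleep(1)
          time.sleep(MARGIN)              # only then is the next failure inserted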


    People

      Assignee: Alan Conway (aconway)
      Reporter: Alan Conway (aconway)
      Votes: 0
      Watchers: 3
