Frantisek Reznicek 2014-07-09 08:59:30 EDT
Description of problem:
A qpid HA cluster may end up with all brokers in the joining state after the HA primary is killed.
Take a 3-node qpid HA cluster with all three nodes operational.
A sender is then started and sends to a queue (purely transactional, with durable messages and a durable queue address).
While it is running, the primary broker is killed multiple times.
After the N-th kill of the primary broker the cluster is no longer functional, because all qpid brokers end up in the joining state:
[root@dhcp-lab-216 ~]# qpid-ha status --all
[root@dhcp-x-216 ~]# clustat
Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
Member Status: Quorate
Member Name                  ID   Status
------ ----                  ---- ------
192.168.6.60                 1    Online, Local, rgmanager
192.168.6.61                 2    Online, rgmanager
192.168.6.62                 3    Online, rgmanager

Service Name            Owner (Last)     State
------- ----            ----- ------     -----
service:qpidd_1         192.168.6.60     started
service:qpidd_2         192.168.6.61     started
service:qpidd_3         192.168.6.62     started
service:qpidd_primary   (192.168.6.62)   stopped
[root@dhcp-x-165 ~]# qpid-ha status --all
[root@dhcp-x-218 ~]# qpid-ha status --all
I believe the key to hitting the issue is to kill the newly promoted primary soon after it starts appearing in the starting/started state in clustat.
My current understanding is that in a 3-node cluster, HA should handle any failure applied to a single node at a time. This is what the testing scenario does:
A    B    C    (nodes)
pri  bck  bck
bck  pri  bck
bck  bck  pri
pri  bck  bck
bck  bck  bck
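
The expected part of the table above can be written down as a small model. This is only a sketch: the round-robin choice of the next primary mirrors the rows shown above and is an assumption, since which backup actually gets promoted is decided by the cluster manager. The names expected_rotation and is_stuck are hypothetical helpers, not part of any qpid tool.

```python
def expected_rotation(nodes, kills):
    """Return the expected role row after 0..kills single primary kills.

    Assumption: the primary role moves to the next node in order and the
    killed broker comes back as a backup, as in the table in the report.
    """
    rows = []
    primary = 0
    for _ in range(kills + 1):
        rows.append(["pri" if i == primary else "bck" for i in range(len(nodes))])
        primary = (primary + 1) % len(nodes)
    return rows


def is_stuck(row):
    """A row without any primary corresponds to the reported failure state."""
    return "pri" not in row


if __name__ == "__main__":
    for row in expected_rotation(["A", "B", "C"], 3):
        print("  ".join(row))
    # The last row of the reported scenario has no primary at all.
    print(is_stuck(["bck", "bck", "bck"]))
```

With three kills the model yields the first four rows of the table; the fifth row (no primary on any node) is the state that should never be reached by single failures.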
It looks to me as if there is a short window during the promotion of a new primary in which killing that freshly promoted primary causes the promotion procedure to get stuck with all brokers joining.
I haven't seen such behavior in the past; either we are now more sensitive to this case (after the -STOP case fixes), or turning durability on sharply raises the probability.
Version-Release number of selected component (if applicable):
- rpm -qa | grep qpid | sort
How reproducible:
Rarely; timing is the key.
Steps to Reproduce:
1. Configure a 3-node cluster.
2. Start the whole cluster up.
3. Run a transactional sender, with reconnect enabled, sending durable messages to a durable queue address.
4. Repeatedly kill the primary broker once it is promoted.
Actual results:
After a few kills the cluster ends up non-functional, with all brokers in the joining state. It is possible to bring qpid HA down by inserting single, isolated failures into brokers that are just being promoted.
Expected results:
Qpid HA should tolerate a single failure at a time.
Details on failure insertion:
- The failure action is kill -9 `pidof qpidd`.
- Let T1 be the time between failure insertion and the new primary being ready to serve.
- The failure insertion period T2 is greater than T1, i.e. no cumulative failures are inserted while HA is going through the promotion of a new primary.
-> This fact (in my view) proves that there is a real issue.
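
The failure insertion described above could be driven by a loop like the following. It is a minimal sketch, not the actual test harness: insert_failures, wait_primary_ready and the injected kill callable are hypothetical names. How primary readiness is detected (for example from `qpid-ha status --all` or clustat output) and how the kill is directed at the node currently hosting the primary are left to the caller. kill_local_broker only covers the case where the script runs on the primary's node.

```python
import os
import signal
import subprocess
import time


def kill_local_broker():
    """Failure action from the report: kill -9 `pidof qpidd` on the local node."""
    out = subprocess.run(["pidof", "qpidd"], capture_output=True, text=True).stdout
    for pid in out.split():
        os.kill(int(pid), signal.SIGKILL)


def insert_failures(kills, t2, wait_primary_ready, kill,
                    clock=time.monotonic, sleep=time.sleep):
    """Insert `kills` failures, one per period of `t2` seconds.

    wait_primary_ready(timeout) is a caller-supplied (hypothetical) callable
    returning True once a primary is ready to serve, or False on timeout.
    Returns the measured T1 values; raises if T2 > T1 does not hold.
    """
    t1_values = []
    for n in range(1, kills + 1):
        start = clock()
        kill()
        if not wait_primary_ready(t2):
            raise RuntimeError(
                "no primary ready within T2=%ss after kill #%d" % (t2, n))
        t1 = clock() - start
        t1_values.append(t1)
        # Wait out the rest of the period so failures never overlap a promotion.
        sleep(max(0.0, t2 - t1))
    return t1_values
```

A RuntimeError from this loop corresponds to the reported symptom: a single failure was inserted, the full period T2 elapsed, and no broker became primary.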