QPID-5942: qpid HA cluster may end up in joining state after HA primary is killed

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28
    • Fix Version/s: 0.30
    • Component/s: C++ Clustering
    • Labels: None

    Description

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=1117823

      Description of problem:

      qpid HA cluster may end up in joining state after the HA primary is killed.

      Test scenario:
      We have a 3-node qpid HA cluster, with all three nodes operational.
      A sender is then run, sending to a queue (purely transactional, with durable messages and a durable queue address); a rough sketch of such a sender is given after the status output below.
      While it is sending, the primary broker is killed multiple times.
      After the N-th kill of the primary, the cluster is no longer functional, as all qpid brokers end up in the joining state:

      [root@dhcp-lab-216 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining
      [root@dhcp-x-216 ~]# clustat
      Cluster Status for dtests_ha @ Wed Jul 9 14:38:44 2014
      Member Status: Quorate

      Member Name                    ID   Status
      ------ ----                    ---- ------
      192.168.6.60                   1    Online, Local, rgmanager
      192.168.6.61                   2    Online, rgmanager
      192.168.6.62                   3    Online, rgmanager

      Service Name                   Owner (Last)          State
      ------- ----                   ----- ------          -----
      service:qpidd_1                192.168.6.60          started
      service:qpidd_2                192.168.6.61          started
      service:qpidd_3                192.168.6.62          started
      service:qpidd_primary          (192.168.6.62)        stopped

      [root@dhcp-x-165 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining

      [root@dhcp-x-218 ~]# qpid-ha status --all
      192.168.6.60:5672 joining
      192.168.6.61:5672 joining
      192.168.6.62:5672 joining
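
      For illustration only, a minimal Python sender along the lines of the test scenario might look like the sketch below. It uses the python-qpid (qpid.messaging) client from the installed packages; the broker addresses, queue name and message count are assumptions, not taken from the actual test.

      from qpid.messaging import Connection, Message

      # Assumed failover list matching the three cluster nodes above.
      BROKERS = ["192.168.6.60:5672", "192.168.6.61:5672", "192.168.6.62:5672"]
      # Hypothetical durable queue address.
      ADDRESS = "test_queue; {create: always, node: {durable: True}}"

      conn = Connection.establish(BROKERS[0], reconnect=True, reconnect_urls=BROKERS)
      try:
          session = conn.session(transactional=True)
          sender = session.sender(ADDRESS)
          for i in range(100000):
              sender.send(Message("msg-%d" % i, durable=True))
              session.commit()          # one durable message per transaction
      finally:
          conn.close()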

      I believe the key to hitting the issue is to kill the newly promoted primary soon after it starts appearing in the starting/started state in clustat.

      My current understanding is that with a 3-node cluster, any failure applied to a single node at a time should be handled by HA. This is what the testing scenario does:
      A B C (nodes)
      pri bck bck
      kill
      bck pri bck
      kill
      bck bck pri
      kill
      ...
      pri bck bck
      kill
      bck bck bck

      It looks to me that there is a short window during the promotion of a new primary in which killing that freshly promoted primary causes the promotion procedure to get stuck with all brokers in joining.

      I haven't seen such behavior in the past; either we are now more sensitive to this case (after the -STOP case fixes), or turning durability on sharply raises the probability.
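
      To make the timing above concrete, a failure-injection driver could look roughly like the following sketch. The node list, the ssh-based kill and the exact set of qpid-ha status strings are assumptions; the real test harness may differ.

      import subprocess
      import time

      NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]   # assumed cluster nodes

      def ha_status(node):
          """Ask the broker on 'node' for its own HA status string."""
          out = subprocess.check_output(["qpid-ha", "status", "-b", "%s:5672" % node])
          return out.strip().decode()

      def find_new_primary():
          """Treat any broker that is no longer a plain backup as the newly promoted primary."""
          for node in NODES:
              try:
                  if ha_status(node) not in ("joining", "catchup", "ready"):
                      return node
              except subprocess.CalledProcessError:
                  pass                    # broker on this node is down, skip it
          return None

      while True:
          primary = find_new_primary()
          if primary:
              # Kill the primary shortly after it has been promoted.
              subprocess.call(["ssh", primary, "kill -9 $(pidof qpidd)"])
          time.sleep(1)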

      Version-Release number of selected component (if applicable):

      1. rpm -qa | grep qpid | sort
        perl-qpid-0.22-13.el6.i686
        perl-qpid-debuginfo-0.22-13.el6.i686
        python-qpid-0.22-15.el6.noarch
        python-qpid-proton-doc-0.5-9.el6.noarch
        python-qpid-qmf-0.22-33.el6.i686
        qpid-cpp-client-0.22-42.el6.i686
        qpid-cpp-client-devel-0.22-42.el6.i686
        qpid-cpp-client-devel-docs-0.22-42.el6.noarch
        qpid-cpp-client-rdma-0.22-42.el6.i686
        qpid-cpp-debuginfo-0.22-42.el6.i686
        qpid-cpp-server-0.22-42.el6.i686
        qpid-cpp-server-devel-0.22-42.el6.i686
        qpid-cpp-server-ha-0.22-42.el6.i686
        qpid-cpp-server-linearstore-0.22-42.el6.i686
        qpid-cpp-server-rdma-0.22-42.el6.i686
        qpid-cpp-server-xml-0.22-42.el6.i686
        qpid-java-client-0.22-6.el6.noarch
        qpid-java-common-0.22-6.el6.noarch
        qpid-java-example-0.22-6.el6.noarch
        qpid-jca-0.22-2.el6.noarch
        qpid-jca-xarecovery-0.22-2.el6.noarch
        qpid-jca-zip-0.22-2.el6.noarch
        qpid-proton-c-0.7-2.el6.i686
        qpid-proton-c-devel-0.7-2.el6.i686
        qpid-proton-c-devel-doc-0.5-9.el6.noarch
        qpid-proton-debuginfo-0.7-2.el6.i686
        qpid-qmf-0.22-33.el6.i686
        qpid-qmf-debuginfo-0.22-33.el6.i686
        qpid-qmf-devel-0.22-33.el6.i686
        qpid-snmpd-1.0.0-16.el6.i686
        qpid-snmpd-debuginfo-1.0.0-16.el6.i686
        qpid-tests-0.22-15.el6.noarch
        qpid-tools-0.22-13.el6.noarch
        ruby-qpid-qmf-0.22-33.el6.i686

      How reproducible:
      rarely; timing is the key

      Steps to Reproduce:
      1. have a configured 3-node cluster
      2. start the whole cluster up
      3. run a transactional sender to a durable queue address with durable messages and reconnect enabled
      4. repeatedly kill the primary broker once it is promoted

      Actual results:
      After a few kills the cluster ends up non-functional, with all brokers in joining. A single, isolated failure injected into a broker that is just being promoted is thus enough to bring qpid HA down.

      Expected results:
      Qpid HA should tolerate a single failure at a time.

      Additional info:
      Details on failure insertion:

      • the failure action is kill -9 `pidof qpidd`
      • let T1 be the time from failure insertion until the new primary is ready to serve
      • the failure insertion period T2 > T1, i.e. no cumulative failures are inserted while HA is still working through the promotion of a new primary (a small sketch of this gating follows the list)
        -> this fact (in my view) proves that there is a real issue
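
      As a small sketch of that constraint (again with assumed node addresses and an assumed 'active' status string marking a primary that is ready to serve), the driver can gate each kill on the previous recovery having completed, which keeps T2 strictly larger than T1:

      import subprocess
      import time

      NODES = ["192.168.6.60", "192.168.6.61", "192.168.6.62"]   # assumed cluster nodes
      MARGIN = 5.0                                               # extra seconds, so T2 = T1 + MARGIN > T1

      def primary_ready():
          """True once some broker reports 'active', i.e. the new primary is ready to serve."""
          for node in NODES:
              try:
                  out = subprocess.check_output(["qpid-ha", "status", "-b", "%s:5672" % node])
                  if out.strip().decode() == "active":
                      return True
              except subprocess.CalledProcessError:
                  pass
          return False

      def wait_before_next_kill():
          while not primary_ready():      # waiting out T1
              time.sleep(1)
          time.sleep(MARGIN)              # only then is the next failure inserted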


    People

      Assignee: Alan Conway (aconway)
      Reporter: Alan Conway (aconway)
      Votes: 0
      Watchers: 3
