QPID-4360

Non-ready HA broker can be incorrectly promoted

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18
    • Fix Version/s: 0.19
    • Component/s: C++ Clustering
    • Labels:
      None

      Description

      Description of problem:
      rgmanager can promote a non-ready backup HA broker to primary even when other backup brokers are available in the ready state. This can result in the loss of messages and broker configuration. It can also cause the previously ready backups to throw exceptions when they connect to the new primary:

      Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [HA] critical Backup queue Queue1: Replication failed: Invalid position move, preceeds messages
      Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [Protocol] error Unexpected exception: Invalid position move, preceeds messages
      Sep 20 10:17:18 itcm12 qpidd[10871]: 2012-09-20 10:17:18 [Broker] error Connection 10.3.100.12:43837-10.3.100.105:9006 closed by error: Invalid position move, preceeds messages(501)

      Version-Release number of selected component (if applicable):
      Qpid 0.18

      How reproducible:
      100%

      Steps to Reproduce:
      1. Start a primary and backup broker
      2. Inject messages into the primary and ensure messages replicate to backup
      3. Restart the primary broker and manually promote the restarted broker to primary

      Actual results:
      Restarted broker becomes primary

      Expected results:
      Restarted broker refuses to become primary since at least one ready backup was discovered within some timeout

        Activity

        Alan Conway added a comment -

        There was a test bug in the initial checkin, fixed on trunk:

        ------------------------------------------------------------------------
        r1396244 | aconway | 2012-10-09 15:52:24 -0400 (Tue, 09 Oct 2012) | 7 lines

        QPID-4360: Fix test bug: Non-ready HA broker can be incorrectly promoted to primary.

        Test test_delete_missing_response was failing with "cluster active, cannot promote".

        • Fixed test bug: "fake" primary triggered "cannot promote".
        • Backup: always create QueueReplicator if not already existing.
        • Terminology change: "initial" queues -> "catch-up" queues.

        ------------------------------------------------------------------------

        Alan Conway added a comment -

        Committed on trunk:

        ------------------------------------------------------------------------
        r1394706 | aconway | 2012-10-05 14:21:45 -0400 (Fri, 05 Oct 2012) | 10 lines

        QPID-4360: Non-ready HA broker can be incorrectly promoted to primary

        A joining broker now attempts to contact all known members of the cluster and
        check their status. If any brokers are in a state other than "joining" the
        broker will refuse to promote. This will allow rgmanager to continue to try
        addresses until it finds a ready broker.

        Note this requires ha-brokers-url to be a list of all known brokers, not a
        virtual IP. ha-public-url can still be a VIP.

        ------------------------------------------------------------------------
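
        The check described in the commit message above can be sketched roughly as follows. This is an illustrative Python model, not the C++ code from r1394706: the function names, the `query_status` callback, the plain-string status values, and the choice to skip unreachable members are all assumptions made for the example. Only the rule itself (refuse to promote if any other known member reports a state other than "joining") comes from the commit message.

```python
class PromotionRefused(Exception):
    """Raised when the cluster already has a member past the 'joining' state."""


def blocking_members(self_url, brokers_url, query_status):
    """Return (url, status) for every other member that is not 'joining'.

    brokers_url  -- list of all known broker addresses (not a virtual IP)
    query_status -- hypothetical callback: address -> status string;
                    raises ConnectionError if the member is unreachable
    """
    blockers = []
    for url in brokers_url:
        if url == self_url:
            continue  # do not ask ourselves
        try:
            status = query_status(url)
        except ConnectionError:
            # Assumption: an unreachable member does not block promotion.
            continue
        if status != "joining":
            blockers.append((url, status))
    return blockers


def promote(self_url, brokers_url, query_status):
    """Refuse promotion when any other member is beyond 'joining'."""
    blockers = blocking_members(self_url, brokers_url, query_status)
    if blockers:
        raise PromotionRefused("cluster active, cannot promote: %r" % (blockers,))
    return "active"


# Example with canned statuses standing in for real status queries.
statuses = {"tcp:node2": "ready", "tcp:node3": "joining"}
members = ["tcp:node1", "tcp:node2", "tcp:node3"]
try:
    promote("tcp:node1", members, statuses.__getitem__)
except PromotionRefused as e:
    print(e)
```

        Because the restarted broker refuses, a resource manager such as rgmanager can move on and try the next address until it reaches a broker that is ready. The sketch also shows why the broker list must name every member: with only a virtual IP there would be no way to ask each broker individually.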


          People

          • Assignee:
            Alan Conway
            Reporter:
            Alan Conway
          • Votes:
            0
            Watchers:
            1
