QPID-5007

Qpid HA cluster does not support failback in an ordered domain.

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.22
    • Fix Version/s: None
    • Component/s: C++ Clustering
    • Labels: None

      Description

      rgmanager has the notion of an ordered failover domain, in which it tries to start services on the highest-priority available node in the domain
      (see https://fedorahosted.org/cluster/wiki/FailoverDomains).

      The problem arises like this:

      • Start a 2-node cluster with an ordered domain.
      • Create a queue and put enough messages on it that catch-up takes longer than the time to restart node1.
      • Kill node1; rgmanager relocates the qpidd-primary service to node2.
      • Immediately restart node1.
      • rgmanager wants to relocate the service back to node1, so it:
        • kills the primary on node2 as the first step of relocation
        • attempts to restart the primary on node1, which fails because
          node1 is still in catch-up and there is no longer a primary to
          catch up from.
      • At this point we get into an infinite loop of failed attempts to
        restart the primary.

      The workaround is to set the nofailback option on the domain.
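
      For illustration, a minimal sketch of what the workaround might look like in the rgmanager section of cluster.conf. The domain name, node names and priority values are hypothetical and not taken from this report; only the ordered and nofailback attributes are the point of the example, and the rest of the file (including the qpidd-primary service definition that refers to the domain) is omitted.

      ```xml
      <!-- Sketch of a failover domain in the <rm> section of cluster.conf.
           Names and priority values below are assumptions for illustration. -->
      <rm>
        <failoverdomains>
          <!-- ordered="1": rgmanager prefers nodes according to their priority.
               nofailback="1": a service already running on another node is not
               moved back when a more preferred node rejoins the cluster. -->
          <failoverdomain name="qpidd-primary-domain" ordered="1" nofailback="1">
            <failoverdomainnode name="node1" priority="1"/>
            <failoverdomainnode name="node2" priority="2"/>
          </failoverdomain>
        </failoverdomains>
      </rm>
      ```

      With nofailback set, the primary would stay on node2 after node1 restarts, so node1 can finish catching up as a backup instead of being promoted prematurely.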

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=970657

            People

            • Assignee: Unassigned
            • Reporter: Alan Conway (aconway)