Qpid / QPID-5719

HA becomes unresponsive once any of the brokers are SIGSTOPed

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28
    • Fix Version/s: 0.29
    • Component/s: C++ Clustering
    • Labels:
      None

      Description

      See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638

      Description of problem:

      Qpid HA becomes unresponsive once any of the brokers is stopped with SIGSTOP.

      There are three different cases:
      a] stopped ALL brokers
      b] stopped the primary
      c] stopped a backup

      In each of the cases listed above, the following observations were made:

      a-c] RHCS clustat is unaffected and reports that everything is OK
      a-c] qpid-ha (status --all) hangs
      a,b,c*] all other clients are blocked indefinitely
      a-b] clients are blocked directly from the beginning
      c*] clients are blocked at the end; the client is able to recover after a minute or so,
      due to the connection timeout
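
The hang can be illustrated outside Qpid. The following is a minimal sketch, assuming a POSIX system and using a tiny echo server as a hypothetical stand-in for a broker (the names `CHILD`, `ping` and `demo` are illustrative and not part of Qpid). While the process is stopped, the kernel keeps its listening socket open and still completes TCP connects, so a peer sees no disconnect. Only a client-side timeout reveals that nothing is answering.

```python
import signal
import socket
import subprocess
import sys
import time

# Hypothetical stand-in for a broker: a loopback echo server that prints its port.
CHILD = r"""
import socket
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)
print(srv.getsockname()[1], flush=True)
while True:
    conn, _ = srv.accept()
    try:
        conn.sendall(conn.recv(1024))
    except OSError:
        pass  # the peer gave up while we were stopped
    finally:
        conn.close()
"""


def ping(port, timeout):
    """Return True if the server echoes our message within `timeout` seconds."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            s.sendall(b"ping")
            return s.recv(1024) == b"ping"
    except socket.timeout:
        return False


def demo():
    """Ping the server before SIGSTOP, while stopped, and after SIGCONT."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    try:
        port = int(proc.stdout.readline())
        before = ping(port, 2.0)
        proc.send_signal(signal.SIGSTOP)
        time.sleep(0.2)            # give the stop a moment to take effect
        during = ping(port, 1.0)   # connect succeeds, but the reply never comes
        proc.send_signal(signal.SIGCONT)
        after = ping(port, 2.0)
        return before, during, after
    finally:
        proc.kill()                # SIGKILL also works on a stopped process
        proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print(demo())
```

On a typical Linux system the three results should be `True`, `False`, `True`. Without the timeout passed to `ping`, the middle call would block for as long as the process stays stopped, which matches the client behaviour reported above.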

      In fact, this defect also shows that qpid-ha can be out of sync with clustat, as tracked by the BZ.

      The expectations are:

      • a] quorum is lost and HA goes down (same as kill -9 on all nodes);
        no clients are able to communicate
      • b] a new primary is promoted; there has to be a mechanism to get rid of the stopped process;
        clients should be able to communicate after recovery
      • c] the unresponsive backup should get restarted;
        clients should be able to communicate once the backup has been detected as unresponsive
      • Generally, better integration between the Qpid HA environment and RHCS is needed,
        i.e. SIGSTOP detection
      • A heartbeat between the primary and the backups is probably needed
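
The heartbeat idea in the last item can be sketched as follows. This is only an illustration of timeout-based liveness detection, not the contents of ha-heartbeat.diff and not a Qpid API. The class name `HeartbeatMonitor`, its methods and the timeout value are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Record when each peer was last heard from and report silent peers."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated per peer
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, peer):
        """Call whenever a heartbeat (or any traffic) arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def unresponsive(self):
        """Return the peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now - seen > self.timeout)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=2.0)
    monitor.beat("backup1")
    time.sleep(0.1)
    print(monitor.unresponsive())  # backup1 was heard from recently
```

A stopped broker sends no heartbeats, so it would show up in `unresponsive()` once the timeout elapses even though its connections remain open. What to do next (restart the backup, promote a new primary) is a policy decision outside this sketch.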

        Attachments

        1. ha-heartbeat.diff
          14 kB
          Alan Conway

            People

            • Assignee:
              aconway Alan Conway
              Reporter:
              aconway Alan Conway