Qpid / QPID-5719

HA becomes unresponsive once any of the brokers is SIGSTOPed



    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.28
    • Fix Version/s: 0.29
    • Component/s: C++ Clustering
    • Labels: None


      See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638

      Description of problem:

      Qpid HA becomes unresponsive once any of the brokers is SIGSTOPed.

      There are three different cases:
      a] stopped ALL brokers
      b] stopped the primary
      c] stopped a backup

      In each of the cases listed above, the following observations were made:

      a-c] RHCS clustat looks fine and reports that everything is OK
      a-c] qpid-ha (status --all) hangs
      a,b,c*] all other clients are blocked indefinitely
      a-b] right from the beginning
      c*] at the end; the client is able to recover after a minute or so,
      due to the connection timeout

      This defect also shows that qpid-ha can be out of sync with clustat, as tracked by BZ.
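
The gap between a healthy-looking status and a hung client can be reproduced outside Qpid. The following is a minimal sketch in Python, not Qpid or RHCS code: a hypothetical echo child process stands in for a broker, `pid_exists` stands in for a check that only asks whether the process exists, and `ping` stands in for a client request with a timeout. Whether clustat's check works this way is an assumption, not something this report states. The sketch is POSIX-only because it relies on SIGSTOP and SIGCONT.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for a broker: echoes every line it receives.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def pid_exists(pid):
    """Existence-only check: signal 0 tests for the process without disturbing it."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def ping(proc, timeout):
    """Send one line and wait at most `timeout` seconds for the echo."""
    proc.stdin.write(b"ping\n")
    proc.stdin.flush()
    ready, _, _ = select.select([proc.stdout], [], [], timeout)
    return bool(ready) and proc.stdout.readline() == b"ping\n"


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        before = ping(proc, 5.0)                # responsive while running
        os.kill(proc.pid, signal.SIGSTOP)
        os.waitpid(proc.pid, os.WUNTRACED)      # block until the child is really stopped
        exists = pid_exists(proc.pid)           # the process still exists
        during = ping(proc, 1.0)                # but it no longer answers
        os.kill(proc.pid, signal.SIGCONT)
        after = ping(proc, 5.0)                 # answers again once resumed
    finally:
        proc.kill()                             # SIGKILL also terminates a stopped process
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
    return before, exists, during, after


if __name__ == "__main__":
    print(demo())
```

Without the timeout in `ping`, the second request would block for as long as the child stays stopped, which matches the indefinitely blocked clients described above.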

      The expectations are:

      • a] quorum is lost and HA goes down (same as kill -9 on all nodes)
        no clients are able to communicate
      • b] a new primary is promoted; there has to be a mechanism to get rid of the stopped process
        clients should be able to communicate after recovery
      • c] the unresponsive backup should be restarted
        clients should be able to communicate once the backup has been detected as unresponsive
      • In general, better integration between the Qpid HA environment and RHCS is needed,
        i.e. SIGSTOP detection
      • A heartbeat between the primary and the backups is probably needed
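
The heartbeat idea and the expected outcomes for cases a] to c] can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the content of the attached patch and not Qpid's API: `HeartbeatMonitor`, `expected_action`, the broker names, and the action strings are all hypothetical. A broker counts as unresponsive when no heartbeat has been recorded from it for longer than `timeout` seconds, which is also what a SIGSTOPed process would look like from the outside.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per broker and reports the silent ones."""

    def __init__(self, brokers, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so the logic can be tested
        now = clock()
        self._last_seen = {broker: now for broker in brokers}

    def beat(self, broker):
        """Record a heartbeat received from `broker`."""
        self._last_seen[broker] = self._clock()

    def unresponsive(self):
        """Return the set of brokers silent for longer than the timeout."""
        now = self._clock()
        return {
            broker
            for broker, seen in self._last_seen.items()
            if now - seen > self._timeout
        }


def expected_action(primary, brokers, unresponsive):
    """Map the set of unresponsive brokers to the outcome expected in this report."""
    if set(brokers) <= set(unresponsive):
        return "ha-down"             # case a]: all brokers stopped, quorum lost
    if primary in unresponsive:
        return "promote-backup"      # case b]: primary stopped
    if unresponsive:
        return "restart-backup"      # case c]: a backup stopped
    return "ok"


if __name__ == "__main__":
    brokers = ["broker1", "broker2", "broker3"]
    monitor = HeartbeatMonitor(brokers, timeout=5.0)
    monitor.beat("broker1")
    print(expected_action("broker1", brokers, monitor.unresponsive()))
```

Passing the clock in as a parameter keeps the timeout logic deterministic under test; a real deployment would also have to decide who acts on the result, which this report leaves open.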


        Attachments:
        1. ha-heartbeat.diff (14 kB, Alan Conway)



            aconway Alan Conway