Qpid / QPID-6972

BDB HA: Node may remain detached from group following loss of quorum

Details

    Description

      If a master detects that it has lost quorum (which may happen when a user-generated transaction, or an internally generated 'ping' transaction, fails to receive the required number of replica acknowledgements), the underlying JE ReplicatedEnvironment is automatically restarted (the old one is closed and a new one is created to replace it). This approach ensures that clients reconnect to a new master in a timely way.
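
The restart-on-lost-quorum behaviour described above can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `FakeEnvironment` and `Node` are stand-ins written for this illustration and are not Qpid or JE classes, and the acknowledgement threshold assumes a simple-majority policy in which the master counts towards the majority.

```java
public class QuorumRestartSketch {

    /** Hypothetical stand-in for a replicated environment; not a JE class. */
    static final class FakeEnvironment {
        final int generation;
        boolean open = true;

        FakeEnvironment(int generation) {
            this.generation = generation;
        }

        void close() {
            open = false;
        }
    }

    /** Hypothetical master node that restarts its environment when quorum is lost. */
    static final class Node {
        private final int groupSize;
        private int generations = 0;
        private FakeEnvironment environment;

        Node(int groupSize) {
            this.groupSize = groupSize;
            this.environment = new FakeEnvironment(++generations);
        }

        /** Assumption: simple majority, with the master counting as one member of it. */
        int requiredReplicaAcks() {
            return groupSize / 2;
        }

        /** Returns true if the commit saw enough acknowledgements; otherwise restarts. */
        boolean commit(int replicaAcks) {
            if (replicaAcks >= requiredReplicaAcks()) {
                return true;
            }
            restartEnvironment();
            return false;
        }

        /** Close the old environment and create a new one to replace it. */
        private void restartEnvironment() {
            environment.close();
            environment = new FakeEnvironment(++generations);
        }

        FakeEnvironment environment() {
            return environment;
        }
    }

    public static void main(String[] args) {
        Node node = new Node(3);
        System.out.println("commit with 1 ack succeeded: " + node.commit(1));
        System.out.println("commit with 0 acks succeeded: " + node.commit(0));
        System.out.println("environment generation now: " + node.environment().generation);
    }
}
```

In this sketch a failed commit leaves the node holding a fresh, open environment, which is the state the restart is meant to reach.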

      A coding error in the CoalescingCommitter means that the JE environment restart may not complete properly. If quorum disappears whilst there are jobs on the CoalescingCommitter's job queue, the CoalescingCommitter's error handling will cause the BDB EnvironmentFacade to be closed. This is acceptable in the BDB non-HA case, where such an exception is always fatal, but in the HA case calling ReplicatedEnvironmentFacade#close() prevents the environment from being recreated.
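
The failure mode can be sketched as follows. This is an illustration under stated assumptions, not the Qpid source or the actual fix: `Facade`, `Committer`, `restartEnvironment()`, `onFlushFailure()` and the `closeFacadeOnFailure` flag are hypothetical names invented here. The only behaviour taken from the description is that the committer's error handling closes the facade, and that a closed HA facade cannot recreate its environment.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class CommitterFailureSketch {

    /** Hypothetical facade: once closed, it can never recreate its environment. */
    static class Facade {
        private boolean closed;
        private int restarts;

        void close() {
            closed = true;
        }

        /** Returns true if the environment was recreated. */
        boolean restartEnvironment() {
            if (closed) {
                return false; // a closed facade stays closed
            }
            restarts++;
            return true;
        }

        boolean isClosed() {
            return closed;
        }

        int restarts() {
            return restarts;
        }
    }

    /** Hypothetical committer that batches commit jobs on a queue. */
    static class Committer {
        private final Facade facade;
        private final boolean closeFacadeOnFailure;
        private final Queue<String> jobs = new ArrayDeque<>();

        Committer(Facade facade, boolean closeFacadeOnFailure) {
            this.facade = facade;
            this.closeFacadeOnFailure = closeFacadeOnFailure;
        }

        void submit(String job) {
            jobs.add(job);
        }

        /** Called when flushing queued jobs fails; returns the number of jobs failed. */
        int onFlushFailure() {
            int failed = jobs.size();
            jobs.clear();
            if (closeFacadeOnFailure) {
                facade.close(); // fine when the failure is fatal, harmful for HA
            }
            return failed;
        }
    }

    public static void main(String[] args) {
        // Behaviour described in this issue: the failure path closes the facade.
        Facade closing = new Facade();
        Committer defective = new Committer(closing, true);
        defective.submit("txn-1");
        defective.onFlushFailure();
        System.out.println("restart after close: " + closing.restartEnvironment());

        // Assumed alternative for HA: fail the jobs but leave the facade open.
        Facade open = new Facade();
        Committer tolerant = new Committer(open, false);
        tolerant.submit("txn-1");
        tolerant.onFlushFailure();
        System.out.println("restart without close: " + open.restartEnvironment());
    }
}
```

The first case shows why the node stays detached: the restart that should follow a loss of quorum has nothing left to restart. The second case is only one plausible shape for HA-aware error handling.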

      The effect of this defect is that a node may disappear from the group every time quorum is temporarily lost. This will keep occurring until too few nodes remain to form a quorum, at which point the service will stop. Bouncing the affected brokers (or restarting the virtual host nodes, VHNs) will restore the service without message loss.

          People

            Unassigned
            Keith Wall (kwall)