Uploaded image for project: 'Qpid'
  1. Qpid
  2. QPID-8576

BDB HA Virtual Host start-up can hang when majority is lost during message store recovery operations

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • qpid-java-broker-8.0.6
    • None
    • Broker-J
    • None

    Description

      There is a risk of BDB HA Virtual Host start-up hanging when replica majority is lost during message store recovery operations. The stack trace like below is reported for the issue

      2022-02-07 11:08:42,086 INFO  [VirtualHostNode-node1-Config] (q.m.h.quorum_lost) - [Broker] [grp(/myha)/vhn(/node1)] HA-1009 : Insufficient replicas contactable
      2022-02-07 11:08:42,087 ERROR [VirtualHostNode-node1-Config] (o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'myha'.  Object will be put into ERROR state.
      org.apache.qpid.server.util.ConnectionScopedRuntimeException: Required number of nodes not reachable
      	at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.handleDatabaseException(ReplicatedEnvironmentFacade.java:496)
      	at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:333)
      	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.removeMessage(AbstractBDBMessageStore.java:354)
      	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore$StoredBDBMessage.remove(AbstractBDBMessageStore.java:1205)
      	at org.apache.qpid.server.virtualhost.SynchronousMessageStoreRecoverer.recover(SynchronousMessageStoreRecoverer.java:136)
      	at org.apache.qpid.server.virtualhost.AbstractVirtualHost.postCreateDefaultExchangeTasks(AbstractVirtualHost.java:2811)
      	at org.apache.qpid.server.virtualhost.AbstractVirtualHost.onActivate(AbstractVirtualHost.java:2757)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1526)
      	at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1505)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1072)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1066)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2894)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2890)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:360)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$24.onSuccess(AbstractConfiguredObject.java:2889)
      	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1078)
      	at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$ImmediateIfSameThreadExecutor.execute(TaskExecutorImpl.java:400)
      	at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.execute(TaskExecutorImpl.java:183)
      	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1164)
      	at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:708)
      	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:113)
      	at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1045)
      	at org.apache.qpid.server.model.AbstractConfiguredObject.addFutureCallback(AbstractConfiguredObject.java:2884)
      	at org.apache.qpid.server.model.AbstractConfiguredObject.doAttainState(AbstractConfiguredObject.java:1065)
      	at org.apache.qpid.server.model.AbstractConfiguredObject.access$600(AbstractConfiguredObject.java:97)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:591)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:578)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:639)
      	at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:632)
      	at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:248)
      	at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper$1.run(TaskExecutorImpl.java:320)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:360)
      	at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper.call(TaskExecutorImpl.java:313)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
      	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: com.sleepycat.je.rep.InsufficientAcksException: (JE 7.4.5) Transaction: -4128977263  VLSN: 12,383,163,414, initiated at: 11:08:27.  Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 3. Missing replica acks: 1. Timeout: 15000ms. FeederState=node1(4)[UNKNOWN]
      No feeders.
      
      	at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205)
      	at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189)
      	at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426)
      	at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385)
      	at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:228)
      	at com.sleepycat.je.txn.Txn.commit(Txn.java:772)
      	at com.sleepycat.je.Transaction.doCommit(Transaction.java:621)
      	at com.sleepycat.je.Transaction.commit(Transaction.java:401)
      	at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:329)
      	... 44 common frames omitted
      2022-02-07 11:08:42,096 INFO  [UNKNOWN node1(4)] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The node 'myha:node1' state is REPLICA
      2022-02-07 11:08:43,704 INFO  [StateChange-myha:node1] (q.m.h.role_changed) - [Broker] [grp(/myha)] HA-1010 : Role change reported: Node : 'node1' (iaasn00348250:5000) : from 'WAITING' to 'MASTER'
      2022-02-07 11:08:43,704 INFO  [StateChange-myha:node1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The environment facade is in open state for node myha:node1
      2022-02-07 11:08:43,704 ERROR [Environment-myha:node1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Failed to restart environment.
      
      

      The VH needs to handle a majority loss gracefully by restarting the environment in order to rejoin the cluster

      Attachments

        Activity

          People

            Unassigned Unassigned
            orudyy Alex Rudyy
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: