Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
qpid-java-broker-8.0.6
-
None
-
None
Description
There is a risk of BDB HA Virtual Host start-up hanging when replica majority is lost during message store recovery operations. The stack trace like below is reported for the issue
2022-02-07 11:08:42,086 INFO [VirtualHostNode-node1-Config] (q.m.h.quorum_lost) - [Broker] [grp(/myha)/vhn(/node1)] HA-1009 : Insufficient replicas contactable 2022-02-07 11:08:42,087 ERROR [VirtualHostNode-node1-Config] (o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'myha'. Object will be put into ERROR state. org.apache.qpid.server.util.ConnectionScopedRuntimeException: Required number of nodes not reachable at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.handleDatabaseException(ReplicatedEnvironmentFacade.java:496) at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:333) at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.removeMessage(AbstractBDBMessageStore.java:354) at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore$StoredBDBMessage.remove(AbstractBDBMessageStore.java:1205) at org.apache.qpid.server.virtualhost.SynchronousMessageStoreRecoverer.recover(SynchronousMessageStoreRecoverer.java:136) at org.apache.qpid.server.virtualhost.AbstractVirtualHost.postCreateDefaultExchangeTasks(AbstractVirtualHost.java:2811) at org.apache.qpid.server.virtualhost.AbstractVirtualHost.onActivate(AbstractVirtualHost.java:2757) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1526) at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1505) at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1072) at org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1066) at org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2894) at org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2890) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.qpid.server.model.AbstractConfiguredObject$24.onSuccess(AbstractConfiguredObject.java:2889) at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1078) at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$ImmediateIfSameThreadExecutor.execute(TaskExecutorImpl.java:400) at org.apache.qpid.server.configuration.updater.TaskExecutorImpl.execute(TaskExecutorImpl.java:183) at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1164) at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:708) at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:113) at com.google.common.util.concurrent.Futures.addCallback(Futures.java:1045) at org.apache.qpid.server.model.AbstractConfiguredObject.addFutureCallback(AbstractConfiguredObject.java:2884) at org.apache.qpid.server.model.AbstractConfiguredObject.doAttainState(AbstractConfiguredObject.java:1065) at org.apache.qpid.server.model.AbstractConfiguredObject.access$600(AbstractConfiguredObject.java:97) at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:591) at org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:578) at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:639) at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:632) at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:248) at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper$1.run(TaskExecutorImpl.java:320) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper.call(TaskExecutorImpl.java:313) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464) at java.lang.Thread.run(Thread.java:748) Caused by: com.sleepycat.je.rep.InsufficientAcksException: (JE 7.4.5) Transaction: -4128977263 VLSN: 12,383,163,414, initiated at: 11:08:27. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 3. Missing replica acks: 1. Timeout: 15000ms. FeederState=node1(4)[UNKNOWN] No feeders. at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205) at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189) at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426) at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385) at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:228) at com.sleepycat.je.txn.Txn.commit(Txn.java:772) at com.sleepycat.je.Transaction.doCommit(Transaction.java:621) at com.sleepycat.je.Transaction.commit(Transaction.java:401) at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:329) ... 44 common frames omitted 2022-02-07 11:08:42,096 INFO [UNKNOWN node1(4)] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The node 'myha:node1' state is REPLICA 2022-02-07 11:08:43,704 INFO [StateChange-myha:node1] (q.m.h.role_changed) - [Broker] [grp(/myha)] HA-1010 : Role change reported: Node : 'node1' (iaasn00348250:5000) : from 'WAITING' to 'MASTER' 2022-02-07 11:08:43,704 INFO [StateChange-myha:node1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The environment facade is in open state for node myha:node1 2022-02-07 11:08:43,704 ERROR [Environment-myha:node1] (o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Failed to restart environment.
The VH needs to handle a majority loss gracefully by restarting the environment in order to rejoin the cluster