Uploaded image for project: 'Qpid'
  1. Qpid
  2. QPID-7192

[Broker-J] BDB HA Virtual Host Node does not restart successfully if JE environment has locally committed transactions requiring rollback

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • qpid-java-6.0, qpid-java-6.1, qpid-java-broker-7.1.0, qpid-java-broker-7.0.6
    • None
    • Broker-J
    • None

    Description

      Start-up of BDB HA VHN fails in the following scenario:
      1) When Transaction on Master node is in process of commit and the replicas BDB HA VHNs are stopped at the same time, the transaction is aborted with InsufficientReplicasException
      2) BDB HA VHN for replicas are restarted and the Replica environment detects a transaction requiring rollback from previous commit. The rollback transaction causes restart of JE Environment.
      3) BDB HA VHN for impacted replica continues activation and performs an intruder protection checks but Environment is restarting which results in exception "ConnectionScopedRuntimeException: Environment is restarting". The exception puts BDB HA VHN into ERRORED state. JE environment re-join the group successfully but BDB HA VHN is not aware about it, as StateChangeListener is not set and BDB HA VHN does not receive notifications about state transitions

      A manual operator intervention is required to recover from the issue: BDB HA VHN needs to be started up again by invoking state change operation from Web Management Console or REST API.

      Here are JE Exceptions reported on Replica side on transaction rollback:

      BROKER-20 WARN  [DETACHED nodetestInFlightTransactionsWhilstMajorityIsLost10004(2)] o.a.q.s.s.b.r.ReplicatedEnvironmentFacade test:nodetestInFlightTransactionsWhilstMajorityIsLost10004 has transaction(s) ahead of the current master. These must be discarded to allow this node to rejoin the group. This condition is normally caused by the use of weak durability options.
      BROKER-20 DEBUG [DETACHED nodetestInFlightTransactionsWhilstMajorityIsLost10004(2)] o.a.q.s.s.b.r.ReplicatedEnvironmentFacade Environment restarting due to exception (JE 5.0.104) nodetestInFlightTransactionsWhilstMajorityIsLost10004(2):/tmp/qpid-work-org.apache.qpid.server.store.berkeleydb.replication.MultiNodeTest.testInFlightTransactionsWhilstMajorityIsLost-20-6184300293922905690/test/config Node nodetestInFlightTransactionsWhilstMajorityIsLost10004(2):/tmp/qpid-work-org.apache.qpid.server.store.berkeleydb.replication.MultiNodeTest.testInFlightTransactionsWhilstMajorityIsLost-20-6184300293922905690/test/config must rollback 3 commits to the earliest point indicated by transaction id=-142 time=2016-04-08 00:16:16.384 vlsn=356 lsn=0x0/0x8332 in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x0, offset 0x33469, vlsn 353 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed.
      com.sleepycat.je.rep.RollbackException: (JE 5.0.104) nodetestInFlightTransactionsWhilstMajorityIsLost10004(2):/tmp/qpid-work-org.apache.qpid.server.store.berkeleydb.replication.MultiNodeTest.testInFlightTransactionsWhilstMajorityIsLost-20-6184300293922905690/test/config Node nodetestInFlightTransactionsWhilstMajorityIsLost10004(2):/tmp/qpid-work-org.apache.qpid.server.store.berkeleydb.replication.MultiNodeTest.testInFlightTransactionsWhilstMajorityIsLost-20-6184300293922905690/test/config must rollback 3 commits to the earliest point indicated by transaction id=-142 time=2016-04-08 00:16:16.384 vlsn=356 lsn=0x0/0x8332 in order to rejoin the replication group. All existing ReplicatedEnvironment handles must be closed and reinstantiated.  Log files were truncated to file 0x0, offset 0x33469, vlsn 353 HARD_RECOVERY: Rolled back past transaction commit or abort. Must run recovery by re-opening Environment handles Environment is invalid and must be closed.
              at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.setupHardRecovery(ReplicaFeederSyncup.java:650) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.verifyRollback(ReplicaFeederSyncup.java:341) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.stream.ReplicaFeederSyncup.execute(ReplicaFeederSyncup.java:148) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:605) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:396) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:332) ~[je-5.0.104.jar:5.0.104]
              at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1506) ~[je-5.0.104.jar:5.0.104]
      

      The above causes the below

      BROKER-20 DEBUG [HttpManagement-http-74] o.a.q.s.m.p.f.ExceptionHandlingFilter Exception in servlet '/api/latest/virtualhostnode/nodetestInFlightTransactionsWhilstMajorityIsLost10004':
      org.apache.qpid.server.util.ConnectionScopedRuntimeException: Environment is restarting
              at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.getEnvironment(ReplicatedEnvironmentFacade.java:1342) ~[qpid-bdbstore-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.getNodes(ReplicatedEnvironmentFacade.java:965) ~[qpid-bdbstore-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.virtualhostnode.berkeleydb.BDBHAVirtualHostNodeImpl.activate(BDBHAVirtualHostNodeImpl.java:361) ~[qpid-bdbstore-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.virtualhostnode.AbstractVirtualHostNode.doActivate(AbstractVirtualHostNode.java:160) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_80]
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_80]
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_80]
              at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_80]
              at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1308) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1287) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject.attainStateIfOpenedOrReopenFailed(AbstractConfiguredObject.java:1271) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject.access$1700(AbstractConfiguredObject.java:80) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject$15.execute(AbstractConfiguredObject.java:1519) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject$15.execute(AbstractConfiguredObject.java:1455) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:561) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:554) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:270) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper$1.run(TaskExecutorImpl.java:342) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_80]
              at javax.security.auth.Subject.doAs(Subject.java:356) ~[na:1.7.0_80]
              at org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper.call(TaskExecutorImpl.java:335) ~[qpid-broker-core-6.1.0-SNAPSHOT.jar:6.1.0-SNAPSHOT]
              at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_80]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_80]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_80]
              at java.lang.Thread.run(Thread.java:745) [na:1.7.0_80]
      

      Test MultiNodeTest.testInFlightTransactionsWhilstMajorityIsLost sporadically fails because of this issue.

      Attachments

        1. QPID-7192-wip.diff
          21 kB
          Alex Rudyy

        Activity

          People

            Unassigned Unassigned
            orudyy Alex Rudyy
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: