AMQ-7455: Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.15.11, 5.15.12
    • Fix Version/s: None
    • Component/s: RAR

    Description

      Scenario: deploy the ActiveMQ RA to EAP 6.4 with XA-enabled JDBC DataSources and ActiveMQ Artemis as the JMS broker in a failover (active/passive) config, and run several applications that use JMS and the JDBC DataSources. (In our scenario there are 4 XA JDBC DataSources and 42 deployed WAR files.) The ActiveMQ resource adapter is used because the Artemis one is not compatible with this EAP version.

      JBoss EAP and Artemis are running on the same host in the examples below, but this does not matter. Artemis is running with example HA-by-network-replication settings on localhost:61616 and localhost:61617.

      The resource adapter settings on the JBoss EAP side and on the client side (especially the properties of the broker URIs for ActiveMQ's connection factory) are exactly as described in Red Hat's reference documentation for an HA config (a small sketch of the client-side URI follows the list):

      • Add ?jms.rmIdFromConnectionId=true to the connection URI
      • <is-same-rm-override>false</is-same-rm-override> in standalone.xml
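
      For illustration, a minimal sketch of the client-side part of that config, expressed programmatically with the example failover URI from this report (in the EAP deployment the URI is set on the resource adapter's connection definition rather than in code):

      import org.apache.activemq.ActiveMQXAConnectionFactory;

      public class RmIdFlagExample {
          public static void main(String[] args) {
              // jms.* options on the broker URL are applied to the connection factory;
              // rmIdFromConnectionId=true makes each physical connection report its own
              // connection id as the XA resource manager id (see the analysis below).
              ActiveMQXAConnectionFactory cf = new ActiveMQXAConnectionFactory(
                      "failover:(tcp://localhost:61616,tcp://localhost:61617)?jms.rmIdFromConnectionId=true");
              System.out.println(cf.getBrokerURL());
          }
      }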

      Observed Symptoms: On a regular basis, undeployment of WAR files takes minutes (around 10) instead of seconds, blocking other container management operations during this timeframe. This behavior can be traced back to running the ActiveMQ RA with Artemis as the JMS broker. When running the same WAR files with the same config, but using JBoss EAP's internal HornetQ broker, these delays do not occur and undeployments take mere seconds.

      Further Investigation: This only affects Java EE modules that define at least one MDB (message-driven bean). During shutdown of the beans, the container management thread is blocked while de-registering the bean's XA resource; see the following stack trace, taken while a simple module is being undeployed, for reference:

      "ServerService Thread Pool -- 277" #3709 prio=5 os_prio=0 tid=0x000000001d6e0000 nid=0x4df4 in Object.wait() [0x000000007c41e000]
         java.lang.Thread.State: WAITING (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Object.wait(Object.java:502)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.waitForScanState(XARecoveryModule.java:935)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.removeXAResourceRecoveryHelper(XARecoveryModule.java:92)
      	- locked <0x000000008162bd38> (a java.util.concurrent.atomic.AtomicInteger)
      	at com.arjuna.ats.jbossatx.jta.RecoveryManagerService.removeXAResourceRecovery(RecoveryManagerService.java:117)
      	at org.jboss.jca.core.tx.jbossts.XAResourceRecoveryRegistryImpl.removeXAResourceRecovery(XAResourceRecoveryRegistryImpl.java:63)
      	at org.jboss.jca.core.rar.EndpointImpl.deactivate(EndpointImpl.java:226)
      	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.deactivate(MessageDrivenComponent.java:224)
      	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.stop(MessageDrivenComponent.java:195)
      // ... and a couple more frames
      

      During these roughly 10 minutes of blocking, while the module is supposed to be undeployed, EAP's log is flooded with thousands of lines like these:

      11:19:43,640 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      11:19:43,643 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      11:19:43,647 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      11:19:43,657 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      11:19:43,661 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      11:19:43,664 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
      

      Only after these lines stop being logged does the undeployment process continue.

      What happens in ActiveMQ during this time: Obviously, thousands of connections are established to Artemis. Tracing back the code path that leads to these calls points to EAP's periodic XA resource recovery thread; see the following full stack trace (taken at the same point in time):

      "Periodic Recovery" #96 prio=5 os_prio=0 tid=0x000000002484a000 nid=0x1b28 waiting for monitor entry [0x0000000028dbe000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      	at java.lang.Object.wait(Native Method)
      	at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:623)
      	- locked <0x00000000cc69eeb8> (a java.lang.Object)
      	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
      	at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
      	at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
      	at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1392)
      	at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1486)
      	- locked <0x00000000cc69ff10> (a java.lang.Object)
      	at org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:527)
      	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.newConnection(ActiveMQResourceAdapter.java:361)
      	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.getResourceManagerId(ActiveMQResourceAdapter.java:278)
      	at org.apache.activemq.TransactionContext.isSameRM(TransactionContext.java:666)
      	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.isSameRM(ActiveMQResourceAdapter.java:261)
      	at org.jboss.jca.core.tx.jbossts.XAResourceWrapperImpl.isSameRM(XAResourceWrapperImpl.java:159)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.isSameRM(RecoveryXids.java:143)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.updateIfEquivalentRM(RecoveryXids.java:215)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.refreshXidScansForEquivalentXAResourceImpl(XARecoveryModule.java:796)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:570)
      	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:181)
      	- locked <0x000000008162ba98> (a com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule)
      	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)
      	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375)

      From this stack trace I especially want to point out this frame: org.apache.activemq.TransactionContext.isSameRM()

      This method is part of the XAResource contract. Simply put, the periodic recovery process iterates over XA resources and looks for dormant XA transactions that need to be finished in one way or another. Here it checks different combinations of XA resources to assess whether they came from the same resource manager (.isSameRM()).
      During this recovery process (which in the stack trace above is still in the phase of gathering basic information and not actually working on XA transactions), the deregistration of the message-driven beans is blocked.
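
      To make the cost concrete, here is a minimal sketch of why isSameRM() sits on the hot path of a recovery scan. This is only an illustration of the contract, not the Arjuna/Narayana source: each known XA resource is compared pairwise against the resources already scanned to decide whether they belong to the same resource manager.

      import javax.transaction.xa.XAException;
      import javax.transaction.xa.XAResource;
      import java.util.List;

      class RecoveryScanSketch {
          // Returns true if 'candidate' belongs to a resource manager that was already scanned.
          static boolean alreadyCovered(XAResource candidate, List<XAResource> scanned) throws XAException {
              for (XAResource known : scanned) {
                  // With the ActiveMQ RA wrapper, every single isSameRM() call below
                  // opens (and later closes) fresh broker connections.
                  if (candidate.isSameRM(known)) {
                      return true;
                  }
              }
              return false;
          }
      }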

      Analysis of why the recovery process takes so long: The XAResource instances that periodic recovery works on are created by org.apache.activemq.ra.ActiveMQResourceAdapter.getXAResources(ActivationSpec[]). The anonymous class that extends org.apache.activemq.TransactionContext and is instantiated there basically performs the following steps for each invocation of any of the XAResource contract methods (sketched after the list):

      1. Replace the current connection to Artemis with a freshly created one
      2. Call super
      3. Restore the current connection and close the new one
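
      A simplified sketch of that per-call pattern, with illustrative hook names only (the real adapter manipulates the connection field of TransactionContext inside ActiveMQResourceAdapter.getXAResources()):

      import javax.transaction.xa.XAException;
      import javax.transaction.xa.XAResource;

      // Illustrative shape of the wrapper: every XAResource method swaps in a
      // brand-new physical connection, delegates, then restores the previous
      // connection and closes the fresh one.
      abstract class PerCallConnectionXAResource implements XAResource {

          @Override
          public boolean isSameRM(XAResource other) throws XAException {
              Object previous = swapInFreshConnection();   // step 1: connect to the broker
              try {
                  return delegateIsSameRM(other);          // step 2: call super
              } finally {
                  closeAndRestore(previous);               // step 3: close the new connection, restore the old one
              }
          }

          // Hypothetical hooks standing in for the connection juggling in the real adapter.
          abstract Object swapInFreshConnection() throws XAException;
          abstract boolean delegateIsSameRM(XAResource other) throws XAException;
          abstract void closeAndRestore(Object previous) throws XAException;
      }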

      Specifically for isSameRM this results in the following chain of events:

      1. Create a pristine connection to remote Artemis (in our case a FailoverTransport)
      2. Call super.isSameRM(xaresource)
      3. If xaresource is null or is not an instance of TransactionContext, return false (org/apache/activemq/TransactionContext.java:658)
      4. Get the current resource manager id and xaresource's resource manager id. This effectively calls this.getResourceManagerId(). As noted above, this leads to the following calls:
      5. Replace the current connection to Artemis (established in step 1) with a second pristine one
      6. Call super.getResourceManagerId()
      7. Call getResourceManagerId() on the connection of step 5
      8. In org/apache/activemq/ActiveMQConnection.java:1622, the resource manager id is effectively the connection id in our setup (according to the flag set in the connection URI); this is illustrated right after the list
      9. This id is unique for each connection (see org/apache/activemq/ActiveMQConnection.java:240)
      10. Compare the resulting value with the resource manager id of xaresource and return the result
      11. Close the connection to the JMS broker established in step 5 and set the one established in step 1 as the "current" connection
      12. Close the connection established in step 1 and reset the current connection to whatever was there before (in my understanding that is always null, as the instances are always freshly created inside ActiveMQResourceAdapter.getXAResources)
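
      As a hedged illustration of steps 8 and 9 (example broker URL only, to be run against a live broker): two independently created connections to the same broker report different resource manager ids when rmIdFromConnectionId is enabled, so a comparison between resources backed by different connections can never come out as "same RM".

      import javax.jms.XAConnection;
      import org.apache.activemq.ActiveMQConnection;
      import org.apache.activemq.ActiveMQXAConnectionFactory;

      public class RmIdProbe {
          public static void main(String[] args) throws Exception {
              ActiveMQXAConnectionFactory cf = new ActiveMQXAConnectionFactory(
                      "failover:(tcp://localhost:61616,tcp://localhost:61617)?jms.rmIdFromConnectionId=true");
              XAConnection c1 = cf.createXAConnection();
              XAConnection c2 = cf.createXAConnection();
              c1.start();
              c2.start();
              // With rmIdFromConnectionId=true the RM id is derived from the (unique)
              // connection id, so the two values printed below differ.
              System.out.println(((ActiveMQConnection) c1).getResourceManagerId());
              System.out.println(((ActiveMQConnection) c2).getResourceManagerId());
              c1.close();
              c2.close();
          }
      }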

      This is repeated for each pair of XA resources, or rather activation specs, that the recovery manager encounters. The log spamming stems from steps 1 and 5, because each time an INFO message is logged that tells us a new connection has been established to the server.

      This also happens when the recovery process wants to know whether a JDBC connection to our SQL server (of which we have plenty, due to several modules with a connection-pooling data source each) is from the same resource manager as a JMS connection to Artemis, which it obviously never is.

      Problematic Aspects of this code:

      1. It doesn't make sense to establish pristine connections to the remote broker for each invocation of one of XAResource's methods. The connection setup is expensive with a plain non-failover URI as it is, and even more so with a failover transport; this is supported by tests and also evident from the code.
      2. If isSameRM() can be answered without having to do any work, this should be done as early as possible.
      3. If the connection factory is configured with rmIdFromConnectionId enabled, then, in my understanding, isSameRM() will never return true.

      Possible solutions for each of these aspects:

      1. In order to change the basic pattern of create-connection-then-do-something-then-close-connection that happens inside ActiveMQResourceAdapter.getXAResources(), a big portion of this code would probably have to be rewritten.
      2. In the current code base, the instanceof check (see step 3 above) can be extracted and inlined into the anonymous inner class defined inside ActiveMQResourceAdapter.getXAResources().
      3. I am not sure about the semantic correctness of this, i.e. whether this behavior is intended; but if it is, and all the code I read strengthens this assumption, then this is another candidate for short-circuiting the expensive connection creation: if this config flag is set on the connection factory, return false directly.

      So, basically, solutions 2 and 3 somewhat mitigate the anti-pattern described in 1, and only this aspect is addressed by the pull request linked to this ticket (a rough sketch of the idea follows).
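
      For illustration only, a hedged sketch of the two short-circuits from points 2 and 3. The class, the rmIdFromConnectionIdEnabled flag and the expensiveIsSameRM() hook are illustrative stand-ins for the real adapter internals; this is not the code from the linked pull request.

      import javax.transaction.xa.XAException;
      import javax.transaction.xa.XAResource;
      import org.apache.activemq.TransactionContext;

      abstract class ShortCircuitSketch {

          private final boolean rmIdFromConnectionIdEnabled;

          ShortCircuitSketch(boolean rmIdFromConnectionIdEnabled) {
              this.rmIdFromConnectionIdEnabled = rmIdFromConnectionIdEnabled;
          }

          public boolean isSameRM(XAResource xaResource) throws XAException {
              // Point 2: the type check costs nothing and needs no broker connection.
              if (!(xaResource instanceof TransactionContext)) {
                  return false;
              }
              // Point 3: with rmIdFromConnectionId enabled, every fresh connection reports
              // a unique resource manager id, so the comparison can never yield true.
              if (rmIdFromConnectionIdEnabled) {
                  return false;
              }
              // Otherwise fall back to the existing (connection-creating) comparison.
              return expensiveIsSameRM(xaResource);
          }

          // Stand-in for the current create-connection-then-compare path.
          abstract boolean expensiveIsSameRM(XAResource xaResource) throws XAException;
      }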

            People

              Assignee: Unassigned
              Reporter: Markus (chammp)
              Votes: 0
              Watchers: 3
