[AMQ-7455] Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 5.15.11, 5.15.12
Fix Version/s: None
Component/s: RAR
Labels:
- newbie
- performance

Description

Scenario: deploy ActiveMQ RA to EAP6.4 with XA enabled DataSources for JDBC and ActiveMQ Artemis as JMS broker in a failover config (active/passive) and run several applications that use JMS and JDBC datasources. (In our scenario, there are 4 XA JDBC DataSources and 42 deployed WAR files). ActiveMQ resource adapter is used as the Artemis one is not compatible to this EAP version.

JBoss EAP and Artemis are running on the same host in the examples below, but this does not matter. Artemis is running with example HA-by-network-replication settings on localhost:61616 and localhost:61617.

The resource adapter settings on JBoss EAP side and on client side (especially properties of the broker URIs for ActiveMQ's connection factory) are exactly as described in Red Hat's reference documentation for HA config:

Add ?jms.rmIdFromConnectionId=true to the connection URI
<is-same-rm-override>false</is-same-rm-override> in standalone.xml

Observed Symptoms: On a regular basis, undeployment of WAR files take minutes (around 10) instead of seconds, blocking other container management operations during this timeframe. This behavior can be traced back to running ActiveMQ RA and Artemis as JMS broker. If running the same WAR files and same config, but using JBoss EAP's internal HornetQ broker, these delays do not occur and undeployments take mere seconds.

Further Investigation: This only affects JavaEE modules that define at least one MDB (message driven bean). During shutdown of the beans, the container management thread is blocked while de-registering the bean's XA Resource, see following stack trace taken while a simple module is being undeployed for reference:

"ServerService Thread Pool -- 277" #3709 prio=5 os_prio=0 tid=0x000000001d6e0000 nid=0x4df4 in Object.wait() [0x000000007c41e000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.waitForScanState(XARecoveryModule.java:935)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.removeXAResourceRecoveryHelper(XARecoveryModule.java:92)
	- locked <0x000000008162bd38> (a java.util.concurrent.atomic.AtomicInteger)
	at com.arjuna.ats.jbossatx.jta.RecoveryManagerService.removeXAResourceRecovery(RecoveryManagerService.java:117)
	at org.jboss.jca.core.tx.jbossts.XAResourceRecoveryRegistryImpl.removeXAResourceRecovery(XAResourceRecoveryRegistryImpl.java:63)
	at org.jboss.jca.core.rar.EndpointImpl.deactivate(EndpointImpl.java:226)
	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.deactivate(MessageDrivenComponent.java:224)
	at org.jboss.as.ejb3.component.messagedriven.MessageDrivenComponent.stop(MessageDrivenComponent.java:195)
// and a couple of more elements

During these 10 minutes of blocking while the module is supposed to be undeployed, EAP's log is literally spammed with thousands of these lines:

11:19:43,640 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,643 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,647 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,657 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,661 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616
11:19:43,664 INFO  [org.apache.activemq.transport.failover.FailoverTransport] (ActiveMQ Task-1) Successfully connected to tcp://localhost:61616

Only after these lines stop to be logged the undeployment process continues.

What happens during this time in ActiveMQ: Obviously there are thousands of connections established to Artemis during this time. Tracing back the code path that leads to these calls points to EAP's Periodic XA Resource Recovery thread, see the following full stack trace (taken at the same point in time):

"Periodic Recovery" #96 prio=5 os_prio=0 tid=0x000000002484a000 nid=0x1b28 waiting for monitor entry [0x0000000028dbe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.apache.activemq.transport.failover.FailoverTransport.oneway(FailoverTransport.java:623)
	- locked <0x00000000cc69eeb8> (a java.lang.Object)
	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
	at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
	at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
	at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1392)
	at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1486)
	- locked <0x00000000cc69ff10> (a java.lang.Object)
	at org.apache.activemq.ActiveMQConnection.start(ActiveMQConnection.java:527)
	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.newConnection(ActiveMQResourceAdapter.java:361)
	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.getResourceManagerId(ActiveMQResourceAdapter.java:278)
	at org.apache.activemq.TransactionContext.isSameRM(TransactionContext.java:666)
	at org.apache.activemq.ra.ActiveMQResourceAdapter$2.isSameRM(ActiveMQResourceAdapter.java:261)
	at org.jboss.jca.core.tx.jbossts.XAResourceWrapperImpl.isSameRM(XAResourceWrapperImpl.java:159)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.isSameRM(RecoveryXids.java:143)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.RecoveryXids.updateIfEquivalentRM(RecoveryXids.java:215)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.refreshXidScansForEquivalentXAResourceImpl(XARecoveryModule.java:796)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass(XARecoveryModule.java:570)
	at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass(XARecoveryModule.java:181)
	- locked <0x000000008162ba98> (a com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule)
	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:747)
	at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:375)

From this stack trace I want to point out especially this line: org.apache.activemq.TransactionContext.isSameRM()

This method is part of the XA Resource Contract. Simply put the periodic recovery process iterates over XA Resources and looks for dormant XA transactions that need to be finished in some way or the other. Here it checks different combinations of XA Resources to asses whether they came from the same Resource Manager (.isSameRM()).
During this recovery process (which in the stack trace above is still in the phase of gathering basic information and not actually working on XA transactions) the deregistration of the message driven beans is blocked.

Analysis for the super long duration of recovery process: The XAResource instances that periodic recovery works on are created by org.apache.activemq.ra.ActiveMQResourceAdapter.getXAResources(ActivationSpec[]). The anonymous class that extends org.apache.activemq.TransactionContext and that is instantiated here does basically the following steps for each invocation of any of the XA Resource contract methods:

Replace the current connection to Artemis with a freshly created one
Call super
Restore the current connection and close the new one

Specifically for isSameRM this results in the following chain of events:

Create a pristine connection to remote Artemis (in our case a FailoverTransport)
Call super.isSameRM(xaresource)
If xaresource is null or is not an instance of TransactionContext, return false (org/apache/activemq/TransactionContext.java:658)
Get the current resource manager id and xaresource's resource manager id. This effectively calls this.getResourceManagerId(). As noted above, this leads to the following calls:
Replace the current connection to Artemis (established in step 1) with a second pristine one
Call super.getResourceManagerId()
Call getResourceManagerId() on the connection of step 5
In org/apache/activemq/ActiveMQConnection.java:1622, the resource manager id effectively is the Connection Id in our setup (according to the flag set in the connection URI).
This id is unique for each connection (see org/apache/activemq/ActiveMQConnection.java:240)
Compare the resulting value with the resource manager id of xaresource and return the result
Close the connection to the JMS broker established in step 5 and set the one established in step 1 as "current" connection
Close the connection established in step 1 and reset current connection to whatever was there before (in my understanding that is always null as the instances are always being freshly created inside ActiveMQResourceAdapter.getXAResources)

This is repeated for each pair of XA Resources, or rather Activation Specs, that the recovery manager encounters. The log spamming stems from steps 1 and 5 because each time a INFO message is logged tht tells us that a new connection has been established to the server.

This also happens when the recovery process wants to know if a JDBC connection to our SQL server (which we have plenty of due to several modules with a connection pooling data source each) is from the same resource manager as a JMS connection to Artemis - which it obviously never is.

Problematic Aspects of this code:

It doesn't make sense to establish pristine connections to the remote broker for each invocation of one of XAResource's methods. The connection setup is expensive with a plain non-failover URI as it is, but even more so with a failover transport - this has been supported by tests and also the code shows this.
If there is the option to find an answer to isSameRM() without having to do any work, this should be done as early as possible.
If the connection factory is configured with rmIdFromConnectionId enabled, then isSameRM() in my understanding will never return true.

Possible solutions for each of these aspects:

In order to change the basic pattern of create-connection-then-do-something-then-close-connection that happens inside ActiveMQResourceAdapter.getXAResources(), a big portion of this code would propably have to be re-written.
In the current code base, the instanceof check (see step 3 above) can be extraced and inlined into the anonymous inner class defined inside ActiveMQResourceAdapter.getXAResources()
I don't know about the semantic correctness of this, i.e. if this behavior is desired; but if it is, and all the code I read strengthens this assumption, then in this case this is another candidate for short-circuting the expensive connection creation: if this config flag is set on the connection factory, directly return false.

So, basically, the solutions 2 and 3 somewhat mitigate the antipattern described in 1. And only this aspect is addressed by the pull request that is linked to this ticket.

Attachments

Issue Links

is fixed by

AMQ-5790 Huge number of TIME_WAIT connections observed while using activemq resource adapter with EAP6

Resolved

links to

GitHub Pull Request #519

Blocking XA Periodic Recovery and Container Operations when deploying ActiveMQ RA 5.15.x to JBoss EAP 6.4

Details

Description

Attachments

Issue Links

Activity

People

Dates

Time Tracking