[ARTEMIS-2048] JCA RA does not failover to backup until TCP connect fails - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.6.2
Fix Version/s: None
Component/s: None
Labels:
- Failover
- HA
- JCA
- RAR
Environment:

Latest Payara (Full 182)

JDK 1.8

Windows machine (same on 7 x64 and 10 x64)

Description

In a cluster configured with HA replication and UDP broadcast discovery, start both master and backup, then suspend the master's process at the OS level (Windows). The Artemis JCA resource adapter does not recognize that the live broker is stuck and does not fail over to the backup until TCP connections to the master start being refused.

If the cluster connection on the nodes is configured with low enough timeouts, the backup node detects the problem in reasonable time and becomes live. The JCA RA, however, does not connect to the new live broker for several minutes. This is because the calls to

1094: createConnector()

1096: openTransportConnection(liveConnector)

in org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection()

do not return null (which would be the signal to attempt failover), so the attempt to communicate with the stuck master only fails later, at

911: clientProtocolManager.checkForFailover(liveNodeID)

in org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection()

which causes errors when trying to use a connection to the broker (both explicit usage and MDBs).
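To make the reasoning above concrete, here is a minimal sketch of the decision as I understand it. It is a simplified model, not the actual Artemis code: the class name, the connect() helper, and the use of plain strings as stand-ins for transport connections are all hypothetical.

```java
import java.util.function.Supplier;

public class FailoverDecisionSketch {

    /**
     * Simplified model (hypothetical, not Artemis code): try the live
     * connector first and fall back to the backup only when the live
     * attempt yields null.
     */
    static String connect(Supplier<String> live, Supplier<String> backup) {
        String connection = live.get();
        if (connection != null) {
            // A suspended live broker still ends up here, because the OS
            // keeps completing TCP handshakes on its listening port.
            return connection;
        }
        return backup.get();
    }

    public static void main(String[] args) {
        // Live process suspended: the attempt is non-null, backup is skipped.
        System.out.println(connect(() -> "conn-to-stuck-live", () -> "conn-to-backup"));
        // Live port refusing connections: the attempt is null, backup is tried.
        System.out.println(connect(() -> null, () -> "conn-to-backup"));
    }
}
```

In this model the backup is reached only in the second case, which matches the observation that failover starts once connections to the master are refused.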

Most of the time, the JCA adapter eventually recognizes that the live broker is gone, fails over, and everything starts working again.

Several times, however, with my (other) prototype app, I managed to get the adapter stuck so that, even though the slave (now live) was running just fine, either:

- failover happened, but somehow not for MDBs: the app could explicitly publish messages (getting a new usable connection from the pool), but MDBs no longer consumed from the queues
- failover did not happen at all, and neither publishing nor consuming worked anymore

I don't have reliable reproduction steps for this yet, however.

The theory about TCP connections is supported by telnetting to the suspended master's port. For several minutes after the suspend, telnet still connects just fine; this changes at exactly the moment the server logs show the failover to the backup.
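The telnet observation can be reproduced without a broker. The sketch below is an illustration under assumptions, not part of the test app: a ServerSocket that never calls accept() stands in for the suspended master, and the class and method names are made up for this example. The OS completes the TCP handshake on behalf of the listening socket, so connecting succeeds even though no application code ever services the connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SuspendedListenerDemo {

    /** True if a TCP connection to addr:port is established within timeoutMs. */
    static boolean canConnect(InetAddress addr, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** True if the peer sends a byte or closes the stream within timeoutMs. */
    static boolean answersWithin(InetAddress addr, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr, port), timeoutMs);
            s.setSoTimeout(timeoutMs);
            s.getInputStream().read(); // blocks until data, EOF, or timeout
            return true;
        } catch (SocketTimeoutException e) {
            return false; // connected, but nobody is servicing the socket
        } catch (IOException e) {
            return false;
        }
    }

    /** Runs both checks against a listener that never accepts. */
    static boolean[] demo() {
        // Stand-in for the suspended master: bound and listening, never accepting.
        try (ServerSocket stuck = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetAddress addr = stuck.getInetAddress();
            int port = stuck.getLocalPort();
            return new boolean[] {
                canConnect(addr, port, 500),
                answersWithin(addr, port, 500)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("connect succeeded: " + r[0]);
        System.out.println("peer answered:     " + r[1]);
    }
}
```

Note that the read check is not a valid liveness test for a real broker, which may also stay silent until the client speaks first. It only shows that a successful connect says nothing about whether the process behind the port is running, so detection has to come from an application-level exchange with a timeout.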

I've prepared a small test app with a REST API for publishing a message to a queue (use the included Swagger UI pages) and an MDB consuming from that queue.

The link below contains the source code of the app, scripts for creating the master and slave brokers locally, the relevant parts of the broker.xml config files, and the resources required to set up Payara. It also includes a patch with the changes I made to the Artemis RA and RAR projects' code to make them run in Payara:

https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR

(The app needs the "test.input" address and queue created beforehand, since the MDB consumer does not create them automatically. It also sets the log level for "org.apache.activemq.artemis" to ALL.)

Issue Links

duplicates

ARTEMIS-2084 Failover won't happen when net cable is disconnected unless Netty Timeout is specified

Closed

People

Assignee: Unassigned

Reporter: Jozef Tomek

Votes: 0

Watchers: 2

Dates

Created: 22/Aug/18 09:34

Updated: 03/Jan/20 23:42