  1. ActiveMQ Artemis
  2. ARTEMIS-2048

JCA RA does not failover to backup until TCP connect fails


Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.6.2
    • None
    • None
    • Environment: Latest Payara (Full 182)

      JDK 1.8

      Windows machine (same on 7 x64 and 10 x64)

    Description

      In a cluster configuration with HA replication and UDP broadcast discovery, when both master and backup are properly started and the master node's process is then suspended at the OS level (Windows), the Artemis JCA resource adapter does not recognize that the live is stuck and will not fail over to the backup until TCP connections to the master start being refused.

       

      If the cluster connection on the nodes is configured with low enough timeouts, the backup node recognizes the problem in reasonable time and becomes live. The JCA RA, however, will not connect to the new live for several minutes. This is because the calls to

      1094: createConnector()
      
      1096: openTransportConnection(liveConnector)

      in 

      org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.createTransportConnection()

       

      do not return null (which would be the signal to attempt failover), so the attempt to communicate with the stuck master fails only later, at

      911: clientProtocolManager.checkForFailover(liveNodeID)

      in 

      org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.getConnection()

       

      which causes errors when trying to use a connection to the broker (both in explicit usage and in MDBs).
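      The decision described above can be modeled with a minimal sketch. This is not the Artemis code: the class `FailoverDecisionSketch`, the `Connection` type, and the `connect` method are hypothetical names introduced only for illustration. It assumes, as the description states, that the backup is tried only when the live attempt yields null.

```java
import java.util.function.Supplier;

// Simplified, hypothetical model of the null check described in this report.
// None of these names are Artemis APIs.
class FailoverDecisionSketch {

    /** Stand-in for a transport connection; responsiveness is only discovered on use. */
    static final class Connection {
        final String node;
        final boolean responsive;

        Connection(String node, boolean responsive) {
            this.node = node;
            this.responsive = responsive;
        }
    }

    /**
     * Tries the live connector first and falls back to the backup
     * only when the live attempt returns null.
     */
    static Connection connect(Supplier<Connection> live, Supplier<Connection> backup) {
        Connection connection = live.get();
        if (connection != null) {
            return connection; // a stuck but connectable master ends up here
        }
        return backup.get();
    }

    public static void main(String[] args) {
        // Suspended master: the TCP connect succeeds, so a non-null connection comes back.
        Connection stuck = connect(
                () -> new Connection("master", false),
                () -> new Connection("backup", true));
        System.out.println(stuck.node + " responsive=" + stuck.responsive);

        // Refused connect: the live attempt yields null and the backup is used.
        Connection failedOver = connect(
                () -> null,
                () -> new Connection("backup", true));
        System.out.println(failedOver.node + " responsive=" + failedOver.responsive);
    }
}
```

      In this model the first case never reaches the backup, which matches the observed behavior while the suspended master still accepts TCP connections.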

       

      Most of the time, the JCA adapter eventually recognizes that the live is gone, fails over, and everything starts working again.

      Several times, however, with my (other) prototype app, I was able to get the adapter stuck in such a way that, even though the slave (now live) was running just fine, either:

      • failover happened, but somehow not for MDBs: the app could explicitly publish messages (get a new usable connection from the pool), but the MDBs were no longer consuming from queues
      • failover did not happen at all, and neither publishing nor consuming worked anymore

      I don't have reliable reproduction steps for this yet.

      The theory about TCP connections is supported by running telnet against the suspended master's port. For several minutes after the suspend, telnet connects just fine; this changes at exactly the moment the server logs show messages about failing over to the backup.
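      The telnet observation can be approximated in isolation with a small sketch. It assumes that a listening socket whose process never calls accept() behaves like the suspended broker's port, i.e. the OS still completes the TCP handshake from the listen backlog. The class name `SuspendedListenerSketch` and the method `probe` are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical stand-in for a suspended broker: a listener that never accepts.
class SuspendedListenerSketch {

    /**
     * Connects to a listener that never calls accept() and then tries to read.
     * Reports whether the read timed out after a successful connect.
     */
    static String probe(int readTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket listener = new ServerSocket(0, 50, loopback);
             Socket client = new Socket()) {
            // The connect is completed by the OS, not by the listening process.
            client.connect(new InetSocketAddress(loopback, listener.getLocalPort()), 2000);
            client.setSoTimeout(readTimeoutMillis);
            try {
                int firstByte = client.getInputStream().read();
                return "connected, read returned " + firstByte;
            } catch (SocketTimeoutException e) {
                // Nothing ever answers, just like a stuck peer.
                return "connected, read timed out";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probe(500));
    }
}
```

      A successful connect therefore says nothing about whether the peer is alive; only a timeout on actual traffic reveals the stuck process.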

       

      I've prepared a small test app with a REST API that publishes a message to a queue (use the included Swagger UI pages) and an MDB that consumes from the queue.

      At the link below you can find the source code of the app, scripts for creating the master and slave brokers locally, the parts of the broker.xml config files with the required configuration, and the resources required to set up Payara. There is also a patch tracking the changes I made to the Artemis RA and RAR project code to make it run in Payara.

      https://drive.google.com/open?id=11DNBCLKfAwttfibDw0Ckm_mVVhXP2JiR

      (The app needs the "test.input" address and queue created beforehand, since the MDB consumer does not create them automatically. It also sets the log level for "org.apache.activemq.artemis" to ALL.)

            People

              Assignee: Unassigned
              Reporter: Jozef Tomek (jtomek)