There is a bug in FailoverTransport that is triggered by a race condition. The client log contains the message:
WARN | ActiveMQ Transport: URI1 [FailoverTransport] Transport (URI2) failed, attempting to automatically reconnect
The exact impact on client failover differs with each setup and environment. In our case it forced the client to switch endlessly between the two available brokers.
Assume the client is configured to use a broker URL in the form
Assume that the broker at URI1 is down and the other broker, at URI2, is running fine. This is a normal master/slave setup.
The client tries to establish a connection and the following happens:
1. URI1 is tried; it fails because this broker is not reachable (down or a waiting slave)
2. URI2 is tried; it succeeds because this broker is currently the 'master'
3. An exception from the thread of the transport to URI1 causes a failure in the transport to URI2
4. The next transport in the list is tried. Oh wait, it's URI1 -> go to 1.
The impact for other configurations might not be that severe. Unfortunately, in our case we were not able to avoid this bug no matter the configuration. For example, randomize=true helped a little, but the infinite loop still happens half of the time.
The bug is caused by a single shared TransportListener instance, myTransportListener, in the FailoverTransport class. doReconnect() tries to start the transport to URI1 and registers the listener on it. The transport fails to start and the next transport, to URI2, is tried, but the listener is never unregistered from the failed URI1 transport. A failure on the URI1 transport can then call the listener method onException() from that transport's own thread. The call reaches handleTransportFailure(), where it blocks waiting for reconnectMutex. Meanwhile the reconnect task thread continues: it establishes the URI2 transport, sets connectedTransport=URI2, and releases reconnectMutex. The URI1 transport thread then unblocks in handleTransportFailure() and destroys connectedTransport, which by now is URI2.
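The sequence above can be replayed deterministically in a small sketch. This is not the ActiveMQ source: SharedListenerDemo, FakeTransport and replay() are hypothetical stand-ins, and the thread interleaving is replaced by plain sequential calls in the order described, so only the effect of the shared listener is shown.

```java
// Hypothetical stand-ins for illustration; these are not the real ActiveMQ classes.
final class SharedListenerDemo {

    /** Minimal transport: remembers its URI and the listener registered on it. */
    static final class FakeTransport {
        final String uri;
        Runnable onException = () -> {};
        FakeTransport(String uri) { this.uri = uri; }
    }

    private final Object reconnectMutex = new Object();
    FakeTransport connectedTransport;

    // One listener instance shared by every transport, as in the design described above.
    private final Runnable sharedListener = this::handleTransportFailure;

    void handleTransportFailure() {
        synchronized (reconnectMutex) {
            // The listener cannot tell which transport failed,
            // so whatever is currently connected gets dropped.
            connectedTransport = null;
        }
    }

    /** Replays the failing sequence and returns the connected transport at the end. */
    static FakeTransport replay() {
        SharedListenerDemo failover = new SharedListenerDemo();
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");

        // URI1: listener registered, start fails, listener is never unregistered.
        uri1.onException = failover.sharedListener;

        // URI2: listener registered, start succeeds, reconnect task publishes it.
        uri2.onException = failover.sharedListener;
        synchronized (failover.reconnectMutex) {
            failover.connectedTransport = uri2;
        }

        // The late failure from the URI1 transport now gets through.
        uri1.onException.run();
        return failover.connectedTransport;
    }
}
```

In this sketch replay() ends with no connected transport even though the URI2 transport never failed, which corresponds to step 3 of the list above.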
I have created a patch against version 5.11 that deals specifically with this problem.
The change is that, instead of the single shared myTransportListener instance, a new listener is created for each new transport.
Each new listener keeps a reference to the transport it was assigned to. The listener will cause failover only if the exception comes from the transport which is currently connected.
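The idea can be sketched roughly as follows. This is not the attached patch: PerTransportListenerDemo, FakeTransport and createListener() are hypothetical names, and only the onException() path is modelled.

```java
// Hypothetical stand-ins for illustration; these are not the real ActiveMQ classes.
final class PerTransportListenerDemo {

    /** Minimal transport: remembers its URI and the listener registered on it. */
    static final class FakeTransport {
        final String uri;
        Runnable onException = () -> {};
        FakeTransport(String uri) { this.uri = uri; }
    }

    private final Object reconnectMutex = new Object();
    FakeTransport connectedTransport;

    /** Creates a listener that remembers the transport it was assigned to. */
    Runnable createListener(FakeTransport owner) {
        return () -> handleTransportFailure(owner);
    }

    void handleTransportFailure(FakeTransport failed) {
        synchronized (reconnectMutex) {
            if (connectedTransport != failed) {
                // Stale failure from a transport that is not the connected one: ignore it.
                return;
            }
            // Only a failure of the connected transport triggers failover.
            connectedTransport = null;
        }
    }
}
```

A short usage example, replaying the same sequence as in the bug description:

```java
final class PerTransportListenerUsage {
    public static void main(String[] args) {
        PerTransportListenerDemo failover = new PerTransportListenerDemo();
        PerTransportListenerDemo.FakeTransport uri1 = new PerTransportListenerDemo.FakeTransport("URI1");
        PerTransportListenerDemo.FakeTransport uri2 = new PerTransportListenerDemo.FakeTransport("URI2");
        uri1.onException = failover.createListener(uri1);
        uri2.onException = failover.createListener(uri2);

        failover.connectedTransport = uri2; // reconnect task connected to URI2
        uri1.onException.run();             // late failure from URI1 is ignored
        System.out.println(failover.connectedTransport.uri);
    }
}
```

The reference comparison is made while holding the mutex, so in this sketch a late failure from the URI1 transport cannot affect the connection to URI2.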
I did not change the other methods of the listener, but these probably need the same restriction.
This bug is present in all versions from version 4.0 onward (I did not check earlier ones). The idea in the patch should be applicable to all versions.
Btw., the log message mentioned in AMQ-4986 contains the same URI1 vs URI2 problem.