Affects Version/s: 5.8.0
Fix Version/s: 5.8.0
Patch Info:Patch Available
I was diagnosing a deadlock issue in DiscoveryNetworkConnector and noticed that during one of the tests, concurrent calls were being made to DiscoveryNetworkConnector.onServiceAdd(...) for the same DiscoveryEvent. This was unexpected because only a single service (URL) had been given to SimpleDiscoveryAgent. In fact, during one of the tests I observed dozens of concurrent calls.
Concurrent attempts to establish a bridge to the same remote broker are problematic because they expose a number of race conditions in DiscoveryNetworkConnector and RegionBroker that can lead to permanent bridge failure (see
AMQ-4160), as well as unnecessary thread pool execution/resource usage and logging.
The issues with DiscoveryNetworkConnector and RegionBroker will be filed as separate issues. This issue specifically addresses the bug that causes SimpleDiscoveryAgent to uncontrollably multiply bridge connection attempts.
When DemandForwardingBridgeSupport handles exceptions from either the local or remote sides of the the bridge, it fires a "bridge failed" event:
DiscoveryNetworkConnector is the NetworkBridgeListener, and its bridgeFailed() method calls back to SimpleDiscoveryAgent.serviceFailed(...):
In response, SimpleDiscoveryAgent.serviceFailed(...) pauses for the reconnectDelay before attempting to re-establish the bridge via DiscoveryNetworkConnector.onServiceAdd(...):
NOTE: the call to listener.onServiceAdd(...) is made by a new thread!
There are two race conditions that allow SimpleDiscoveryAgent.serviceFailed(...) to launch more than one thread, each attempting to re-restablish the same bridge.
First, note that DemandForwardingBridgeSupport.serviceLocal/RemoteException(...) launches a separate thread that stops the bridge:
When the bridge stops, the disposed flag is set, which prevents subsequent calls to serviceLocal/RemoteException(...) from calling fireBridgeFailed(). However, since the call to DemandForwardingBridgeSupport.stop() is made by a separate thread, multiple serviceLocal/RemoteException(...) calls that are made in quick succession can result in multiple calls to fireBridgeFailed().
This is the first race condition: multiple calls can be made to DiscoveryNetworkConnector.bridgeFailed() for the same bridge. By transitivity, this results in multiple calls to SimpleDiscoveryAgent.serviceFailed(...).
SimpleDiscoveryAgent.serviceFailed(...) has a guard class, event.failed.compareAndSet(false, true), which should only allow the first call to launch a bridge reconnect thread. However, once the reconnectDelay expires, event.failed is reset to false, which allows re-entry to the failure handling logic, and the possibile launching of additional bridge reconnect threads if the reconnectDelay is short or the threads calling serviceFailed(...) are delayed.
This is the second race condition: the guard clause in SimpleDiscoveryAgent.serviceFailed(...) can be reset before the subsequent redundant calls have been filtered out.
These two race conditions allow a single call to DiscoveryNetworkConnector.onServiceAdd(...) to result in multiple subsequent concurrent (re)calls, and these concurrent calls can spawn their own multiple concurrent calls. The result can be an unlimited number of concurrent calls to onServiceAdd(...).
The attached unit test demonstrates this bug by simulating a bridge failure that has yet to be detected by the remote broker (i.e., before the InactivityMonitor closes the connection). The local broker attempts to re-establish the bridge, but its call to DemandForwardingBridge.startRemoteBroker() fails because the remote broker rejects the new connection since the old one still exists. Since startRemoteBroker sends multiple messages to the remote broker, multiple exceptions are generated:
The multiple exceptions result in multiple calls to DemandForwardingBridgeSupport.serviceRemoteException(...), which allows the first race condition to be exhibited.
The first unit test has a 1s reconnectDelay, which is sufficient to make the second race condition improbable; therefore, this test generally passes.
The second unit test has a 0s reconnectDelay; on my system, this makes the timing of multiple calls to DemandForwardingBridgeSupport.serviceRemoteException(...) such that the second race condition is reliably exhibited, resulting in the unit test failing because it detects concurrent calls to DiscoveryNetworkConnector.onServiceAdd(...).
While it would be possible to add a failed.compareAndSet(false,true) guard clause to DemandForwardingBridgeSupport.fireBridgeFailed(), and prevent the first race condition from allowing multiple calls to SimpleDiscoveryAgent.serviceFailed(), the root problem is the race condition in serviceFailed. This can be trivially addressed by making a copy of the DiscoveryEvent, which prevents the original event.failed guard clause from being reset: