Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
None
-
Patch Available
Description
Symptom
=======
DiscoveryNetworkConnector is not thread-safe, and as a result, race conditions allow the bridges data structure to become corrupt and not accurately represent the bridges that exist.
Because bridges is used to control whether a discovery event will result in a bridge creation attempt, if it is not accurate, two results are possible:
- A discovery event will be ignored even though its corresponding bridge is not active; as a result, the bridge will never be established
- A discovery event will be processed even though its corresponding bridge is active; as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely retried, generating many WARN/ERROR log messages
Cause
=====
DiscoveryNetworkConnector updates a bridges hashmap whenever a bridge is created or removed:
public void onServiceAdd(DiscoveryEvent event) { ... // Should we try to connect to that URI? synchronized (bridges) { if( bridges.containsKey(uri) ) { if (LOG.isDebugEnabled()) { LOG.debug("Discovery agent generated a duplicate onServiceAdd event for: "+uri ); } return; } } ... NetworkBridge bridge = createBridge(localTransport, remoteTransport, event); try { bridge.start(); synchronized (bridges) { bridges.put(uri, bridge); } ... } public void onServiceRemove(DiscoveryEvent event) { String url = event.getServiceName(); if (url != null) { URI uri; try { uri = new URI(url); } catch (URISyntaxException e) { LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax: " + e, e); return; } synchronized (bridges) { bridges.remove(uri); } } }
There are a number of problems:
- The check-and-set for adding an entry bridges is not atomic. As a result, concurrent attempts to add a bridge to the same URL can be allowed to proceed to bridge creation. Only one of the bridges will be established (the other will fail when it attempts to add duplicate connections); however, the failure of the second bridge will result in a call to onServiceRemove(...) that will remove the single/shared bridges entry.
- The bridge is started before an entry is added to bridges. Since bridges are multi-threaded, it is possible that an exception will be handled by a different thread at some time between bridge.start() and bridges.put(uri, bridge. In processing this exception, the thread will call onServiceRemove(..., which will remove the (non-existent) bridges entry. Subseqently, bridges.put(uri, bridge) is executed, and adds the entry to bridges even though it is now stale and represents a failed bridge. Subsequent attempts to restore the bridge will be ignored, and no active bridge will be created.
The lack of thread-safety in DiscoveryNetworkConnector is exacerbated by AMQ-4159, which can result in many concurrent attempts to establish a bridge to the same URL. AMQ-4159 also described how multiple calls can be made to onServiceRemove(...), which can result in mal-behaviour similar to the second case described above (i.e., unexpected removal of a bridge that is active).
Solution
========
The attached patch attempts to restore some thread-safety to DiscoveryNetworkConnector by making the check-and-set atomic and adding the entry to bridges prior to starting the bridge. An additional structure, activeEvents, keeps track of the event that represents the current attempt to establish a bridge to a given remote URL — this is used to prevent multiple calls to onServiceRemove(...) from removing a bridges entry that was not added by the corresponding discovery event.
This patch is a band aid to the design flaws in DiscoveryNetworkConnector, and a refactoring of the connector should be considered. In particular, there is a tight-coupling between DiscoveryNetworkConnector and SimpleDiscoveryAgent that is not evident through their interfaces. For example, DiscoveryNetworkConnector assumes that the call to discoveryAgent.serviceFailed(...) will result in a call back to DiscoveryNetworkConnector.onServiceRemove(...). The call to onServiceRemove(...) is necessary to clean up the resources used by the bridge, but that requirement is not explicit, and therefore easily missed.