ActiveMQ
  1. ActiveMQ
  2. AMQ-2541

Extremely slow broker startup when using SimpleDiscoveryAgent with an inactive Network of Brokers.

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 5.3.0
    • Fix Version/s: 5.x
    • Component/s: Broker, Connector
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Symptom
      ========
      An AMQ broker that is configured to join a statically-defined (i.e., using uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) ) network of brokers can have an extremely long startup time (in the order of 5+ minutes), if many of the brokers in the network are not alive.

      The following log entires show the startup of an AMQ Broker (http://192.168.170.112:50000) that is configured to join a network with three other brokers:
      http://10.10.60.78:50000
      http://10.9.62.135:50000
      http://10.10.60.75:50000)

      The three other have not yet started.

      The log file shows that it takes nearly 4 minutes from BrokerService#start() to return control to the calling thread (AlarmPoint Node-main):

      2009-12-18 15:24:46,783 [AlarmPoint Node-main] INFO - - ActiveMQ 5.3.0 JMS Message Broker (localhost) is starting
      ...
      2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO - - Connector http://192.168.170.112:50000 Started
      2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO - - Establishing network connection from vm://localhost to http://10.10.60.78:50000
      ...
      2009-12-18 15:26:11,314 [AlarmPoint Node-main] WARN - - Could not start network bridge between: vm://localhost and: http://10.10.60.78:50000 due to: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:26:11,314 [AlarmPoint Node-main] DEBUG - - Start failure exception: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:26:11,314 [AlarmPoint Node-main] INFO - - Establishing network connection from vm://localhost to http://10.9.62.135:50000
      ...
      2009-12-18 15:27:35,299 [AlarmPoint Node-main] WARN - - Could not start network bridge between: vm://localhost and: http://10.9.62.135:50000 due to: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:27:35,299 [AlarmPoint Node-main] DEBUG - - Start failure exception: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:27:35,299 [AlarmPoint Node-main] INFO - - Establishing network connection from vm://localhost to http://10.10.60.75:50000
      ...
      2009-12-18 15:28:59,314 [AlarmPoint Node-main] WARN - - Could not start network bridge between: vm://localhost and: http://10.10.60.75:50000 due to: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:28:59,314 [AlarmPoint Node-main] DEBUG - - Start failure exception: java.net.ConnectException: Connection timed out: connect
      2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO - - Network Connector bridge Started
      2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO - - ActiveMQ JMS Message Broker (localhost, ID:vic-esx4-ns1-1280-1261178686846-0:0) started

      Cause
      =====
      The broker's network connector is implemented by org.apache.activemq.network.DiscoveryNetworkConnector, which in turn uses org.apache.activemq.transport.discovery.simple.SimpleDiscoveryAgent to determine whether the URLs configured in uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) " are active. SimpleDiscoveryAgent#start() has this loop:

      public void start() throws Exception {
      running.set(true);
      for (int i = 0; i < services.length; i++)

      { listener.onServiceAdd(new SimpleDiscoveryEvent(services[i])); }

      }

      "listener.onServiceAdd(...) " is called for each URL and is implemented by DiscoveryNetworkConnector#onServiceAdd(...). The main thread calls BrokerService#start() which calls DiscoveryNetworkConnector#start() which calls SimpleDiscoveryAgent#start(), which sequentially calls DiscoveryNetworkConnector#onServiceAdd(...). Since the URLs being "discovered" are inactive, DiscoveryNetworkConnector#onServiceAdd(...) blocks ~1m30s (this will depend on network configuration) for each URL. This blocks the main thread that is trying to start the broker. If there are several inactive URLs, then the blocking time becomes excessive.

      Solution
      =======
      If you follow through the DiscoveryNetworkConnector#onServiceAdd(...) method, it eventially calls SimpleDiscoveryAgent#serviceFailed(...) for each inactive URL. In turn SimpleDiscoveryAgent#serviceFailed(...) launches an asynchronous task that pauses for the configured reconnect delay, and then retries the call to DiscoveryNetworkConnector#onServiceAdd(...). So it must be safe to call DiscoveryNetworkConnector#onServiceAdd(...) concurrently. Therefore, SimpleDiscoveryAgent#start()'s loop should be changed to launch asynchronous tasks to make the DiscoveryNetworkConnector#onServiceAdd(...) calls concurrently rather than synchronously.

      This solution has the benefit of returning control immediately to the caller of SimpleDiscoveryAgent#start(...), thus starting the broker faster, and allows the network discovery to find "active" URLs much faster (i.e., with the sequential loop, if the "active" URL is the last one, its discovery is significantly delayed).

      1. AMQ2541.patch
        1 kB
        Stirling Chow
      2. AMQ2541Test.java
        4 kB
        Stirling Chow

        Activity

        Stirling Chow created issue -
        Hide
        Stirling Chow added a comment -

        Patch to make SimpleDiscoveryAgent#start(...) asynchronous.

        Show
        Stirling Chow added a comment - Patch to make SimpleDiscoveryAgent#start(...) asynchronous.
        Stirling Chow made changes -
        Field Original Value New Value
        Attachment patch.txt [ 18740 ]
        Hide
        Rob Davies added a comment -

        Hi Stirling,

        thanks for the patch - it doesn't look like its from trunk though (so finding it hard to see what you've changed) - could you get src from trunk and make the patch again ?

        thanks,.

        Rob

        Show
        Rob Davies added a comment - Hi Stirling, thanks for the patch - it doesn't look like its from trunk though (so finding it hard to see what you've changed) - could you get src from trunk and make the patch again ? thanks,. Rob
        Rob Davies made changes -
        Fix Version/s 5.4.1 [ 12332 ]
        Bruce Snyder made changes -
        Fix Version/s 5.4.1 [ 12332 ]
        Fix Version/s 5.5.0 [ 12344 ]
        Jeff Turner made changes -
        Project Import Fri Nov 26 22:32:02 EST 2010 [ 1290828722158 ]
        Dejan Bosanac made changes -
        Fix Version/s 5.5.0 [ 12315626 ]
        Fix Version/s 5.4.2 [ 12315625 ]
        Gary Tully made changes -
        Fix Version/s 5.6.0 [ 12316331 ]
        Fix Version/s 5.5.0 [ 12315626 ]
        Stirling Chow made changes -
        Attachment patch.txt [ 12461486 ]
        Stirling Chow made changes -
        Attachment AMQ2541.patch [ 12553417 ]
        Attachment AMQ2541Test.java [ 12553418 ]
        Hide
        Stirling Chow added a comment -

        Hi Rob,

        Sorry for the delay! I missed your comment, and we just ran into this issue again when we upgraded to AMQ 5.6.0. I've attached an updated patch based on the AMQ trunk. The patch uses the SimpleDiscoveryAgent's taskrunner to launch concurrent bridge creation attempts at startup. The unit test verifies that startup is concurrent. I also had to change EmbeddedBrokerTestSupport so that it properly initialized the auto-fail thread.

        Show
        Stirling Chow added a comment - Hi Rob, Sorry for the delay! I missed your comment, and we just ran into this issue again when we upgraded to AMQ 5.6.0. I've attached an updated patch based on the AMQ trunk. The patch uses the SimpleDiscoveryAgent's taskrunner to launch concurrent bridge creation attempts at startup. The unit test verifies that startup is concurrent. I also had to change EmbeddedBrokerTestSupport so that it properly initialized the auto-fail thread.
        Hide
        Gary Tully added a comment -

        how does this relate to https://issues.apache.org/jira/browse/AMQ-3195 which provides for async start up of network connectors. Are the two approaches competing or necessary?

        Show
        Gary Tully added a comment - how does this relate to https://issues.apache.org/jira/browse/AMQ-3195 which provides for async start up of network connectors. Are the two approaches competing or necessary?
        Hide
        Stirling Chow added a comment -

        This issue is complimentary to AMQ-3195. AMQ-3195 addressed the case where multiple network connectors started synchronously. Within a single network connector, the bridges were started synchronously. AMQ-2541 addresses the behaviour within a single network connector so that the bridges are started asynchronously. The rationale for AMQ-2541 is the same as AMQ-3195 — sometimes establishing a bridge can be very slow and holds up subsequent bridge formation.

        In our use case, we only have a single network connector, but it is configured with multiple URLs. Since this is a single network connector, AMQ-3195 does not help since the multiple URLs within the network connector are attempted synchronously. AMQ-2541 adds asynchronous connection establishment within the network connector.

        Show
        Stirling Chow added a comment - This issue is complimentary to AMQ-3195 . AMQ-3195 addressed the case where multiple network connectors started synchronously. Within a single network connector, the bridges were started synchronously. AMQ-2541 addresses the behaviour within a single network connector so that the bridges are started asynchronously. The rationale for AMQ-2541 is the same as AMQ-3195 — sometimes establishing a bridge can be very slow and holds up subsequent bridge formation. In our use case, we only have a single network connector, but it is configured with multiple URLs. Since this is a single network connector, AMQ-3195 does not help since the multiple URLs within the network connector are attempted synchronously. AMQ-2541 adds asynchronous connection establishment within the network connector.
        Hide
        nvalecha@cisco.com added a comment - - edited

        We are getting the same problem due to this defect.

        The primary node takes time in startup when other nodes are down.

        Please let me know if it is fixed or are you planning to fix it.

        Show
        nvalecha@cisco.com added a comment - - edited We are getting the same problem due to this defect. The primary node takes time in startup when other nodes are down. Please let me know if it is fixed or are you planning to fix it.
        nvalecha@cisco.com made changes -
        Comment [ We are getting the same problem due to this defect.

        Is it fixed or are you planning to fix it.

        ]
        Hide
        Gary Tully added a comment -

        with the fix in https://issues.apache.org/jira/browse/AMQ-3195 does the workaround, of using a configured network connector per url rather than static(a,b,c) work for you?
        This issue is not on the radar for 5.10 at the moment.

        Show
        Gary Tully added a comment - with the fix in https://issues.apache.org/jira/browse/AMQ-3195 does the workaround, of using a configured network connector per url rather than static(a,b,c) work for you? This issue is not on the radar for 5.10 at the moment.
        Hide
        nvalecha@cisco.com added a comment -

        Using networkConnectorStartAsync does not solves our problem.

        Show
        nvalecha@cisco.com added a comment - Using networkConnectorStartAsync does not solves our problem.

          People

          • Assignee:
            Unassigned
            Reporter:
            Stirling Chow
          • Votes:
            2 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development