Consumers frequently fail to reconnect after broker outage/failover.



    • brokers 5.4.2, 5.6-snapshot and 5.5.1-fuse-03-06
      Apache.NMS.ActiveMQ v1.5.3
      Windows7 64bit, .Net 4
      Java 64bit 1.6.0_31, -server VM


      When using the failover transport we frequently see consumers fail to reconnect after a failover between brokers or after a broker outage. We can easily reproduce this behaviour in a test environment.

      The failure seems more pronounced when connecting to a remote broker. We have tried to reproduce it with the producers and consumers running on the same host as the broker, but without success.

      Processes/connections that work purely as producers don't experience the same problem. In our tests, repeated failovers can eventually cause all consumers to disconnect. The failure doesn't occur unless the broker is under load, i.e. producers must be active for the failure to occur.

      The NMS client's connection threads and failover threads appear to have died/ended after a consumer fails to failover.

      Broker config is attached.

      Client connections use AsyncSend = true and WatchTopicAdvisories = false (to avoid AMQNET-371). We've also tested with and without DispatchAsync = true.
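
      For reference, a minimal sketch of a consumer set up with the settings above. It is not the code from the test project: the broker host names, the queue name TEST.QUEUE and the class name are placeholders chosen for illustration, and it assumes the standard Apache.NMS / Apache.NMS.ActiveMQ assemblies are referenced. The interrupted/resumed listeners are only there to make it visible on the console whether the failover transport reports a reconnect.

```csharp
using System;
using Apache.NMS;
using Apache.NMS.ActiveMQ;

// Illustrative only: class name, host names and queue name are placeholders.
class FailoverConsumerSketch
{
    static void Main()
    {
        // Failover transport over two brokers (hypothetical addresses).
        const string brokerUri =
            "failover:(tcp://broker1.example:61616,tcp://broker2.example:61616)";

        var factory = new ConnectionFactory(brokerUri)
        {
            AsyncSend = true,             // as used in the report
            WatchTopicAdvisories = false, // avoids AMQNET-371
            DispatchAsync = true          // tested both with and without
        };

        using (IConnection connection = factory.CreateConnection())
        {
            // Log transport state changes so a missed reconnect is visible.
            connection.ConnectionInterruptedListener +=
                () => Console.WriteLine("{0:o} connection interrupted", DateTime.UtcNow);
            connection.ConnectionResumedListener +=
                () => Console.WriteLine("{0:o} connection resumed", DateTime.UtcNow);
            connection.ExceptionListener +=
                ex => Console.WriteLine("{0:o} connection error: {1}", DateTime.UtcNow, ex.Message);

            connection.Start();

            using (ISession session = connection.CreateSession(AcknowledgementMode.AutoAcknowledge))
            using (IMessageConsumer consumer = session.CreateConsumer(session.GetQueue("TEST.QUEUE")))
            {
                consumer.Listener +=
                    msg => Console.WriteLine("received {0}", msg.NMSMessageId);

                Console.WriteLine("Consuming; press Enter to exit.");
                Console.ReadLine();
            }
        }
    }
}
```

      With a consumer like this, an "interrupted" line that is never followed by a "resumed" line after the broker comes back would correspond to the failure described in this report.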

      Error messages: nothing obvious appears in the NMS trace or the broker logs, although while running this test we have occasionally seen the messages detailed in AMQNET-370.

      Test code available at https://github.com/chillitom/NmsFailoverTest

      To run:

      • compile with VS2010
      • edit App.config in the NmsFailoverTest project so that Broker1Address points to a valid broker address on a different box
      • configure Broker2Address to point to a second broker or to a non-existent broker
      • run the NmsFailoverTest project
      • producer stats will appear on the console, and periodically the broker will be queried for the number of consumers it sees; this count is also printed to the console (assumes a broker with the <statisticsBrokerPlugin/> plugin enabled)
      • repeatedly stop and start Broker1 and watch the number of consumers decline after some failovers


