ActiveMQ Classic / AMQ-6092

Clear Broker to Broker Connection Info At Startup

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 5.12.0
    • Fix Version/s: None
    • Component/s: LevelDB
    • Labels: None
    • Environment: Linux

    Description

      This is a very difficult bug to describe, and an even tougher bug to replicate, so I guess I'll start by describing the circumstances that triggered this bug.

      At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster. There are store and forward connections between each pair of data centers. Phoenix has non-duplexed connections to Amsterdam and Ashburn, and in turn each of those sites has connections to the others. This makes a mesh-type topology. Within a single data center, I have 3 copies of each broker using the replicated LevelDB feature in a kind of active/passive/passive configuration.
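
      To make the mesh concrete, here is a minimal Python sketch that enumerates the one-way bridges implied by the description above. The lower-case site labels and the function name are illustrative assumptions; this is not taken from the attached configurations.

```python
from itertools import permutations

# Illustrative labels for the three data centers named in the report.
SITES = ["phoenix", "amsterdam", "ashburn"]


def mesh_bridges(sites):
    """Return every one-way (source, destination) bridge in a full mesh.

    Non-duplex connections carry traffic in one direction only, so each
    pair of sites needs two bridges, one per direction.
    """
    return list(permutations(sites, 2))


bridges = mesh_bridges(SITES)
for src, dst in bridges:
    print(f"{src} -> {dst}")
```

      With three sites this gives six one-way bridges, all of which have to come back up after a restart.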

      This is just a PoC setup, sitting on VMware infrastructure, and it sat idle for quite some time. At some point, while it was sitting idle, we had storage maintenance, which caused a storage disconnect in Ashburn and Amsterdam. A storage disconnect is akin to just pulling the disk out of the box. Needless to say, AMQ didn't like this one bit. However, surviving a storage disconnect isn't really the point of the bug. The bug came into play when I tried restarting the cluster after storage was restored.

      I restarted each of the VMs, and began to bring the ActiveMQ instances back online, starting ZooKeeper, then starting ActiveMQ. After I brought each replicated LevelDB group back up, the brokers refused to reconnect to each other via the store & forward connections. I kept getting this error:

      Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, localBroker= vm://ams1-1#58408
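
      When this message repeats many times across several brokers, it can help to pull out the broker name, the rejected client ID, and the address that supposedly already holds it. The following is a rough Python sketch; the regular expression is written against the one log line quoted above and is an assumption about the message shape, not a documented format.

```python
import re
from typing import Optional

# The error line quoted in this report, copied verbatim.
LOG_LINE = (
    "Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 "
    "due to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: "
    "ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from "
    "vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | "
    "triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, "
    "localBroker= vm://ams1-1#58408"
)

# Pattern derived from the single sample above (an assumption).
DUPLICATE_CLIENT = re.compile(
    r"InvalidClientIDException: Broker: (?P<broker>\S+)"
    r" - Client: (?P<client_id>\S+)"
    r" already connected from (?P<address>\S+)"
)


def parse_duplicate_client(line: str) -> Optional[dict]:
    """Return broker, client_id and address, or None if the line differs."""
    match = DUPLICATE_CLIENT.search(line)
    return match.groupdict() if match else None


info = parse_duplicate_client(LOG_LINE)
print(info)
```

      For the line above, the address that is reported as already holding the client ID is a vm:// address on the local broker, not a remote one.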

      Not a single broker would connect to another broker, and the messages imply that these connections already existed. However, I could see with netstat that the connections were being attempted, and the message occurred over and over, as if the brokers were retrying. Meanwhile, the web-based admin console showed nothing under Network. Not a single real connection was made.

      After a lot of troubleshooting, especially looking into the LDAP Authentication/Authorization settings and mechanism, I finally figured that it had to be something persisted, because this exact same setup, without a single configuration change, had been working perfectly before the storage disconnect.

      In the end, I completely deleted the LevelDB directory and restarted ActiveMQ on each node, and the setup is working flawlessly once again.
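
      For anyone repeating this workaround, a slightly more cautious variant is to rename the store directory instead of deleting it, so the old data can still be inspected afterwards. This is a Python sketch of that variant, not what was actually run here; the function name is hypothetical, and the broker must be stopped before calling it.

```python
import time
from pathlib import Path


def set_aside_store(store_dir) -> Path:
    """Rename store_dir to a timestamped backup and return the new path.

    The broker then starts with an empty store, which has the same effect
    on the broker as deleting the directory. Any messages persisted in the
    old store are no longer visible to the broker.
    """
    store_dir = Path(store_dir)
    if not store_dir.is_dir():
        raise FileNotFoundError(f"no such store directory: {store_dir}")
    backup = store_dir.with_name(f"{store_dir.name}.bak-{int(time.time())}")
    store_dir.rename(backup)
    return backup
```

      The argument would be the LevelDB directory configured for the broker's persistence adapter; the actual path depends on the installation and is not given in this report.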

      I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to allow me to cause a storage disconnect so I can test it, but I have a feeling that some information about store & forward connections is stored in the persistent store, and some sort of short-write occurred when the storage disconnect happened. However, since this data, whatever it may be, wasn't cleared or reset at broker startup, the broker erroneously believed that the connections I was trying to establish already existed.

      This may be an incorrect assumption, but at startup, the broker should reset any data it has that pertains to store and forward connections, because there's no way anything can really be connected at that time.
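
      The hypothesis can be illustrated with a toy model in Python. It is only a sketch of the reasoning above: the class and exception names are made up, and nothing here reflects how ActiveMQ actually tracks client IDs or what it persists.

```python
class DuplicateClientId(Exception):
    """Stand-in for the InvalidClientIDException seen in the log."""


class ToyBroker:
    """Toy broker that rejects a client ID it believes is already connected."""

    def __init__(self, persisted_client_ids=()):
        # Hypothesis: some record of bridge connections survives a restart.
        self.client_ids = set(persisted_client_ids)

    def start(self, reset_on_start):
        # Requested behavior: nothing can be connected at startup,
        # so any remembered connection state is safe to discard.
        if reset_on_start:
            self.client_ids.clear()

    def add_connection(self, client_id):
        if client_id in self.client_ids:
            raise DuplicateClientId(f"{client_id} already connected")
        self.client_ids.add(client_id)


STALE_ID = "ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1"

# Without a reset, the stale entry blocks the bridge indefinitely.
stuck = ToyBroker(persisted_client_ids=[STALE_ID])
stuck.start(reset_on_start=False)
try:
    stuck.add_connection(STALE_ID)
    stuck_rejected = False
except DuplicateClientId:
    stuck_rejected = True

# With a reset at startup, the same connection is accepted.
fixed = ToyBroker(persisted_client_ids=[STALE_ID])
fixed.start(reset_on_start=True)
fixed.add_connection(STALE_ID)
```

      In this model the only difference between the stuck broker and the working one is whether remembered connection state is cleared at startup.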

      I'll attach my configurations so that the environment, if not the storage disconnect, can be replicated.

      The steps to reproduce, if they were practical, would be:

      1.) Set up an AMQ store & forward mesh based on the attached configurations, on VMware ESX infrastructure.
      2.) Cause a storage interruption.
      3.) Reboot the VMs running AMQ to reset the read-only state of the block devices, after the storage interruption.
      4.) Try to bring the cluster back online.

People

    • Assignee: Unassigned
    • Reporter: John Anderson (johna1)