[AMQ-6092] Clear Broker to Broker Connection Info At Startup - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 5.12.0
Fix Version/s: None
Component/s: LevelDB
Labels: None
Environment: Linux

Description

This is a very difficult bug to describe, and an even tougher bug to replicate, so I guess I'll start by describing the circumstances that triggered this bug.

At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster. There are store and forward connections between each pair of data centers: Phoenix has non-duplex connections to Amsterdam and Ashburn, and each of those sites in turn has connections to the other two. This makes a mesh topology. Within a single data center, I have 3 copies of each broker using the replicated LevelDB feature in a kind of active/passive/passive configuration.
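A minimal sketch of that topology, assuming every site holds one one-way (non-duplex) connector to each of the other two sites. The site labels are illustrative; the actual broker names are in the attached configurations.

```python
from itertools import permutations

# Illustrative site labels, not the real broker names.
SITES = ["phoenix", "amsterdam", "ashburn"]

def mesh_links(sites):
    """Return every directed (source, target) connector in a full mesh."""
    return list(permutations(sites, 2))

if __name__ == "__main__":
    for source, target in mesh_links(SITES):
        print(f"{source} -> {target}")
```

With three sites this gives six one-way connectors, two per pair of data centers.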

This is just a PoC setup, sitting on VMware infrastructure, and it sat idle for quite some time. At some point while it was idle, storage maintenance caused a storage disconnect in Ashburn and Amsterdam. A storage disconnect is akin to just pulling the disk out of the box. Needless to say, AMQ didn't like this one bit. However, surviving a storage disconnect isn't really the point of this bug. The bug came into play when I tried restarting the cluster after storage was restored.

I restarted each of the VMs and began to bring the ActiveMQ instances back online, starting ZooKeeper first, then ActiveMQ. After I brought each replicated LevelDB group back up, the groups refused to reconnect to each other via the store & forward connections. I kept getting this error:

Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, localBroker= vm://ams1-1#58408
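The error packs several fields into one line. Below is a minimal sketch for pulling them out when scanning a broker log for repeats. The pattern is derived only from the single line quoted above, so it is an assumption that other occurrences (or other ActiveMQ versions) follow the same wording; the function name is hypothetical.

```python
import re

# Built from the one log line quoted in this report; treat as an assumption.
PATTERN = re.compile(
    r"Failed to add Connection (?P<connection>\S+) due to "
    r"javax\.jms\.InvalidClientIDException: "
    r"Broker: (?P<broker>\S+) - Client: (?P<client_id>\S+) "
    r"already connected from (?P<origin>\S+)"
)

def parse_duplicate_client_id(line):
    """Return the fields of an 'already connected' error, or None if no match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None

SAMPLE = (
    "Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 "
    "due to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: "
    "ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from "
    "vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection"
)

if __name__ == "__main__":
    print(parse_duplicate_client_id(SAMPLE))
```

Note that the reported origin is a `vm://` address on the local broker, which is what suggests the conflict is with something the broker itself holds rather than with a remote peer.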

Not a single broker would connect to another broker, and the messages imply that these connections already existed. I could see with netstat that connection attempts were being made, and the message occurred over and over, as if the brokers were retrying. However, the web-based admin console showed nothing under Network. Not a single real connection was made.

After a lot of troubleshooting, especially looking into the LDAP Authentication/Authorization settings and mechanism, I finally figured that it had to be something persisted, because this exact same setup, without a single configuration change, had been working perfectly before the storage disconnect.

In the end, I completely deleted the LevelDB directory and restarted ActiveMQ on each node, and the setup is working flawlessly once again.
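For anyone repeating this workaround, a more cautious variant is to rename the store directory instead of deleting it, so the old data can still be inspected afterwards. This is only a sketch: the function name and the backup naming scheme are hypothetical, the store path depends on your configuration, and the broker must be stopped first. Either way, any messages persisted in the old store are no longer visible to the broker.

```python
import shutil
import time
from pathlib import Path

def set_aside_store(store_dir, suffix=None):
    """Rename store_dir to '<name>.bak-<suffix>' and return the new path.

    Run only while the broker is stopped. The path passed in is whatever
    directory your persistence adapter is configured to use.
    """
    store = Path(store_dir)
    if not store.is_dir():
        raise FileNotFoundError(f"store directory not found: {store}")
    # Default suffix is a timestamp so repeated runs do not collide.
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = store.with_name(f"{store.name}.bak-{suffix}")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(store), str(backup))
    return backup
```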

I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to allow me to cause a storage disconnect so I can test it. My feeling is that some information about store & forward connections is kept in the persistent store, and that some sort of short write occurred when the storage disconnect happened. Since this data, whatever it may be, wasn't cleared or reset at broker startup, the broker erroneously believed that the connections I was trying to establish already existed.

This may be an incorrect assumption, but at startup, the broker should reset any data it has that pertains to store and forward connections, because there's no way anything can really be connected at that time.

I'll attach my configurations so that the environment, if not the storage disconnect, can be replicated.

The steps to reproduce, if they were practical, would be:

1.) Set up an AMQ store & forward mesh based on the attached configurations, on VMware ESX infrastructure.
2.) Cause a storage interruption.
3.) Reboot the VMs running AMQ to reset the read-only state of the block devices, after the storage interruption.
4.) Try to bring the cluster back online.

Attachments

activemq-configurations.tar.gz (5 kB), attached by John Anderson, 17/Dec/15 00:17


People

Assignee: Unassigned

Reporter: John Anderson

Votes: 0

Watchers: 2

Dates

Created: 17/Dec/15 00:01

Updated: 06/Feb/17 14:31

Resolved: 06/Feb/17 14:31
