[QPID-7655] It takes about 15 minutes for a push route (bridge) to restore communications with a remote broker upon network recovery - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: qpid-cpp-1.35.0, qpid-cpp-1.36.0
Fix Version/s: None
Component/s: C++ Broker
Labels:
None
Environment:

Linux C++ broker with a bridge (push route) to a remote C++ broker
and the network connection to the remote broker can be turned off and on

Flags:

Important

Description

We have the following use case and issue:
1. a Linux C++ broker is used on a mobile device.
2. there is a push route (bridge) between a mobile Linux C++ broker and a remote broker established using the following:
qpid-route queue add -s -d -v --ack 1 --credit 100 remoteBroker mobileBroker remoteExchange mobileQueue
3. the network connectivity for the mobile device is intermittent which means that the mobile device could be out of communication and then be in communication for some time. We would like that the qpid broker on the mobile device can recover from communication losses as soon as possible when it transitions from out of communication to in communication. However in the current normal running environment, when this happens, it takes about 15 minutes for the mobile broker to recover communications with the remote broker.

To re-create the issue, one must:
1. run the mobile broker with qpid logging set to info+.
2. using qpid-send to send some messages to the mobile broker periodically say every second (to simulate messages put to the mobileQueue to be routed to the remote broker).

Then here is what it looked like

The network connection was turned off 2-3 minutes before 2017-01-24 21:20:03

then:

2017-01-24 21:20:03 [Protocol] error Federation link connection qpid.tcp:remoteBroker:5672 missed 2 heartbeats - closing connection
..........

Network connection was actually restored around 2017-01-24 21:21:00. However the inter-broker link could only be restored 15+ minutes later:
..........
2017-01-24 21:36:25 [System] error Error reading socket: Invalid argument(22)
2017-01-24 21:36:25 [Broker] info Inter-broker link disconnected from remoteBroker:5672 Closed by peer (Note: this disconnect should happen at 2017-01-24 21:20:03 when the connection was closed after 2 heartbeats were missed)
2017-01-24 21:36:25 [System] info Connecting: remoteBroker:5672
2017-01-24 21:36:25 [Broker] info Inter-broker link connecting to remoteBroker:5672
2017-01-24 21:36:25 [Broker] info Inter-broker link established to remoteBroker:5672

So the heartbeats triggered the connection close at 21:20:03 as expected. However, it almost seems as if the connection didn't close properly. There was nothing in the log after the "closing connection" until the socket error occured 15+ minutes later. No attempts to reconnect, nothing until the socket error. Network connection was back for 15 minutes by the time the connection closed at 21:36:25. It then connected promptly and the messages started being routed out to the remote broker within 5-10 seconds of the link succeeding,

The following is my analysis of the root cause from C++ source code. There is also a workaround. But the bug needs to be fixed:
--------

Here is a bit more detailed issue why qpid broker takes 15+ minutes to close a connection which caused message delay if a device is out of the comms and then in comms again:

-------------
The issue was that in /qpid/sys/posix/AsynchIO.cpp, the following function can only be called if there is a writeable socket event

void AsynchIO::writeable(DispatchHandle& h)

In the middle of the function there was a place that DispatchHandler will be closed by

close(h);

In the same file, there was another function

void AsynchIO::queueWriteClose()

In this function it will set a flag of queuedClose to true when there was a communication loss. The flag will trigger close(h) in the aforementioned function of writeable(DispatchHandle& h) when the writeable() function is called next time. However if this function is not triggered, close(h) will not be called until 15 mins later the socket error happens.

The fix shall be provided so that when queuedClose is set, writeable() must be called immediately to close(h).

Without source code change or fix, the workaround is to increase the socket send buffer size by the following linux commands so that writeable() can still be triggered after 2-3 minutes queuedClose is set:
sysctl -w net.core.wmem_max=4194304
sysctl -w net.ipv4.tcp_wmem='4096 4194304 4194304'
sysctl -w net.ipv4.route.flush=1

It takes about 15 minutes for a push route (bridge) to restore communications with a remote broker upon network recovery

Details

Description

Attachments

Activity

People

Dates