Uploaded image for project: 'Qpid'
  1. Qpid
  2. QPID-7655

It takes about 15 minutes for a push route (bridge) to restore communications with a remote broker upon network recovery

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: qpid-cpp-1.35.0, qpid-cpp-1.36.0
    • Fix Version/s: None
    • Component/s: C++ Broker
    • Labels:
      None
    • Environment:

      Linux C++ broker with a bridge (push route) to a remote C++ broker
      and the network connection to the remote broker can be turned off and on

    • Flags:
      Important

      Description

      We have the following use case and issue:
      1. a Linux C++ broker is used on a mobile device.
      2. there is a push route (bridge) between a mobile Linux C++ broker and a remote broker established using the following:
      qpid-route queue add -s -d -v --ack 1 --credit 100 remoteBroker mobileBroker remoteExchange mobileQueue
      3. the network connectivity for the mobile device is intermittent which means that the mobile device could be out of communication and then be in communication for some time. We would like that the qpid broker on the mobile device can recover from communication losses as soon as possible when it transitions from out of communication to in communication. However in the current normal running environment, when this happens, it takes about 15 minutes for the mobile broker to recover communications with the remote broker.

      To re-create the issue, one must:
      1. run the mobile broker with qpid logging set to info+.
      2. using qpid-send to send some messages to the mobile broker periodically say every second (to simulate messages put to the mobileQueue to be routed to the remote broker).

      Then here is what it looked like

      The network connection was turned off 2-3 minutes before 2017-01-24 21:20:03

      then:

      2017-01-24 21:20:03 [Protocol] error Federation link connection qpid.tcp:remoteBroker:5672 missed 2 heartbeats - closing connection
      ..........

      Network connection was actually restored around 2017-01-24 21:21:00. However the inter-broker link could only be restored 15+ minutes later:
      ..........
      2017-01-24 21:36:25 [System] error Error reading socket: Invalid argument(22)
      2017-01-24 21:36:25 [Broker] info Inter-broker link disconnected from remoteBroker:5672 Closed by peer (Note: this disconnect should happen at 2017-01-24 21:20:03 when the connection was closed after 2 heartbeats were missed)
      2017-01-24 21:36:25 [System] info Connecting: remoteBroker:5672
      2017-01-24 21:36:25 [Broker] info Inter-broker link connecting to remoteBroker:5672
      2017-01-24 21:36:25 [Broker] info Inter-broker link established to remoteBroker:5672

      So the heartbeats triggered the connection close at 21:20:03 as expected. However, it almost seems as if the connection didn't close properly. There was nothing in the log after the "closing connection" until the socket error occured 15+ minutes later. No attempts to reconnect, nothing until the socket error. Network connection was back for 15 minutes by the time the connection closed at 21:36:25. It then connected promptly and the messages started being routed out to the remote broker within 5-10 seconds of the link succeeding,

      The following is my analysis of the root cause from C++ source code. There is also a workaround. But the bug needs to be fixed:
      --------

      Here is a bit more detailed issue why qpid broker takes 15+ minutes to close a connection which caused message delay if a device is out of the comms and then in comms again:

      -------------
      The issue was that in /qpid/sys/posix/AsynchIO.cpp, the following function can only be called if there is a writeable socket event

      void AsynchIO::writeable(DispatchHandle& h)

      In the middle of the function there was a place that DispatchHandler will be closed by

      close(h);

      In the same file, there was another function

      void AsynchIO::queueWriteClose()

      In this function it will set a flag of queuedClose to true when there was a communication loss. The flag will trigger close(h) in the aforementioned function of writeable(DispatchHandle& h) when the writeable() function is called next time. However if this function is not triggered, close(h) will not be called until 15 mins later the socket error happens.

      The fix shall be provided so that when queuedClose is set, writeable() must be called immediately to close(h).

      Without source code change or fix, the workaround is to increase the socket send buffer size by the following linux commands so that writeable() can still be triggered after 2-3 minutes queuedClose is set:
      sysctl -w net.core.wmem_max=4194304
      sysctl -w net.ipv4.tcp_wmem='4096 4194304 4194304'
      sysctl -w net.ipv4.route.flush=1

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              chenjun Jun Chen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: