Uploaded image for project: 'ActiveMQ .Net'
  1. ActiveMQ .Net
  2. AMQNET-656

AMQP failover implementation fails to reconnect in some cases

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • AMQP-1.8.1
    • None
    • AMQP
    • None
    • Patch Available

    Description

      We recently had an issue where some of our producer instances were able to reconnect immediately after the master/slave (or primary/standby) broker cluster failed over, while others never reconnected.

      It appears to be related to two existing JIRAs, AMQNET-624 (addressed in GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed without any changes).

      Regarding the bug identified and fixed in AMQNET-624, part of the original solution was pulled back: where TriggerReconnectionAttempt was called via Task.Run, to instead call it directly. The second issue, AMQNET-626 was to raise concern about the unawaited task returned by TriggerReconnectionAttempt.

      I think perhaps calling from Task.Run may have been beneficial after all: it ensured that TriggerReconnectionAttempt was running on a thread from the thread pool. Otherwise, when TriggerReconnectionAttempt calls ScheduleReconnect, and ScheduleReconnect does an await, that is occurring from within a lock statement.

      As noted in MSDN, an await statement cannot occur inside of a lock statement. That includes anywhere in the call stack, as far as I understand. If you do, it is not caught by the compiler, but can lead to failures e.g. where the task being awaited never gets scheduled.

      Invoking TriggerReconnectionAttempt from a thread pool thread (or another background thread) is one way to avoid this issue, and using Task.Run() might be the easiest way, even though it may also raise eyebrows. Any performance overhead of Task.Run() shouldn't be a factor, since it is only invoked upon losing connection, not continuously.

      The call to Task.Run() could also be moved into TriggerReconnectionAttempt, like so:

      // this is invoked using Task.Run, to ensure it runs on a thread pool thread
      // in case this was invoked from inside a lock statement (which it is)
      return Task.Run(async () => await reconnectControl.ScheduleReconnect(Reconnect));

      It does still leave the issue identified in AMQNET-626, that the result is not checked, but it resolves the failover failure caused by calling await inside of a lock.

      (Another way around this would be to use a SemaphoreSlim, or other async-compatible synchronization mechanism instead of a lock statement. However, that could have far-reaching implications, since lock statements are used in many parts of the AMQP implementation.)

      Attachments

        Activity

          People

            Unassigned Unassigned
            brudo Bruce Dodson
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 8.5h
                8.5h