ActiveMQ .Net
  1. ActiveMQ .Net
  2. AMQNET-293

Consumer is not recovered after failover reconnect if connection had been lost in OnMessage (before SendACK)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1
    • Fix Version/s: 1.5.0
    • Component/s: NMS
    • Labels:
      None
    • Environment:

      Windows XP - however I believe it's not dependent on the operating system

      Description

      To reproduce the error write the simple WinForm application in C# with listbox (or console app)
      1. Create the connection: failover:(tcp://localhost:61616?keepAlive=true) and start it.
      2. Create the session and the QueueConsumer for queue e.g. "TestQueue" (in default AutoAcknoledged mode)
      3. For queue consumer set the message listener e.g. OnMessage
      4. In OnMessage method do something like Sleep(5000) and then display the received textMessage (by Invoke add the messge to listbox).
      5. By means of localhost:8161/admin create the TestQueue and put to this queue about 20 persistent text messages.
      6. Run the application. You should get on the screen new line every 5 seconds.
      7. Restart the ActiveMQ broker (I'm using 5.3.0.5)
      8. After restarting the broker you stop receiving the messages.
      9. Restart broker again. And you will start getting the messages.

      ----------

      The problem is in NMS.
      Most likely when you restart ActiveMQ broker the client app will be in OnMessage method (just sleeping there for 5 seconds). When those 5 seconds is over then the NMS is trying to SendACK. And this method will not end until failover thread successfully reconnect.. For that time there is a lock on the unconsumedMessages.SynchRoot (see MessageConsumer.Dispatch method). And this fact is painful for another thread which is trying to do the unconsumedMessage.Clear() and needs the locked resource. (this thread is initated in Connection.OnTransportInterrupted() and it wants to call ClearMessagesInProgress on MessageConsumer).
      The worst thing is that the MessageConsumer.ClearMessagesInProgress method cannot call (as waits for locked resource) TransportInterruptionProcessingComplete() which I guess registers consumers which have to be recovered when the connection is back.
      So when the connection is back:
      SendAck completes
      Failover thread DoRecover but does not find our consumer.
      Dispatch method unlocks the unconsumedMessages.SynchRoot
      The working thread registers consumers to be recovered (absolutely too late!!!!)

      1. MessageConsumerDirtyPatch.cs
        3 kB
        Jan Vogelgesang
      2. UnitTestAndPatch.zip
        9 kB
        Jan Vogelgesang
      3. CropperCapture[41].jpg
        81 kB
        Mark Gellings
      4. CropperCapture[43].jpg
        91 kB
        Mark Gellings
      5. CropperCapture[44].jpg
        78 kB
        Mark Gellings

        Activity

        Hide
        Timothy Bish added a comment -

        You should try using the trunk code, I think this is resolved there.
        There's a SNAPSHOT build here: http://people.apache.org/~tabish/nms-1.5.0/

        Show
        Timothy Bish added a comment - You should try using the trunk code, I think this is resolved there. There's a SNAPSHOT build here: http://people.apache.org/~tabish/nms-1.5.0/
        Hide
        Jan Vogelgesang added a comment - - edited

        I've used the trunk code and it's still the same problem there.
        Recently in the trunk there was removed the deadlock, but it doesn't fix the issue described by me.

        Show
        Jan Vogelgesang added a comment - - edited I've used the trunk code and it's still the same problem there. Recently in the trunk there was removed the deadlock, but it doesn't fix the issue described by me.
        Hide
        Timothy Bish added a comment -

        You could help out by providing a unit test and even better a patch to fix the problem.

        Show
        Timothy Bish added a comment - You could help out by providing a unit test and even better a patch to fix the problem.
        Hide
        Jan Vogelgesang added a comment -

        I'm going thru the existing unit tests, but unfortunately I cannot figure out quickly how to translate this scenario to unit test language.
        Anyway the patch could be to call the AfterMessageIsConsumed method out of the lock(unconsumedMessages.SynchRoot)...
        I will attach the dirty fix which works for this simple scenario.

        Show
        Jan Vogelgesang added a comment - I'm going thru the existing unit tests, but unfortunately I cannot figure out quickly how to translate this scenario to unit test language. Anyway the patch could be to call the AfterMessageIsConsumed method out of the lock(unconsumedMessages.SynchRoot)... I will attach the dirty fix which works for this simple scenario.
        Hide
        Jan Vogelgesang added a comment -

        Please review

        Show
        Jan Vogelgesang added a comment - Please review
        Hide
        Jan Vogelgesang added a comment -

        Just added an unit test for this case and new version of MessageConsumer.cs which solves the problem.

        Hope it doesn't destroy any other part of the NMS client. If you have any questions just contact me please.

        In the unit test I've just put the routine which restarts the ActiveMQ windows service - it's not the smartest way of emulating connection break, but it was the quickest to implement.

        Show
        Jan Vogelgesang added a comment - Just added an unit test for this case and new version of MessageConsumer.cs which solves the problem. Hope it doesn't destroy any other part of the NMS client. If you have any questions just contact me please. In the unit test I've just put the routine which restarts the ActiveMQ windows service - it's not the smartest way of emulating connection break, but it was the quickest to implement.
        Hide
        Timothy Bish added a comment -

        moving the code out of the sync block exposes a potential for out of order acks to be sent to the broker which violates that ack model that AMQ uses so this fix can't go into trunk.

        Show
        Timothy Bish added a comment - moving the code out of the sync block exposes a potential for out of order acks to be sent to the broker which violates that ack model that AMQ uses so this fix can't go into trunk.
        Hide
        Jan Vogelgesang added a comment -

        Timothy, thanks for the info. I am not deep in the NMS code and my patch was just to prove that there is a problem with the block section. I hoped that it might be enough to fix it, however, I felt under my skin that it may spoil the delivery logic. Thanks, for analyzing it.

        Another solution which I thought about was not applying failover to sending Acks. That means duplicates messages may be delivered and I'm fine with that.
        Importan't for me is to have the solution for:
        a) reliable transport (I cannot lose any message)
        b) the consumer must recover after connection break (currently it doesn't)

        Anyway were you able to reproduce the problem I described ?
        If so, maybe you have any idea how to fix it. Please keep me informed.

        Show
        Jan Vogelgesang added a comment - Timothy, thanks for the info. I am not deep in the NMS code and my patch was just to prove that there is a problem with the block section. I hoped that it might be enough to fix it, however, I felt under my skin that it may spoil the delivery logic. Thanks, for analyzing it. Another solution which I thought about was not applying failover to sending Acks. That means duplicates messages may be delivered and I'm fine with that. Importan't for me is to have the solution for: a) reliable transport (I cannot lose any message) b) the consumer must recover after connection break (currently it doesn't) Anyway were you able to reproduce the problem I described ? If so, maybe you have any idea how to fix it. Please keep me informed.
        Hide
        Timothy Bish added a comment -

        Working on a fix, may not be done until next week as I'm on travel starting tomorrow.

        Show
        Timothy Bish added a comment - Working on a fix, may not be done until next week as I'm on travel starting tomorrow.
        Hide
        Timothy Bish added a comment -

        Added some changes in trunk along with a modified version of the unit test that doesn't start / stop broker but breaks the connection locally.

        Show
        Timothy Bish added a comment - Added some changes in trunk along with a modified version of the unit test that doesn't start / stop broker but breaks the connection locally.
        Hide
        Mark Gellings added a comment -

        Jan I tested your patch. My unit test in issue AMQNET-294 did pass after applying it. But only with Client Acknowledgement mode, not with Individual Acknowledgement mode.

        It also did not work when I increased the prefetch to 100 (default), sent through 1000 messages, and consumed at a rate of only 100ms delay between messages. Then there was message loss still.

        Tim, are the changes you made to trunk likely to fix our issues? Should I rebuild from trunk and test it out with our tests here?

        Show
        Mark Gellings added a comment - Jan I tested your patch. My unit test in issue AMQNET-294 did pass after applying it. But only with Client Acknowledgement mode, not with Individual Acknowledgement mode. It also did not work when I increased the prefetch to 100 (default), sent through 1000 messages, and consumed at a rate of only 100ms delay between messages. Then there was message loss still. Tim, are the changes you made to trunk likely to fix our issues? Should I rebuild from trunk and test it out with our tests here?
        Hide
        Jan Vogelgesang added a comment -

        Timothy, unfortunately your changes do not fix the problem. Still consumer does not recover after broker failure...
        I've tested also all tags and it works fine in 1.2 and in 1.3. So the problem appeared in 1.4.

        Mark, my patch was just an experiment. It proves that it's something wrong. I suspected that it can help somehow to your issue, but as you said it doesn't solve the problem... You can test your program with NMS 1.3 (if you do not base on any specific functionality/fixes done for 1.4)

        Show
        Jan Vogelgesang added a comment - Timothy, unfortunately your changes do not fix the problem. Still consumer does not recover after broker failure... I've tested also all tags and it works fine in 1.2 and in 1.3. So the problem appeared in 1.4. Mark, my patch was just an experiment. It proves that it's something wrong. I suspected that it can help somehow to your issue, but as you said it doesn't solve the problem... You can test your program with NMS 1.3 (if you do not base on any specific functionality/fixes done for 1.4)
        Hide
        Timothy Bish added a comment -

        These changes were just some initial work that needed to be done to clean up some of the failover handling code. I will work on this more in the coming week once I get back into the office. You are welcome to continue to inspect the code and compare it with the Java client to see where we are going wrong in the mean time.

        Show
        Timothy Bish added a comment - These changes were just some initial work that needed to be done to clean up some of the failover handling code. I will work on this more in the coming week once I get back into the office. You are welcome to continue to inspect the code and compare it with the Java client to see where we are going wrong in the mean time.
        Hide
        Mark Gellings added a comment -

        I have tested against tagged 1.1, 1.2, 1.3, and 1.4.1 versions. Only 1.4.1 do I see messages not come across.

        With 1.1, 1.2, and 1.3 a lot of duplicate messages (sometimes triplets) come across with CLIENT and INDIVIDUAL acknowledgement modes. Duplicate message count is generally equal to the prefetch count. This seems strange when previously ack'd messages are replayed. Should this be working like this? Perhaps I misunderstand message replay so forgive me if so.

        Testing 1.3 with TRANSACTIONAL acknowledgement mode and only recieve one or no duplicates. They come across correctly as redelivered.

        I will include this comment as well as an upgraded unit test file on issue AMQNET-294.

        Show
        Mark Gellings added a comment - I have tested against tagged 1.1, 1.2, 1.3, and 1.4.1 versions. Only 1.4.1 do I see messages not come across. With 1.1, 1.2, and 1.3 a lot of duplicate messages (sometimes triplets) come across with CLIENT and INDIVIDUAL acknowledgement modes. Duplicate message count is generally equal to the prefetch count. This seems strange when previously ack'd messages are replayed. Should this be working like this? Perhaps I misunderstand message replay so forgive me if so. Testing 1.3 with TRANSACTIONAL acknowledgement mode and only recieve one or no duplicates. They come across correctly as redelivered. I will include this comment as well as an upgraded unit test file on issue AMQNET-294 .
        Hide
        Timothy Bish added a comment -

        What version of the broker is being used for your testing?

        Show
        Timothy Bish added a comment - What version of the broker is being used for your testing?
        Hide
        Jan Vogelgesang added a comment -

        I've been using version broker 5.3.0.5 and got the problem when using.both .Net client 1.4 and the one compiled from trunk, it doesn't work with 5.4 java client)
        When I'm using broker in 5.4 and .Net client compiled from trunk it works fine.
        There is a problem with deadlock in NMS 1.4 which has been already fixed in trunk.

        Show
        Jan Vogelgesang added a comment - I've been using version broker 5.3.0.5 and got the problem when using.both .Net client 1.4 and the one compiled from trunk, it doesn't work with 5.4 java client) When I'm using broker in 5.4 and .Net client compiled from trunk it works fine. There is a problem with deadlock in NMS 1.4 which has been already fixed in trunk.
        Hide
        Timothy Bish added a comment -

        Ok, that makes sense. This behaviour happens as a result of the fix for AMQ-2579 and the broker being a 5.3 variant. Since the broker only speaks OpenWire v5 the client is registering the Consumer first with a prefetch of 0 and then later on completion of the onMessage the client attempts then to send a ConsumerControl message to return the Consumer to its original prefetch setting but the broker doesn't get that because that behaviour depends on OpenWire V6.

        Show
        Timothy Bish added a comment - Ok, that makes sense. This behaviour happens as a result of the fix for AMQ-2579 and the broker being a 5.3 variant. Since the broker only speaks OpenWire v5 the client is registering the Consumer first with a prefetch of 0 and then later on completion of the onMessage the client attempts then to send a ConsumerControl message to return the Consumer to its original prefetch setting but the broker doesn't get that because that behaviour depends on OpenWire V6.
        Hide
        Mark Gellings added a comment -

        Testing trunk with ActiveMQ 5.4.2 and there is still blocking. Unit test on AMQNET-294 can be used to replicate.

        Show
        Mark Gellings added a comment - Testing trunk with ActiveMQ 5.4.2 and there is still blocking. Unit test on AMQNET-294 can be used to replicate.
        Hide
        Timothy Bish added a comment -

        A stack trace of all the running threads when the client is blocked would be helpful.

        Show
        Timothy Bish added a comment - A stack trace of all the running threads when the client is blocked would be helpful.
        Hide
        Mark Gellings added a comment -

        Here are screen shots of the stack traces and place in code where blocking occurs.

        This is using IndividualAcknowledgement mode.

        You may need to run the test a few times to replicate as message may not be in listener event, in which I believe blocking does not occur.

        Show
        Mark Gellings added a comment - Here are screen shots of the stack traces and place in code where blocking occurs. This is using IndividualAcknowledgement mode. You may need to run the test a few times to replicate as message may not be in listener event, in which I believe blocking does not occur.
        Hide
        Mark Gellings added a comment -

        Should mention the unit test I was running was "All_messages_should_come_through_for_a_durable_subscription_when_broker_is_restarted" and can be found on AMQNET-294 in Apache.NMS.Test (2).zip attachment.

        Show
        Mark Gellings added a comment - Should mention the unit test I was running was "All_messages_should_come_through_for_a_durable_subscription_when_broker_is_restarted" and can be found on AMQNET-294 in Apache.NMS.Test (2).zip attachment.
        Hide
        Timothy Bish added a comment -

        Thanks, will dig into this some more today.

        Show
        Timothy Bish added a comment - Thanks, will dig into this some more today.
        Hide
        Timothy Bish added a comment -

        I haven't been able to reproduce any sort of deadlock using latest trunk and the provided tests in AMQNET-294

        Show
        Timothy Bish added a comment - I haven't been able to reproduce any sort of deadlock using latest trunk and the provided tests in AMQNET-294
        Hide
        Mark Gellings added a comment -

        That is certainly strange. I replicate the behavior on each time I run the test.

        I see that you revised the unit test a bit and attached your cs file. Were you using that test or did you try the unit test I provided?

        Do you see any differences between the configuration of my ActiveMQ 5.4 brokers vs. yours that could cause behavior change?

        Anything I can do to help with this?

        Show
        Mark Gellings added a comment - That is certainly strange. I replicate the behavior on each time I run the test. I see that you revised the unit test a bit and attached your cs file. Were you using that test or did you try the unit test I provided? Do you see any differences between the configuration of my ActiveMQ 5.4 brokers vs. yours that could cause behavior change? Anything I can do to help with this?
        Hide
        Timothy Bish added a comment -

        I ran your test, but revised it to allow for a clear comparison with Java client code (also attached) I couldn't find any difference between the behaviour of java and .net clients other than the fact that the .net client doesn't have the duplicate filtering code that the java code does. I don't a database setup like yours so my persistence adapter is different in the activemq.xml.

        Show
        Timothy Bish added a comment - I ran your test, but revised it to allow for a clear comparison with Java client code (also attached) I couldn't find any difference between the behaviour of java and .net clients other than the fact that the .net client doesn't have the duplicate filtering code that the java code does. I don't a database setup like yours so my persistence adapter is different in the activemq.xml.
        Hide
        Timothy Bish added a comment -

        Added a fix to make the client act as it did before when connected to a 5.3 broker when the openwire protocol version in use doesn't support the new style consumer recovery model.

        Show
        Timothy Bish added a comment - Added a fix to make the client act as it did before when connected to a 5.3 broker when the openwire protocol version in use doesn't support the new style consumer recovery model.

          People

          • Assignee:
            Timothy Bish
            Reporter:
            Jan Vogelgesang
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development