ActiveMQ / AMQ-4465

Rethink replayWhenNoConsumers solution

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.8.0
    • Fix Version/s: 5.9.0
    • Component/s: Broker
    • Labels:
      None

      Description

      I would like to start a discussion about the way we allow messages to be replayed back to the original broker in a broker network, i.e. setting replayWhenNoConsumers=true.

      This discussion is based on the blog post
      http://tmielke.blogspot.de/2012/03/i-have-messages-on-queue-but-they-dont.html
      but I will outline the full story here again.

      Consider a network of two brokers A and B.
      Broker A has a producer that sends one message to queue Test.in. Broker B has a consumer connected, so the message is transferred to broker B. Let's assume the consumer disconnects from B before it consumes the message and reconnects to broker A. If broker B has replayWhenNoConsumers=true, the message will be replayed back to broker A.
      If that replay happens within a short time frame, the cursor will mark the replayed message as a duplicate and won't dispatch it. To overcome this, one needs to set enableAudit=false on the policyEntry for the destination.
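As a sketch, such a configuration in the broker's activemq.xml might look like the following (the queue name Test.in comes from the scenario above; replayWhenNoConsumers is set via the conditional network bridge filter factory):

```xml
<!-- Illustrative policyEntry for broker B: replay messages back over the
     bridge when no local consumers exist, and disable the cursor audit so
     the replayed messages are not suppressed as duplicates. -->
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <policyEntry queue="Test.in" enableAudit="false">
        <networkBridgeFilterFactory>
          <conditionalNetworkBridgeFilterFactory replayWhenNoConsumers="true"/>
        </networkBridgeFilterFactory>
      </policyEntry>
    </policyEntries>
  </policyMap>
</destinationPolicy>
```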

      This has a consequence, as it disables duplicate detection in the cursor. External JMS producers will still be blocked from sending duplicates thanks to the duplicate detection built into the persistence adapter.
      However, you can now receive duplicate messages over the network bridge. With enableAudit=false these duplicates will happily be added to the cursor. If the same consumer receives the duplicate message, it will likely detect the duplicate. However, if the duplicate message is dispatched to a different consumer, it won't be detected and will be processed by the application.

      For many use cases it's important not to receive duplicate messages, so the combination of replayWhenNoConsumers=true and enableAudit=false becomes a problem.

      There is the additional option of specifying auditNetworkProducers="true" on the transport connector, but that very likely has consequences as well. With auditNetworkProducers="true" we now detect duplicates over the network bridge, so if there is a network glitch while the message is replayed back over the bridge to broker A and broker B tries to resend the message, it will be detected as a duplicate on broker A. This is good.
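A minimal sketch of this setting on the receiving broker (port and connector name are illustrative):

```xml
<!-- Illustrative transport connector with network-producer auditing enabled,
     so messages arriving over network bridges are also checked for duplicates. -->
<transportConnectors>
  <transportConnector name="openwire"
                      uri="tcp://0.0.0.0:61616"
                      auditNetworkProducers="true"/>
</transportConnectors>
```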

      However, let's assume the consumer now disconnects from broker A after the message was replayed back from broker B to broker A, but before the consumer actually received the message. The consumer then reconnects to broker B again.
      The replayed message is on broker A now. Broker B registers new demand for this message (due to the consumer reconnecting) and broker A will pass the message on to broker B again. However, due to auditNetworkProducers="true", broker B will treat the resent message as a duplicate and very likely not accept it (or, even worse, simply drop the message - not sure how exactly it will behave).

      So the message is stuck again and won't be dispatched to the consumer on broker B.
      The networkTTL setting further affects this scenario, as do other broker topologies such as a full mesh.
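For reference, networkTTL is configured on the network connector; a sketch for broker A (broker host and connector name are examples, not taken from this issue):

```xml
<!-- Illustrative network connector from broker A to broker B; networkTTL
     bounds how many brokers a message may traverse, which also limits how
     often it can be replayed back and forth across the network. -->
<networkConnectors>
  <networkConnector name="A-to-B"
                    uri="static:(tcp://brokerB:61616)"
                    networkTTL="2"
                    duplex="true"/>
</networkConnectors>
```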

      It seems to me that

      • When allowing replayWhenNoConsumers=true you may receive duplicate messages unless you also set auditNetworkProducers="true", which has consequences as well.
      • If consumers reconnect to a different broker each time, you may end up with messages stuck on a broker that won't get dispatched.
      • Ideally you want sticky consumers, i.e. they reconnect to the same broker if possible in order to avoid replaying messages back. This implies that you don't want to use randomize=true on failover URLs. I don't think we recommend this in any docs.
      • The networkTTL will potentially never be high enough, and the message may be stuck on a particular broker as the consumer may have reconnected to another broker in the network.
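On the client side, sticky consumers can be approximated with a non-randomized failover URL; a sketch as a Spring connection factory bean (broker hostnames are illustrative):

```xml
<!-- Illustrative client connection factory: randomize=false makes the
     failover transport try the listed brokers in order, so a reconnecting
     consumer stays "sticky" to brokerA while it is reachable. -->
<bean id="connectionFactory"
      class="org.apache.activemq.ActiveMQConnectionFactory">
  <property name="brokerURL"
            value="failover:(tcp://brokerA:61616,tcp://brokerB:61616)?randomize=false"/>
</bean>
```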

      I am sure there are more sides to this discussion. I just wanted to capture what gtully and I found when discussing this problem.

          Activity

          wangyin SuoNayi added a comment -

          I believe people suffer pain when using a network of brokers:
          either you receive duplicate messages or you lose messages with volatile consumers.
          We used sticky consumers to avoid these issues because we know these limits, but many users may not, and it's easy to fall into the trap.

          raulvk Raúl Kripalani added a comment -

          At the risk of sparking an entirely different discussion – the real culprit of all this is the store-and-forward technique, in my humble opinion. I think the AMQ model could be essentially flawed for highly dynamic, elastic or cloud-like scenarios, where consumers and producers can appear anywhere in the messaging fabric, and AMQ instances are provisioned and de-provisioned on the fly.

          The replayWhenNoConsumers option was a solution to bounce messages freely across the cluster. But what we really need is for multiple ACTIVE brokers to see a single view of reality, i.e. shared knowledge about what messages exist and are pending delivery, what consumers are alive and where, etc.: a messaging cloud.

          In the era of big data and huge in-memory caches, this seems perfectly doable. I'd advocate for a solution such that:

          • ACTIVE brokers can connect to a single cache/db, no more exclusivity or master locks.
          • Reads and writes must be atomic or transactional, but blazing fast in both cases.
          • All instances see all messages and consumers, but are responsible for only local consumers. They decide when to pick a message from the cache and push it to a consumer.
          • May be embeddable, so that you don't have to start a separate process to use AMQ OOTB.
          • Can be persistent/non-persistent.

          Many NoSQL databases or Java-based distributed cache technologies exist which could fulfill these requirements (probably with some adaptations).

          gtully Gary Tully added a comment -

          https://issues.apache.org/jira/browse/AMQ-4607 sorts out the networkTTL limits issue.
          It also addresses the need for enableAudit=false with replay, as the message audit is rolled back when a network consumer acks a message.

          tmielke Torsten Mielke added a comment -

          Agreed Gary. I will mark this bug as resolved.

          tmielke Torsten Mielke added a comment -

          This should be fixed by the changes in AMQ-4607.

          gtully Gary Tully added a comment -

          I think we need to introduce reliable forwarding, i.e. using a two-phase local transaction between brokers such that we can totally avoid duplicate sends in the network. This is really the expectation. Any audit will be memory constrained (sliding window) or store constrained (so it cannot suppress a duplicate once the original has been dispatched).
          see: https://issues.apache.org/jira/browse/AMQ-4944


            People

            • Assignee: Unassigned
            • Reporter: tmielke Torsten Mielke
            • Votes: 0
            • Watchers: 6
