Uploaded image for project: 'ActiveMQ Artemis'
  1. ActiveMQ Artemis
  2. ARTEMIS-3583

Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause spurious AMQ224013 warnings to be emitted.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.18.0
    • None
    • Broker
    • None
    • Large-core machines running embedded ActiveMQ broker.

    Description

      A recent change - https://github.com/apache/activemq-artemis/pull/3654 - appears to introduce a race condition in QueueImpl#expireReferences(Runnable) that can cause the Runnable "done" parameter to never be invoked. This manifests itself as spurious "AMQ224013: failed to expire messages for queue" messages.

      The problem is this section of code and the non-atomic test-and-set on the expiryScanner.scannerRunning AtomicInteger:

      if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {  <--- "test"
        if (expiryScanner.scannerRunning.incrementAndGet() == 1) { <--- "and-set"
          expiryScanner.doneCallback = done;
        }
        getExecutor().execute(expiryScanner);
      } else {
        // expire is already happening on this queue, move on!
        if (done != null) {
          done.run();
      }
      
      

      Consider the following sequence of events and see if you agree with my analysis:

      1. Two threads, t1 and t2
      2. t1 runs PostOfficeImpl.ExpiryReaper which calls QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is "done". It waits 10 seconds for the latch to drop. If the latch doesn't drop within the timeout it issues a AMQ224013 warning.
      3. t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in turn calls QueueImpl.expireReferences(null)
      4. t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both enter that branch. <--- This is where things start to go wrong, as only one of the two threads will successfully set expiryScanner.done; if the thread that "loses" is the one that supplies a non-null "done" parameter then that Runnable will never be invoked.
      5. t2 successfully increments expiryScanner.scannerRunning to 1 and sets expiryScanner.doneCallback = null (since it passed null as the "done" parameter) and then continues to run expiryScanner.
      6. t1 doesn't increment expiryScanner.scannerRunning and at this point the "done" argument is lost and will never be invoked, guaranteeing the AMQ224013 warning. t1 then also runs the expiryScanner (which has already been submitted once by t2)

      Because of the non-atomic test-and-set, the callback is not guaranteed to be invoked and, also, more than one expiry run will execute simultaneously (this seems to be undesirable based on why this change was made in the first place).

      Should the code not be something like this (I'm not 100% familiar with the semantics of scannerRunning being > 1, so this may not be the correct solution):

       

      if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) { <--- test-and-set
        expiryScanner.doneCallback = done;
        getExecutor().execute(expiryScanner);
      } else {
        // expire is already happening on this queue, move on!
        if (done != null) {
          done.run();
      }
      

       

      This guarantees that only one thread can enter the section that sets expiryScanner.doneCallback and submits expiryScanner for execution; all other threads will immediately have their Runnable invoked. 

      We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while testing on very large instances (64 CPUs and up) that are very busy.

      Attachments

        Activity

          People

            Unassigned Unassigned
            awallgren Anders Wallgren
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: