Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
Description
Slack discussion with jbertram see here:
https://the-asf.slack.com/archives/CFL910J30/p1699520764243169
Initial Description
consumers-window-size set to 0 and there some queues, that are heavily used and have consumers attached, that allow for a concurrency of 500 messages (larger than thread-pool-client-size).
After some time - unclear what triggers it - the behavior becomes bad in the way that from those queues only 1-5 messages are fetched and immediately processed, but then there is a pause of 10 seconds before the next 1-5 messages are fetched. This continues.
In the thread-dumps, several like the following can be seen:
"Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable [0x00007f8fdbae5000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method) - parking to wait for <0x00000003f6c9d1d0> (a org.apache.activemq.artemis.utils.AbstractLatch$CountSync) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074) at org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635) at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
which corresponds to code location with a 10 seconds sleep here:
https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854
It seems it always needs to wait until the 10 seconds are reached, which is unexpected, since the latch should count-down before, after having sent the credit to the server. We could assume some bug, that maybe increments the latch, but misses to decrement, so it is not able to reach 0 ever again, but that's speculative.
Evaluation
See slack for details, it boils down to the thread-pool in use being out of free threads, so the slow-consumer-handling is not able to run in good time.
Affected versions: all since at least 2.22.0 up to current - probably since "ever", involved code derives from HornetQ.
Possible Solutions
We experimented a bit and found a solution for us - or actually 2 variants.
- No. 1 just uses an own single-threaded executor for flow-control, whose only job is to guarantee the credit was sent before continuing with other operations when starting/resetting slow-consumers.
This looks simple and works perfectly in the test-case as well as in the real product in stress-tests
https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e - No. 2 has a similar idea, but goes the full way of adding an own thread-pool for flow-control and adding it like the others.
This is more fitting into the style of the existing thread-pools and allows its configuration.
https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0
and jbertram said:
I think either one would be fine. Create a Jira and send a PR.
So here is the Jira and I will push No. 2 as a pull-request: https://github.com/apache/activemq-artemis/pull/4699