ActiveMQ Artemis / ARTEMIS-2958

Timed out waiting pool stop on backup restart


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.16.0
    • Component/s: Broker, JMX
    • Labels: None

    Description

      The changes in ARTEMIS-2823 moved the AMQ thread pool stop (on server stop) to after callDeActiveCallbacks(), while the changes in ARTEMIS-2838, which call server.getServer().getManagementService().unregisterHawtioSecurity() from callDeActiveCallbacks(), moved the HawtioSecurity de-registration to happen on server stop.
      This means that during a server restart, if the thread pool is slow to stop, JMX won't be available again until the next start.
      The thread pool stop can block the whole stop sequence if a long task is still running/blocked in the pool, and the default strategy while stopping the broker is to await 10 seconds before forcing a shutdown of the pending tasks, i.e. JMX can be unavailable for at least 10 seconds after callDeActiveCallbacks() and before a new (pre)activation registers it again.
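      As a rough illustration, the stop ordering described above looks roughly like the following sketch (hypothetical class and method bodies, not the actual ActiveMQServerImpl code): the JMX/HawtioSecurity de-registration happens first, and only then is the pool shut down with a bounded await.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;

      // Minimal sketch of the stop ordering (hypothetical, not the actual ActiveMQServerImpl code):
      // JMX is unregistered first, then the pool stop waits up to 10 seconds before forcing
      // a shutdown of the pending tasks.
      public class StopOrderingSketch {

         private final ExecutorService threadPool = Executors.newCachedThreadPool();

         public void stop() throws InterruptedException {
            callDeActiveCallbacks();            // HawtioSecurity/JMX de-registration happens here
            threadPool.shutdown();
            if (!threadPool.awaitTermination(10, TimeUnit.SECONDS)) {
               threadPool.shutdownNow();        // a blocked task keeps JMX unavailable for the whole wait
            }
         }

         private void callDeActiveCallbacks() {
            // placeholder: would call managementService.unregisterHawtioSecurity() among other callbacks
         }
      }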

      This stealthy behaviour was captured by random failures of org.apache.activemq.artemis.tests.smoke.jmxfailback.JmxFailbackTest#testFailbackOnJMX, due to some randomly blocked task in the thread pool.
      The core issue that causes the thread pool to be randomly blocked had been present for a long time, but the JMX unavailability window introduced by the JIRAs mentioned above is the change that triggered it.
      The test checks the availability of the backup JMX connection for 5 seconds during a backup restart (on failback): given that the default thread pool await time is 10 seconds, a longer thread pool stop makes the test fail.
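      The check the test performs is roughly equivalent to the hypothetical sketch below (not the actual JmxFailbackTest code; the service URL and polling interval are assumptions): it polls the backup's JMX connector and fails if no connection succeeds within the timeout.

      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      // Hypothetical sketch of the 5-second JMX availability check performed during failback.
      public final class JmxAvailabilityCheck {

         // hypothetical backup JMX service URL
         private static final String BACKUP_JMX_URL = "service:jmx:rmi:///jndi/rmi://localhost:10099/jmxrmi";

         public static boolean isJmxAvailableWithin(long timeoutMillis) throws InterruptedException {
            final long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
               try {
                  JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(BACKUP_JMX_URL));
                  connector.close();
                  return true;   // backup JMX registered again after (pre)activation
               } catch (Exception expectedWhileUnavailable) {
                  Thread.sleep(100);
               }
            }
            return false;        // a 10-second pool stop makes a 5-second check like this one fail
         }
      }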

      From a thread dump inspection, the pending task seems to be:

      jmx-failback2-out:"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1f97cf0d)" Id=43 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
      jmx-failback2-out:      at sun.misc.Unsafe.park(Native Method)
      jmx-failback2-out:      -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2e2776c1
      jmx-failback2-out:      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      jmx-failback2-out:      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
      jmx-failback2-out:      at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:504)
      jmx-failback2-out:      -  locked java.lang.Object@607e79a2
      jmx-failback2-out:      at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
      jmx-failback2-out:      at org.apache.activemq.artemis.core.server.cluster.ClusterControl.authorize(ClusterControl.java:80)
      jmx-failback2-out:      at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:271)
      jmx-failback2-out:      at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
      jmx-failback2-out:      at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
      jmx-failback2-out:      at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
      jmx-failback2-out:      at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$38/327040562.run(Unknown Source)
      jmx-failback2-out:      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      jmx-failback2-out:      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      jmx-failback2-out:      at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
      jmx-failback2-out:
      jmx-failback2-out:      Number of locked synchronizers = 1
      jmx-failback2-out:      - java.util.concurrent.ThreadPoolExecutor$Worker@6e676dc8
      

      And indeed it seems that both BackupManager::stop and BackupManager::activated call BackupConnector::close, which calls closeLocator(backupServerLocator) without unblocking clusterControl.authorize().

      A possible fix would be to correctly unblock any blocking call in both cases, in order to cleanly stop BackupManager and let the broker thread pool stop immediately.
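      A rough sketch of that direction, with hypothetical names (not the actual BackupManager/BackupConnector code), could look like this: remember which pooled task is blocked on the cluster-control handshake and abort it when close() is called, so the task returns and the pool can terminate within the await window.

      import java.util.concurrent.atomic.AtomicReference;

      // Hypothetical sketch of the proposed fix: unblock the in-flight blocking call on close()
      // instead of letting the broker thread pool wait out its 10-second stop timeout.
      final class BackupConnectorSketch {

         private final AtomicReference<Thread> inFlightTask = new AtomicReference<>();
         private volatile boolean closed;

         void connectToLive() {
            inFlightTask.set(Thread.currentThread());
            try {
               if (!closed) {
                  authorizeBlocking(); // stands in for ClusterControl.authorize() parked in sendBlocking()
               }
            } finally {
               inFlightTask.set(null);
            }
         }

         void close() {
            closed = true;
            final Thread blocked = inFlightTask.get();
            if (blocked != null) {
               blocked.interrupt(); // unblock the pending authorize() so the pooled task can return
            }
            // closeLocator(backupServerLocator) would follow here, as in the real close()
         }

         private void authorizeBlocking() {
            // placeholder for the blocking cluster-control authorize handshake
         }
      }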

People

    Assignee: Francesco Nigro (nigrofranz)
    Reporter: Francesco Nigro (nigrofranz)
    Votes: 0
    Watchers: 2


Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 2.5h