Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Not A Problem
-
2.1.0
-
None
-
None
Description
We are running artemis in a master/slave setup (1 master, 1 slave) using replication. We had artemis 2.0.0 and reported this issue: ARTEMIS-1161 . Following advice, we replaced artemis 2.0.0 with 2.1.0 and the failback still doesn't work, failing in a different form though, in about 50% of our test experiments.
Now, during failback, master reports:
07:11:22,962 INFO [org.apache.activemq.artemis.integration.bootstrap] AMQ101000: Starting ActiveMQ Artemis Server 07:11:22,975 INFO [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=/var/lib/artemis/shadowbroker/data/journal,bindingsDirectory=/var/lib/artemis/shadowbroker/data/bindings,largeMessagesDirectory=/var/lib/artemis/shadowbroker/data/largemessages,pagingDirectory=/var/lib/artemis/shadowbroker/data/paging) 07:11:23,416 INFO [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /var/lib/artemis/shadowbroker/data/bindings/oldreplica.313 07:11:23,421 INFO [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /var/lib/artemis/shadowbroker/data/bindings to /var/lib/artemis/shadowbroker/data/bindings/oldreplica.315 07:11:23,423 INFO [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /var/lib/artemis/shadowbroker/data/journal/oldreplica.313 07:11:23,426 INFO [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /var/lib/artemis/shadowbroker/data/journal to /var/lib/artemis/shadowbroker/data/journal/oldreplica.315 07:11:23,427 INFO [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /var/lib/artemis/shadowbroker/data/paging/oldreplica.252 07:11:23,427 INFO [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /var/lib/artemis/shadowbroker/data/paging to /var/lib/artemis/shadowbroker/data/paging/oldreplica.254 07:11:23,437 INFO [org.eclipse.jetty.util.log] Logging initialized @1968ms to org.eclipse.jetty.util.log.Slf4jLog 07:11:23,448 INFO [org.apache.activemq.artemis.core.server] AMQ221012: Using AIO Journal 07:11:23,533 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE 07:11:23,533 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP 07:11:23,534 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ 07:11:23,534 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT 07:11:23,534 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE 07:11:23,535 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP 07:11:23,591 INFO [org.eclipse.jetty.server.Server] jetty-9.4.z-SNAPSHOT 07:11:23,643 INFO [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.1.0 [null] started, waiting live to fail before it gets active 07:11:23,702 INFO [org.eclipse.jetty.webapp.StandardDescriptorProcessor] NO JSP Support for /jolokia, did not find org.eclipse.jetty.jsp.JettyJspServlet 07:11:23,713 INFO [org.eclipse.jetty.server.session] DefaultSessionIdManager workerName=node0 07:11:23,714 INFO [org.eclipse.jetty.server.session] No SessionScavenger set, using defaults 07:11:23,715 INFO [org.eclipse.jetty.server.session] Scavenging every 660000ms 07:11:23,764 INFO [org.eclipse.jetty.ContextHandler.jolokia] jolokia-agent: No access restrictor found, access to any MBean is allowed 07:11:23,910 INFO [org.eclipse.jetty.server.handler.ContextHandler] Started o.e.j.w.WebAppContext@127a7a2e{/jolokia,file:///var/lib/artemis/shadowbroker/tmp/jetty-localhost-8161-jolokia.war-_jolokia-any-6088373165208569604.dir/webapp/,AVAILABLE}{/opt/artemis/web/jolokia.war} 07:11:23,911 INFO [org.eclipse.jetty.server.handler.ContextHandler] Started o.e.j.s.h.ContextHandler@14008db3{/,file:///opt/artemis/web/,AVAILABLE} 07:11:23,925 INFO [org.eclipse.jetty.server.AbstractConnector] Started ServerConnector@28d18df5{HTTP/1.1,[http/1.1]}{localhost:8161} 07:11:23,926 INFO [org.eclipse.jetty.server.Server] Started @2459ms 07:11:23,927 INFO [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://localhost:8161 07:11:23,927 INFO [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://localhost:8161/jolokia 07:11:24,745 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=47e70ff8-2b2c-11e7-84d3-525400ec0962 is synchronized with live-server. 07:11:24,760 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced 07:12:54,442 INFO [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=47e70ff8-2b2c-11e7-84d3-525400ec0962 to become 'live' 07:12:54,556 WARN [org.apache.activemq.artemis.core.server] AMQ222165: No Dead Letter Address configured for queue test-load in AddressSettings 07:12:54,556 WARN [org.apache.activemq.artemis.core.server] AMQ222166: No Expiry Address configured for queue test-load in AddressSettings 07:12:54,910 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live 07:12:54,933 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE] 07:12:54,936 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61613 for protocols [STOMP]
* Note the logs: Backup server ActiveMQServerImpl::serverUUID=xxx is synchronized with live-server / backup announced
and slave reports:
07:11:23,818 INFO [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending AIOSequentialFile:/var/lib/artemis/shadowbroker/data/journal/activemq-data-1759.amq (size=10,485,760) to replica. 07:11:24,352 INFO [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1448.bindings (size=1,048,576) to replica. 07:11:24,366 INFO [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1456.bindings (size=1,048,576) to replica. 07:11:24,389 INFO [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1447.bindings (size=1,048,576) to replica. 07:12:54,399 WARN [org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager] AMQ119114: Replication synchronization process timed out after waiting 90,000 milliseconds: java.lang.IllegalStateException: AMQ119114: Replication synchronization process timed out after waiting 90,000 milliseconds at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:607) [artemis-server-2.1.0.jar:2.1.0] at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:579) [artemis-server-2.1.0.jar:2.1.0] at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:171) [artemis-server-2.1.0.jar:2.1.0] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92] 07:12:54,401 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119014: Did not receive data from /172.23.12.5:37606 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT] 07:12:54,401 WARN [org.apache.activemq.artemis.core.server] AMQ222092: Connection to the backup node failed, removing replication now: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ119014: Did not receive data from /172.23.12.5:37606 within the 60,000ms connection TTL. The connection will now be closed.] at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread$2.run(RemotingServiceImpl.java:723) [artemis-server-2.1.0.jar:2.1.0] at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:101) [artemis-commons-2.1.0.jar:2.1.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_92] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_92] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92] 07:12:54,406 WARN [org.apache.activemq.artemis.core.server] AMQ222013: Error when trying to start replication: java.lang.IllegalStateException: AMQ119114: Replication synchronization process timed out after waiting 90,000 milliseconds at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:607) [artemis-server-2.1.0.jar:2.1.0] at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:579) [artemis-server-2.1.0.jar:2.1.0] at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:171) [artemis-server-2.1.0.jar:2.1.0] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92] 07:12:54,412 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119014: Did not receive data from /172.20.98.64:45224 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT] 07:12:54,412 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8431e0ae-3af1-11e7-ac67-0acf5bbeb077 07:12:54,416 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8431e0ae-3af1-11e7-ac67-0acf5bbeb077 07:12:54,416 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8432a3ff-3af1-11e7-ac67-0acf5bbeb077 07:12:54,418 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8432a3ff-3af1-11e7-ac67-0acf5bbeb077 07:12:54,418 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 84338e60-3af1-11e7-ac67-0acf5bbeb077 07:12:54,421 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 84338e60-3af1-11e7-ac67-0acf5bbeb077 07:12:54,421 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8435b141-3af1-11e7-ac67-0acf5bbeb077 07:12:54,422 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8435b141-3af1-11e7-ac67-0acf5bbeb077 07:12:54,422 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8436e9c2-3af1-11e7-ac67-0acf5bbeb077 07:12:54,424 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8436e9c2-3af1-11e7-ac67-0acf5bbeb077 07:12:54,424 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8437fb33-3af1-11e7-ac67-0acf5bbeb077 07:12:54,425 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8437fb33-3af1-11e7-ac67-0acf5bbeb077 07:13:31,653 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8431e090-3af1-11e7-b91a-0acf5bbeb077 07:13:31,654 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8431e090-3af1-11e7-b91a-0acf5bbeb077 07:13:31,654 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8432a3e1-3af1-11e7-b91a-0acf5bbeb077 07:13:31,655 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8432a3e1-3af1-11e7-b91a-0acf5bbeb077 07:13:31,655 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 84336732-3af1-11e7-b91a-0acf5bbeb077 07:13:31,655 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 84336732-3af1-11e7-b91a-0acf5bbeb077 07:13:31,655 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 84340373-3af1-11e7-b91a-0acf5bbeb077 07:13:31,656 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 84340373-3af1-11e7-b91a-0acf5bbeb077 07:13:31,656 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 8435b124-3af1-11e7-b91a-0acf5bbeb077 07:13:31,656 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 8435b124-3af1-11e7-b91a-0acf5bbeb077 07:13:31,656 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 84367475-3af1-11e7-b91a-0acf5bbeb077 07:13:31,663 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 84367475-3af1-11e7-b91a-0acf5bbeb077
* Note the logs: Replication synchronization process timed out after waiting 90,000 milliseconds / Did not receive data from /xxx:37606 within the 60,000ms connection TTL
Same as before, master has the following ha-policy:
<ha-policy> <replication> <master> <check-for-live-server>true</check-for-live-server> <initial-replication-sync-timeout>90000</initial-replication-sync-timeout> </master> </replication> </ha-policy>
and slave has:
<ha-policy> <replication> <slave> <allow-failback>true</allow-failback> <initial-replication-sync-timeout>90000</initial-replication-sync-timeout> </slave> </replication> </ha-policy>
For running the tests, we send messages using the artemis producer - command:
./artemis producer --url "tcp://xxx:61616?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://test-load --message-count 60000 --user xxx --password xxx --sleep 200 --message-size 4000 --threads 5
and consume messages with the artemis consumer - command:
./artemis consumer --url "tcp://xxx:61616?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1" --destination queue://test-load --threads 5 --message-count 60000 --user xxx --password xxx
Please let us know if we can provide some other info about this new issue.