[ARTEMIS-3767] Replication inconsistencies between 2.17 and main - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.18.0, 2.19.0, 2.19.1, 2.20.0, 2.21.0, 2.22.0, 2.23.0, 2.23.1, 2.24.0
Fix Version/s: 2.24.0
Component/s: Broker
Labels:
None
Environment:

AWS EC2 t3a.large

CentOS Linux release 7.9.2009

OpenJDK 8, OpenJDK 11

Description

It's not possible to perform a rolling upgrade in a replication environment. After upgrading the slave from 2.17 to 2.18, it reports:

```
AMQ214013: Failed to decode packet: java.lang.IndexOutOfBoundsException: readerIndex(57) + length(1) exceeds writerIndex(57): PooledUnsafeDirectByteBuf(ridx: 57, widx: 57, cap: 57)
```

The 2.17 master then crashes with an exception:

```
2022-04-07 10:01:23,032 WARN  [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds: ActiveMQReplicationTimeooutException[errorType=REPLICATION_TIMEOUT_ERROR message=AMQ229114: Replication synchronization process timed out after waiting 30,000 milliseconds]
        at org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:660) [artemis-server-2.17.0.jar:2.17.0]
        at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:717) [artemis-server-2.17.0.jar:2.17.0]
        at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:180) [artemis-server-2.17.0.jar:2.17.0]
        at java.base/java.lang.Thread.run(Thread.java:829) [java.base:]
```

Upgrades from earlier versions (or to later versions) aren't possible either.
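The AMQ214013 message shows the 2.18 slave trying to read one byte past the end of a 57-byte packet sent by the 2.17 master, which is the typical symptom when one side's decoder expects a field that the other side's encoder never writes. The Java sketch below is a hypothetical illustration of that kind of mismatch and of one common way to tolerate it (reading a trailing field only when bytes remain). The class name, methods, and packet layout are invented for illustration. It uses `java.nio.ByteBuffer`, whose `BufferUnderflowException` plays the role of Netty's `IndexOutOfBoundsException` in the log. It is not Artemis code and not the fix from the linked pull requests.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration only: NOT Apache Artemis code.
// Models a packet whose body gained one trailing byte in a newer version.
public class ReplicationPacketSketch {

    // Decoded contents of the hypothetical packet.
    static final class Packet {
        final long id;
        final boolean hasFlag; // false when the sender used the old layout
        final byte flag;

        Packet(long id, boolean hasFlag, byte flag) {
            this.id = id;
            this.hasFlag = hasFlag;
            this.flag = flag;
        }
    }

    // Old layout (assumed): only an 8-byte id.
    static byte[] encodeOld(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    // New layout (assumed): the id followed by one extra byte.
    static byte[] encodeNew(long id, byte flag) {
        return ByteBuffer.allocate(Long.BYTES + 1).putLong(id).put(flag).array();
    }

    // Naive decoder: always reads the new trailing byte,
    // so it throws BufferUnderflowException on an old-layout packet.
    static Packet decodeStrict(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long id = buf.getLong();
        byte flag = buf.get();
        return new Packet(id, true, flag);
    }

    // Tolerant decoder: reads the trailing byte only if the sender wrote it.
    static Packet decodeTolerant(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long id = buf.getLong();
        if (buf.hasRemaining()) {
            return new Packet(id, true, buf.get());
        }
        return new Packet(id, false, (byte) 0);
    }

    public static void main(String[] args) {
        byte[] fromOldBroker = encodeOld(42L);
        try {
            decodeStrict(fromOldBroker);
            System.out.println("strict decode succeeded");
        } catch (BufferUnderflowException e) {
            System.out.println("strict decode failed: read past the end of the packet");
        }
        Packet p = decodeTolerant(fromOldBroker);
        System.out.println("tolerant decode: id=" + p.id + ", hasFlag=" + p.hasFlag);
    }
}
```

Tolerating a missing trailing field only works when new fields are appended at the end; the opposite approach is for the newer side to omit the field when it knows its peer is older. Which approach the actual fix takes is not stated in this issue.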

Steps to replicate the issue:

Create a master instance (replace the IPs to match your setup):

```
apache-artemis-2.17.0/bin/artemis create --aio --allow-anonymous --user admin --password admin --clustered --cluster-user admin --cluster-password admin --host 10.35.4.16 --http-host 10.35.4.16 --replicated --staticCluster tcp://10.35.4.211:61616 -- broker-master
```

Start the instance:
```
broker-master/bin/artemis run
```

Create a slave instance (it's fine to start with 2.18 right away; no actual upgrade is needed):

```
apache-artemis-2.18.0/bin/artemis create --aio --allow-anonymous --user admin --password admin --clustered --slave --cluster-user admin --cluster-password admin --host 10.35.4.211 --http-host 10.35.4.211 --replicated --staticCluster tcp://10.35.4.16:61616 -- broker-slave
```

Start the instance:
```
broker-slave/bin/artemis run 
```

The master crashes, while the slave keeps running but does nothing.

Attachments

broker-master.log
07/Apr/22 11:53
12 kB
Jan Šmucr
broker-slave.log
07/Apr/22 11:53
15 kB
Jan Šmucr

Issue Links

is caused by

ARTEMIS-3340 Replicated Journal quorum-based logical timestamp/version

Closed

links to

GitHub Pull Request #4144

GitHub Pull Request #4150

People

Assignee:: Clebert Suconic

Reporter:: Jan Šmucr

Votes:: 0

Watchers:: 3

Dates

Created:: 07/Apr/22 11:53

Updated:: 21/Jul/22 22:01

Resolved:: 21/Jul/22 22:01

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1.5h