Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Shared-nothing replication can cause journal misalignment even when no split-brain event occurs.
There are several ways this can happen.
Below are some scenarios that don't involve network partitions or drastic outages.
Scenario 1:
- Master/Primary starts as live, clients connect to it
- Backup becomes an in-sync replica
- User stops the live and the backup fails over
- Backup serves clients alone, modifying its journal
- User stops the backup
- User starts master/primary: it becomes live with a journal misaligned with the most up-to-date one, i.e. the one on the stopped backup
Scenario 2 (involving a network glitch):
- Master/Primary starts as live, clients connect to it
- Backup becomes an in-sync replica
- Connection glitch between backup -> live
- Backup starts trying to fail over (for vote-retries * vote-retry-wait milliseconds)
- Live serves clients alone, modifying its journal
- User stops the live
- Backup succeeds in failing over: it becomes live with a journal misaligned with the most up-to-date one, i.e. the one on the stopped live
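The retry window mentioned in Scenario 2 is a simple product, and it bounds how long the live can keep modifying its journal alone before the backup gives up. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, the values 12 and 5000 in the usage note are purely illustrative, and the time spent on each vote attempt is ignored.

```java
import java.time.Duration;

// Hypothetical helper, not part of Artemis: computes the retry window
// "vote-retries * vote-retry-wait" from Scenario 2.
final class FailoverWindow {

    // Approximate upper bound on how long the backup keeps retrying the vote.
    static Duration maxVoteWindow(int voteRetries, long voteRetryWaitMillis) {
        if (voteRetries < 0 || voteRetryWaitMillis < 0) {
            throw new IllegalArgumentException("retries and wait must be non-negative");
        }
        // multiplyExact fails loudly on overflow instead of wrapping around
        return Duration.ofMillis(Math.multiplyExact((long) voteRetries, voteRetryWaitMillis));
    }
}
```

For example, `FailoverWindow.maxVoteWindow(12, 5000)` gives a 60-second window during which a stop of the live would let the backup win the vote with a stale journal.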
The main cause of this issue is that we allow a single broker to serve clients even though it is configured for HA, which generates the journal misalignment.
The quorum service (classic or pluggable) only takes care of mutually exclusive ownership of the live role (per NodeID), without ordering candidates for the live role by who has the most up-to-date journal.
A possible solution is to build on https://issues.apache.org/jira/browse/ARTEMIS-2716 and use a quorum "logical timestamp/version" that marks the age/ownership changes of the journal, in order to force the live to always have the most up-to-date journal. This means that the value has to be saved locally and exchanged during the initial replica sync, which involves changes to both the journal data and the core message protocol (only for the replication channel, without impacting real clients).
In case of a quorum service restart/outage, an admin must be able to use a command/configuration to make a broker ignore the age of its journal and force it to start.
In addition, new journal CLI commands should be implemented to inspect the age of a (local) broker journal and to query/force the quorum journal version, for troubleshooting purposes.
It's very important to capture every possible event that causes the journal age/ownership to change.
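The start-up check implied above could look roughly like the following. This is only a sketch under stated assumptions: the class, enum and parameter names are hypothetical, the quorum version is modelled as an `OptionalLong` that is empty when the quorum service is unreachable, and the `force` flag stands in for the admin command/configuration override.

```java
import java.util.OptionalLong;

// Hypothetical sketch, not the actual Artemis implementation.
final class JournalVersionCheck {

    enum StartDecision { START_AS_LIVE, REFUSE_MISALIGNED, WAIT_FOR_QUORUM, FORCED_START }

    static StartDecision decide(long localVersion, OptionalLong quorumVersion, boolean force) {
        if (force) {
            // admin override: ignore the age of the local journal
            return StartDecision.FORCED_START;
        }
        if (quorumVersion.isEmpty()) {
            // quorum service restart/outage: nothing to compare against
            return StartDecision.WAIT_FOR_QUORUM;
        }
        // only a journal matching the quorum version is the most up-to-date one
        return localVersion == quorumVersion.getAsLong()
                ? StartDecision.START_AS_LIVE
                : StartDecision.REFUSE_MISALIGNED;
    }
}
```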
Now let's take a look at Scenario 2 with journal versioning/timestamp:
- live broker starts because its local journal version matches the most up-to-date journal version, and increases it (locally and remotely) when it becomes fully live
- backup finds it and trusts that, being live, it already has the most up-to-date journal for a specific NodeID
- live broker sends its journal files to the backup, along with its local journal version
- backup is now ready to fail over at any moment: it stores the received journal version on its local storage
- network glitch happens
- backup tries to become live for vote-retries times
- live detects the replication disconnection and increments the journal version (both the quorum one and the local one)
- live serves clients alone, modifying its journal
- outage/stop causes the live to die
- backup detects that the quorum journal version no longer matches its own local journal version, meaning that something has happened in the meantime: it stops trying to become live
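The steps above can be replayed with a small in-memory model. It is a sketch only: `Quorum`, `Broker` and every method name are hypothetical, an `AtomicLong` stands in for the quorum journal version, and journal file transfer is reduced to copying the version number.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory model of Scenario 2 with journal versioning.
final class VersionedFailoverSketch {

    // Stand-in for the journal version kept by the quorum service.
    static final class Quorum {
        final AtomicLong journalVersion = new AtomicLong();
    }

    static final class Broker {
        final String name;
        long localVersion;
        boolean live;

        Broker(String name) { this.name = name; }

        // A broker may become live only if its local version matches the quorum one.
        boolean tryBecomeLive(Quorum quorum) {
            if (localVersion != quorum.journalVersion.get()) {
                return false;
            }
            live = true;
            bumpVersion(quorum); // becoming live is an ownership change
            return true;
        }

        // Only the live broker changes the version, always with a monotonic increment.
        void bumpVersion(Quorum quorum) {
            if (!live) {
                throw new IllegalStateException(name + " is not live");
            }
            localVersion = quorum.journalVersion.incrementAndGet();
        }

        // Initial replica sync: the backup stores the version sent by the live.
        void syncFrom(Broker liveBroker) { localVersion = liveBroker.localVersion; }

        // Losing the replica is another ownership change.
        void onReplicaLost(Quorum quorum) { bumpVersion(quorum); }
    }

    // Returns whether the backup manages to become live at the end of Scenario 2.
    static boolean scenario2BackupBecomesLive() {
        Quorum quorum = new Quorum();
        Broker primary = new Broker("primary");
        Broker backup = new Broker("backup");
        primary.tryBecomeLive(quorum); // versions match (0 == 0): live, version becomes 1
        backup.syncFrom(primary);      // backup stores version 1
        primary.onReplicaLost(quorum); // network glitch: version becomes 2
        primary.live = false;          // outage/stop of the live
        return backup.tryBecomeLive(quorum); // 1 != 2: the backup must not become live
    }
}
```

In this model `scenario2BackupBecomesLive()` returns `false`: the backup still holds version 1 while the quorum holds version 2, so it stops trying to become live.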
The key parts related to journal age/version are:
- only the live broker can change the quorum (and local) journal version (with a monotonic increment)
- every ownership change event must cause the journal age/version to change, e.g. starting as live, losing its backup, etc.
Regarding the RI implementation using Apache Curator, it could use a separate DistributedAtomicLong to manage the journal version.
Although tempting, it's not a good idea to just use the data field on InterProcessSemaphoreV2, because:
- there's no API to query it if no lease has been acquired (or created) yet
- data cannot change while the lock is held, so the journal age could not be increased when a replica drops
Although tempting, it's also not a good idea to just use the identity of the last live broker connector instead of a journal version, because of the ABA problem (see https://en.wikipedia.org/wiki/ABA_problem).
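The ABA concern can be illustrated with a few lines of plain Java. This is an illustrative sketch with hypothetical names, not broker code: a broker that last saw "A" as live cannot tell, by identity alone, that "B" was live in between, while a monotonically increasing version exposes the change.

```java
// Hypothetical illustration of the ABA problem applied to live ownership.
final class AbaSketch {

    // Change detection based on the identity of the last live broker.
    static boolean changedByIdentity(String observedLiveId, String currentLiveId) {
        return !observedLiveId.equals(currentLiveId);
    }

    // Change detection based on a monotonically increasing journal version.
    static boolean changedByVersion(long observedVersion, long currentVersion) {
        return observedVersion != currentVersion;
    }

    // Replays live ownership A -> B -> A and reports what each check detects:
    // index 0 is the identity-based result, index 1 the version-based one.
    static boolean[] replay() {
        String liveId = "A";
        long version = 1;
        String observedId = liveId;     // an observer snapshots the current state
        long observedVersion = version;

        liveId = "B";                   // B takes over and modifies the journal
        version++;
        liveId = "A";                   // A takes over again
        version++;

        return new boolean[] {
            changedByIdentity(observedId, liveId),
            changedByVersion(observedVersion, version)
        };
    }
}
```

Here the identity check reports no change (`false`) even though the journal changed hands twice, while the version check reports a change (`true`).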
This versioning mechanism isn't without drawbacks: quorum journal versioning requires storing a local copy of the version, so that the broker can query it and compare it with the quorum one on restart. Since these are 2 separate, non-atomic operations, there must be a way to reconcile/fix them in case of misalignment. As said above, this could be done with admin operations.
Journal versioning changes the way roles behave, but they still retain their key characteristics:
- backup should try to start as live if it has the most up-to-date journal and there is no other live around; otherwise, it can just rotate its journal and be available to replicate some live
- primary tries to fail back to any existing live with the most up-to-date journal or awaits for it to appear, without becoming live if it doesn't have the most up-to-date journal
This would ensure that if both brokers are up and running and the backup allows a primary to fail back, the primary eventually becomes live and the backup replicates it, preserving the desired broker roles.
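The role rules above can be summarised as two small decision functions. This is a sketch with hypothetical names; in particular, letting the primary become live when it holds the most up-to-date journal and no live exists is an assumption read into the description, which only states that it must not become live with a stale journal.

```java
// Hypothetical summary of the role behaviour under journal versioning.
final class RoleDecisionSketch {

    enum Action { BECOME_LIVE, ROTATE_JOURNAL_AND_REPLICATE, FAIL_BACK, AWAIT_UP_TO_DATE_LIVE }

    // Backup: start as live only with the most up-to-date journal and no live around.
    static Action backup(boolean hasMostUpToDateJournal, boolean liveExists) {
        if (hasMostUpToDateJournal && !liveExists) {
            return Action.BECOME_LIVE;
        }
        return Action.ROTATE_JOURNAL_AND_REPLICATE;
    }

    // Primary: fail back to an up-to-date live if there is one; never go live on a stale journal.
    static Action primary(boolean hasMostUpToDateJournal, boolean upToDateLiveExists) {
        if (upToDateLiveExists) {
            return Action.FAIL_BACK;
        }
        // assumption: with the most up-to-date journal and no live around, the primary may start as live
        return hasMostUpToDateJournal ? Action.BECOME_LIVE : Action.AWAIT_UP_TO_DATE_LIVE;
    }
}
```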
Issue Links
- causes: ARTEMIS-3767 Replication inconsistencies between 2.17 and main (Closed)
- is blocked by: ARTEMIS-3402 Split Brain detection should reject bad member updates (Closed)
- is related to: ARTEMIS-2716 Implements pluggable Quorum Vote (Closed)