[IGNITE-20124] Prevent double storage updates within primary - ASF JIRA

Details

Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- ignite-3
- transactions

Epic Link:
AI3 Performance

Description

Motivation

To preserve the guarantee that the primary replica is always up to date, the following is required:

For a common RW transaction: insert the writeIntent into the storage on the primary before replication.
For a one-phase commit: insert the committedWrite after replication.

Both are already implemented. However, if the primary is part of the replication group, which is true in almost all cases, the update is applied twice:

For a common RW transaction: a second time through replication.
For a one-phase commit: either through replication, or through the post-replication update if replication was processed locally first.

Definition of Done

Prevent double storage updates within primary.

Implementation Notes

The easiest way to prevent a double insert is to skip an update if the local safe time is greater than or equal to that of the candidate update. There are three places where the partition storage is updated:

1. Primary pre-replication update. In this case, the second update, applied on replication, should be skipped.
2. Primary post-replication update in the case of 1PC. The data may already be updated if replication has already been processed locally. This is expected to be covered by https://issues.apache.org/jira/browse/IGNITE-15927 . We should check the primary safe time on the post-replication update and skip the update if the safe time has already been advanced.
3. Insert through replication. For non-1PC transactions, every primary would see a double insert (see 1). For 1PC it depends on timing, so we should check the safe time on the primary to decide whether the update should be applied (see 2).
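
The skip rule above can be illustrated with a minimal sketch. It is not Ignite code: the class `SafeTimeSkipSketch`, its methods, and the use of a plain `long` as the safe time are all hypothetical, chosen only to show the "skip if local safe time >= candidate" comparison.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a primary replica's partition storage; not Ignite's actual API. */
class SafeTimeSkipSketch {
    /** Local safe time, modelled as a plain long timestamp (assumption). */
    private long safeTime;

    /** Timestamps of the updates that actually reached the storage. */
    final List<Long> appliedUpdates = new ArrayList<>();

    /** Returns true if the local safe time shows the candidate was already applied. */
    boolean alreadyApplied(long candidateTs) {
        return safeTime >= candidateTs;
    }

    /** Applies the update unless it is a duplicate; returns whether it was applied. */
    boolean applyIfNew(long candidateTs) {
        if (alreadyApplied(candidateTs)) {
            return false; // Duplicate: skip the storage write.
        }
        appliedUpdates.add(candidateTs);
        safeTime = candidateTs;
        return true;
    }

    public static void main(String[] args) {
        SafeTimeSkipSketch primary = new SafeTimeSkipSketch();
        boolean first = primary.applyIfNew(10L);  // Pre-replication update on the primary.
        boolean second = primary.applyIfNew(10L); // Same update arriving through replication.
        System.out.println(first + " " + second + " " + primary.appliedUpdates);
    }
}
```

In this model the same check serves all three update paths; only the caller differs.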

In every case, the storage indexes should still be advanced on replication, as is done now, because the progress of indexes on FSM write operations must not be violated; otherwise, a Raft snapshot-based rebalance would break. As a result, the primary may perform two non-consistent storage updates, which may fall into different fsyncs. This does not violate transactional correctness: the only consequence is that some writes or write intents may get ahead of the indexes and therefore be included in snapshots, and we can still process such writes and resolve such write intents.
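
A rough sketch of that requirement, under the same caveats: `ReplicationHandlerSketch` and its fields are hypothetical names, and the Raft index and safe time are modelled as plain `long` values. The point is only that the index advances on every replicated command, whether or not the data write is skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical replication handler on the primary; not Ignite's actual API. */
class ReplicationHandlerSketch {
    private long safeTime;         // Assumption: safe time as a plain timestamp.
    private long lastAppliedIndex; // Assumption: Raft index tracked by the storage.

    /** Timestamps of the data writes that actually reached the storage. */
    final List<Long> storageWrites = new ArrayList<>();

    /** Pre-replication update: advance the safe time, then write the data. */
    void preReplicationUpdate(long ts) {
        safeTime = ts;
        storageWrites.add(ts);
    }

    /** Replicated command: skip the data write if it is a duplicate, always advance the index. */
    void onReplicatedCommand(long raftIndex, long ts) {
        if (safeTime < ts) {
            storageWrites.add(ts);
            safeTime = ts;
        }
        lastAppliedIndex = raftIndex; // Unconditional, so snapshot-based rebalance sees progress.
    }

    long lastAppliedIndex() {
        return lastAppliedIndex;
    }

    public static void main(String[] args) {
        ReplicationHandlerSketch primary = new ReplicationHandlerSketch();
        primary.preReplicationUpdate(5L);
        primary.onReplicatedCommand(1L, 5L); // Data write skipped, index still advances.
        System.out.println(primary.storageWrites + " index=" + primary.lastAppliedIndex());
    }
}
```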

Also, the safe time now needs to be updated on the primary replica. The following scenarios are possible:

Two-phase commit: we can advance the safe time on the primary, make the pre-replication update, and then run the Raft command. Both the safe time adjustment and the storage update happen before replication.
One-phase commit: the safe time should be advanced after the Raft command future completes. There is no happens-before relation between the future callback and the replication handler, so the safe time should be checked and advanced in both places. We should use a critical section to prevent races between the safe time check, the safe time adjustment, and the storage update. For this purpose we can use the row locks that are taken inside storage#runConsistently.
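
The one-phase-commit race can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: a `ReentrantLock` stands in for the row locks taken inside storage#runConsistently, and `OnePhaseCommitGuardSketch`, `checkAndApply`, and `raceOnce` are hypothetical names. Two threads play the roles of the future callback and the replication handler, and whichever arrives second finds the safe time already advanced.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical 1PC guard; the lock is a stand-in for row locks, not Ignite's actual API. */
class OnePhaseCommitGuardSketch {
    private final ReentrantLock rowLock = new ReentrantLock();
    private long safeTime;     // Assumption: safe time as a plain timestamp.
    private int storageWrites; // Number of data writes that reached the storage.

    /** Check, storage update and safe time adjustment in one critical section. */
    boolean checkAndApply(long commitTs) {
        rowLock.lock();
        try {
            if (safeTime >= commitTs) {
                return false; // The other path already applied this update.
            }
            storageWrites++;
            safeTime = commitTs;
            return true;
        } finally {
            rowLock.unlock();
        }
    }

    /** Runs both paths concurrently and returns how many storage writes happened. */
    static int raceOnce(long commitTs) {
        OnePhaseCommitGuardSketch primary = new OnePhaseCommitGuardSketch();
        Thread futureCallback = new Thread(() -> primary.checkAndApply(commitTs));
        Thread replicationHandler = new Thread(() -> primary.checkAndApply(commitTs));
        futureCallback.start();
        replicationHandler.start();
        try {
            futureCallback.join();
            replicationHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return primary.storageWrites; // Safe to read: both threads have terminated.
    }

    public static void main(String[] args) {
        System.out.println("storage writes: " + raceOnce(42L));
    }
}
```

Without the lock, both threads could pass the check before either advances the safe time, which is the race the critical section is meant to rule out.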

People

Assignee: Alexander Lapin

Reporter: Alexander Lapin

Votes: 0

Watchers: 2

Dates

Created: 01/Aug/23 14:42

Updated: 13/Jun/24 07:47