[IGNITE-17263] Implement leader to replica safe time propagation - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-beta1
Component/s: None
Labels:
- ignite-3
- transaction3_ro

Epic Link:
Ignite 3 transactions

Description

To perform replica reads, we must either use readIndex or check the safe time. Let's recall the corresponding section of the tx design document.

RO transactions can be executed on non-primary replicas. Write intent resolution alone doesn’t help, because a write intent for a committed transaction may not yet be replicated to the replica. To mitigate this issue, it’s enough to run readIndex on each mapped partition leader, fetch the commit index, and wait on the replica until it’s applied. This guarantees that all required write intents are replicated and present locally. After that, normal write intent resolution should do the job.
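The readIndex flow described above can be sketched as a chain of three asynchronous steps. This is only an illustration: `ReplicaRead`, `readIndex`, `waitApplied`, and `localRead` are hypothetical names, not Ignite APIs, and the leader call and the local apply-wait are passed in as plain functions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongFunction;
import java.util.function.Supplier;

/** Hypothetical helper: composes the readIndex-based replica read. */
final class ReplicaRead {
    private ReplicaRead() {
    }

    /**
     * @param readIndex   asks the partition leader for its commit index (one network RTT)
     * @param waitApplied completes once the local replica has applied the given index
     * @param localRead   the actual read, including write intent resolution
     */
    static <T> CompletableFuture<T> read(
            Supplier<CompletableFuture<Long>> readIndex,
            LongFunction<CompletableFuture<Void>> waitApplied,
            Supplier<T> localRead) {
        return readIndex.get()
                // Do not touch local data until the leader's commit index is applied here.
                .thenCompose(commitIndex -> waitApplied.apply(commitIndex))
                // All required write intents are now present locally.
                .thenApply(ignored -> localRead.get());
    }
}
```

The network round trip in the first step is the cost that the safeTs option below avoids.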

There is a second option, which doesn’t require the network RTT. We can use a special low watermark timestamp (safeTs) per replication group, which corresponds to the apply index of a replicated entry: when the apply index advances during replication, the safeTs is monotonically incremented too. The HLC timestamp used to advance the safeTs is assigned to each replicated entry in an ordered way.

Special measures are needed to periodically advance the safeTs if no updates are happening. It’s enough to use a special replication command for this purpose.

All we need during RO txn is to wait until a safeTs advances past the RO txn readTs.
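A minimal sketch of the waiting side, assuming timestamps are plain `long` values and that a read may proceed as soon as safeTs has reached readTs (switch to a strict comparison if the design requires it). `SafeTimeTracker` and its methods are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

/** Hypothetical per-replica tracker: RO reads wait here until safeTs reaches readTs. */
final class SafeTimeTracker {
    private final TreeMap<Long, CompletableFuture<Void>> waiters = new TreeMap<>();
    private long safeTs;

    /** Returns a future that completes once safeTs >= readTs. */
    synchronized CompletableFuture<Void> waitFor(long readTs) {
        if (safeTs >= readTs) {
            return CompletableFuture.completedFuture(null);
        }
        return waiters.computeIfAbsent(readTs, k -> new CompletableFuture<>());
    }

    /** Advances safeTs monotonically; stale (smaller or equal) values are ignored. */
    void advance(long newSafeTs) {
        List<CompletableFuture<Void>> ready = new ArrayList<>();
        synchronized (this) {
            if (newSafeTs <= safeTs) {
                return;
            }
            safeTs = newSafeTs;
            // Every waiter with readTs <= newSafeTs can proceed.
            NavigableMap<Long, CompletableFuture<Void>> satisfied = waiters.headMap(newSafeTs, true);
            ready.addAll(satisfied.values());
            satisfied.clear();
        }
        // Complete outside the lock so dependent callbacks do not run under it.
        ready.forEach(f -> f.complete(null));
    }

    synchronized long current() {
        return safeTs;
    }
}
```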

In the picture we have two concurrent transactions mapped to the same partition: T1 and T2.
OpReq(w1) and OpReq(w2) are received concurrently. Each write intent is assigned a timestamp in a monotonic order consistent with the replication order. This can be done, for example, when replication entries are dequeued for processing by the replication protocol (we assume entries are replicated successively).
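One way to picture timestamp assignment at dequeue time is the sketch below. It assumes a single queue per replication group and replaces the HLC with a simplified clock that takes the maximum of the physical time and the last issued value plus one. `TimestampingQueue` and `StampedEntry` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;

/** Hypothetical replicated entry carrying the timestamp assigned at dequeue time. */
final class StampedEntry {
    final String command;
    final long ts;

    StampedEntry(String command, long ts) {
        this.command = command;
        this.ts = ts;
    }
}

/** Hypothetical queue: commands arrive concurrently, timestamps follow dequeue order. */
final class TimestampingQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final LongSupplier physicalClock;
    private long lastTs;

    TimestampingQueue(LongSupplier physicalClock) {
        this.physicalClock = physicalClock;
    }

    synchronized void enqueue(String command) {
        pending.add(command);
    }

    /** Dequeues everything in FIFO order, stamping each entry with a strictly increasing ts. */
    synchronized List<StampedEntry> drain() {
        List<StampedEntry> batch = new ArrayList<>();
        String command;
        while ((command = pending.poll()) != null) {
            // Simplified stand-in for an HLC tick: never go backwards, never repeat.
            lastTs = Math.max(physicalClock.getAsLong(), lastTs + 1);
            batch.add(new StampedEntry(command, lastTs));
        }
        return batch;
    }
}
```

Because stamping happens under the same lock that fixes the dequeue order, the timestamp order and the replication order cannot diverge, even if the physical clock stalls.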

It’s not enough just to wait for safeTs to advance: it may never happen if there is no activity in the partition. Consider the next diagram:

We need an additional safeTsSync command to propagate a safeTs event in case there are no updates in the partition.
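A rough sketch of the idle-partition case, assuming a periodic tick on the leader and a simple in-memory list standing in for the replication log. `IdleSafeTimeSync`, `tick`, and the `safeTsSync@...` string encoding are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical leader-side helper that emits a safeTsSync entry when the partition is idle. */
final class IdleSafeTimeSync {
    private final List<String> log = new ArrayList<>(); // stand-in for the replication log
    private boolean activitySinceLastTick;

    /** An ordinary update: it already carries a timestamp, so no extra sync is needed. */
    synchronized void replicateUpdate(String command) {
        log.add(command);
        activitySinceLastTick = true;
    }

    /** Called periodically; appends a sync entry only if nothing was replicated since the last tick. */
    synchronized void tick(long leaderNow) {
        if (!activitySinceLastTick) {
            log.add("safeTsSync@" + leaderNow);
        }
        activitySinceLastTick = false;
    }

    synchronized List<String> log() {
        return new ArrayList<>(log);
    }
}
```

The sync entry is appended to the same log as ordinary updates, so replicas advance safeTs through the usual apply path.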

We need to linearize safe time updates in all cases, including leader change. So we need a guarantee that the safe time on non-primary replicas will never be greater than the HLC on the leader (we assume that the primary replica is colocated with the leader). We are going to solve this problem by associating every potential safeTime value (propagated to the replica from the leader via appendEntries) with some log index; this value (the safe time candidate) should be applied as the new safe time at the moment when the corresponding index is committed.
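The candidate mechanism can be sketched as a map from log index to candidate timestamp that is drained on commit. This is an assumption-laden illustration: `SafeTimeCandidates` and its methods are hypothetical names, and the `truncateFrom` step for a log suffix discarded after a leader change is an assumption, not something the ticket specifies.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical replica-side holder of safe time candidates keyed by log index. */
final class SafeTimeCandidates {
    private final NavigableMap<Long, Long> candidates = new TreeMap<>();
    private long safeTime;

    /** Called when an entry arrives via appendEntries; the value is not visible yet. */
    synchronized void addCandidate(long logIndex, long candidateTs) {
        candidates.put(logIndex, candidateTs);
    }

    /** Called when commitIndex advances; promotes every candidate at or below it. */
    synchronized long onCommitted(long commitIndex) {
        NavigableMap<Long, Long> committed = candidates.headMap(commitIndex, true);
        for (long ts : committed.values()) {
            safeTime = Math.max(safeTime, ts); // safe time never moves backwards
        }
        committed.clear();
        return safeTime;
    }

    /** Assumption: uncommitted entries dropped after a leader change take their candidates with them. */
    synchronized void truncateFrom(long logIndex) {
        candidates.tailMap(logIndex, true).clear();
    }
}
```

Because a candidate only becomes the safe time once its index is committed, a value received from a leader that is later deposed never becomes visible to readers.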

Hence, the safeTimeSyncCommand also should be a Raft write command.

Attachments

- 20240409180024.jpg (09/Apr/24 10:10, 62 kB, yexiaowei)
- Screenshot from 2022-07-06 16-48-41.png (06/Jul/22 13:49, 47 kB, Alexander Lapin)
- Screenshot from 2022-07-06 16-48-30.png (06/Jul/22 13:49, 41 kB, Alexander Lapin)

Issue Links

Dependency

IGNITE-17332 Smart SQL node mapping for RO requests

Open

is depended upon by

IGNITE-17872 Fetch commit index on non-primary replicas instead of waiting for safe time in case of RO tx on idle cluster

Open

links to

GitHub Pull Request #1177

GitHub Pull Request #1265

GitHub Pull Request #1269

People

Assignee: Denis Chudov

Reporter: Alexander Lapin

Reviewer: Alexander Lapin

Votes: 0

Watchers: 3

Dates

Created: 29/Jun/22 06:52

Updated: 09/Apr/24 10:12

Resolved: 27/Oct/22 21:28

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

40m