Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Safe time for non-primary replicas (see IGNITE-17263) was conceived as an optimization to avoid unnecessary network hops. Safe time is propagated from the primary replica via raft appendEntries messages. When the cluster is under constant load from RW transactions, these messages refresh safe time on replicas frequently. In an idle cluster, or a cluster with a read-only load, safe time is propagated only periodically, via heartbeats. This means that if an RO transaction with a read timestamp in the present or future tries to read a value from a non-primary replica, it first has to wait for safe time, which advances only at the heartbeat frequency, so the duration of the read operation may be close to the heartbeat period. This is unnecessary latency for a read and will cause performance issues.
Example:
Heartbeat period is 500 ms.
Current safe time on replica is 1.
We are processing read-only request with timestamp=2.
There were no RW transactions for some time, and the next expected update of safe time, according to the heartbeat period, is 1 + 500 = 501.
This means that we have to wait for about 499 ms (assuming the clock skew and ping in the cluster are 0) before we can proceed with processing the RO request.
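The arithmetic in the example can be sketched in Java as follows. This is only an illustration: the class and method names are hypothetical and are not part of the Ignite code base, and the model assumes that safe time advances only via heartbeats, with zero clock skew and zero network latency, as in the example.

```java
/** Hypothetical helper that reproduces the arithmetic of the example; not Ignite API. */
final class SafeTimeWaitExample {
    /**
     * Estimates how long an RO request waits for safe time on an idle replica.
     * Assumes safe time advances only via heartbeats, each heartbeat carries the
     * time it was sent, and clock skew and network latency are zero.
     *
     * @param lastSafeTime      current safe time on the replica (ms)
     * @param readTs            read timestamp of the RO request (ms)
     * @param now               current time on the replica (ms)
     * @param heartbeatPeriodMs heartbeat period (ms)
     */
    static long expectedWaitMs(long lastSafeTime, long readTs, long now, long heartbeatPeriodMs) {
        if (readTs <= lastSafeTime) {
            return 0; // Safe time already covers the read timestamp.
        }

        // The request can proceed at the first heartbeat that is not earlier than
        // both the read timestamp and the current time.
        long target = Math.max(readTs, now);
        long gap = target - lastSafeTime;
        long heartbeats = (gap + heartbeatPeriodMs - 1) / heartbeatPeriodMs; // ceiling division
        long arrival = lastSafeTime + heartbeats * heartbeatPeriodMs;

        return arrival - now;
    }
}
```

With the values from the example (safe time 1, read timestamp 2, current time 2, heartbeat period 500), the method returns 499.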
So, even though safe time is an optimization, we shouldn't rely on it when there are no RW transactions affecting the given replica and the timestamp of the current RO transaction is greater than safe time. Instead of waiting for the safe time update, we should fall back to a read index request to the leader to minimize the processing time of the current RO request.
To do this, we should compare the read timestamp with safe time. If the read timestamp is greater, and the time elapsed since the last RW transaction affecting this replica exceeds some timeout (i.e. we expect that safe time will be updated only via periodic updates), we shouldn't wait for safe time. Instead, we should perform a read index request to the leader to get the latest updates that may not have been replicated yet.
If readIndex shows that the current committed index on the leader is the same as on the replica (i.e. there are no updates in flight that the replica has yet to receive), the read-only request must wait for the safe time update without further attempts to repeat the readIndex operation, since those attempts would most likely be useless.
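The decision described above could be sketched roughly as follows. All names here (RoReadPlanner, Action, idleThresholdMs) are hypothetical and chosen for illustration only. The source does not spell out what happens when the leader's committed index is ahead of the replica, so the WAIT_FOR_REPLICATION outcome is an assumption.

```java
/** Hypothetical sketch of the proposed decision logic; not the actual Ignite implementation. */
final class RoReadPlanner {
    enum Action { READ_NOW, WAIT_FOR_SAFE_TIME, REQUEST_READ_INDEX, WAIT_FOR_REPLICATION }

    /**
     * Decides what to do when an RO request arrives at a non-primary replica.
     *
     * @param readTs          read timestamp of the RO transaction
     * @param safeTime        current safe time on the replica
     * @param msSinceLastRw   time since the last RW transaction touched this replica
     * @param idleThresholdMs timeout after which safe time is expected to move only via heartbeats
     */
    static Action beforeRead(long readTs, long safeTime, long msSinceLastRw, long idleThresholdMs) {
        if (readTs <= safeTime) {
            return Action.READ_NOW; // Safe time already covers the read timestamp.
        }
        if (msSinceLastRw > idleThresholdMs) {
            return Action.REQUEST_READ_INDEX; // Replica looks idle: ask the leader instead of waiting.
        }
        return Action.WAIT_FOR_SAFE_TIME; // RW traffic should advance safe time soon.
    }

    /** Decides what to do once the leader has answered the readIndex request. */
    static Action afterReadIndex(long leaderCommittedIndex, long replicaIndex) {
        if (leaderCommittedIndex <= replicaIndex) {
            // Nothing is in flight: repeating readIndex would not help, wait for safe time.
            return Action.WAIT_FOR_SAFE_TIME;
        }
        // Assumption: the replica waits until it has applied entries up to the leader's index.
        return Action.WAIT_FOR_REPLICATION;
    }
}
```

For the example above, beforeRead(2, 1, 10_000, 1_000) yields REQUEST_READ_INDEX, and afterReadIndex(7, 7) yields WAIT_FOR_SAFE_TIME.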
We should also think about measures to prevent extra load on the leader caused by replicas spamming readIndex when they receive multiple read-only requests.
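One possible measure, shown here purely as a sketch, is to let concurrent read-only requests on a replica share a single in-flight readIndex call. The CoalescingReadIndex class and the readIndexRpc supplier are hypothetical stand-ins for the real raft client call. Whether a request that arrives after the call was sent may safely reuse its result depends on the consistency requirements, which this sketch does not address.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch: at most one readIndex request is in flight per replica. */
final class CoalescingReadIndex {
    /** Stand-in for the real readIndex call to the leader (an assumption, not Ignite API). */
    private final Supplier<CompletableFuture<Long>> readIndexRpc;

    /** Result shared by all callers while a request is in flight; null when idle. */
    private final AtomicReference<CompletableFuture<Long>> inFlight = new AtomicReference<>();

    CoalescingReadIndex(Supplier<CompletableFuture<Long>> readIndexRpc) {
        this.readIndexRpc = readIndexRpc;
    }

    /** Returns the leader's committed index, reusing an in-flight request if there is one. */
    CompletableFuture<Long> readIndex() {
        while (true) {
            CompletableFuture<Long> current = inFlight.get();
            if (current != null) {
                return current; // Join the request that is already on its way.
            }

            CompletableFuture<Long> shared = new CompletableFuture<>();
            if (!inFlight.compareAndSet(null, shared)) {
                continue; // Another thread won the race; retry and join its request.
            }

            try {
                readIndexRpc.get().whenComplete((index, err) -> {
                    // Free the slot first so that later callers issue a fresh request.
                    inFlight.compareAndSet(shared, null);
                    if (err != null) {
                        shared.completeExceptionally(err);
                    } else {
                        shared.complete(index);
                    }
                });
            } catch (RuntimeException e) {
                // The call failed before returning a future: free the slot and report the error.
                inFlight.compareAndSet(shared, null);
                shared.completeExceptionally(e);
            }
            return shared;
        }
    }
}
```

With this approach, a burst of read-only requests that arrive while one readIndex call is pending results in a single message to the leader; a new call is issued only after the previous one completes.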
Issue Links
- depends upon: IGNITE-17263 Implement leader to replica safe time propagation (Resolved)