We update the most recent commit on requiredParticipant replicas if out of sync during the prepare round in beginAndRepairPaxos method. We keep doing this in a loop till the requiredParticipant replicas have the same most recent commit or we hit timeout.
Say we have 3 machines A,B and C and gc grace on the table is 10 days. We do a CAS write at time 0 and it went to A and B but not to C. C will get the hint later but will not update the most recent commit in paxos table. This is how CAS hints work.
In the paxos table whose gc_grace=0, most_recent_commit in A and B will be inserted with timestamp 0 and with a TTL of 10 days. After 10 days, this insert will become a tombstone at time 0 till it is compacted away since gc_grace=0.
Do a CAS read after say 1 day on the same CQL partition and this time prepare phase involved A and C. most_recent_commit on C for this CQL partition is empty. A sends the most_recent_commit to C with a timestamp of 0 and with a TTL of 10 days. This most_recent_commit on C will expire on 11th day since it is inserted after 1 day.
most_recent_commit are now in sync on A,B and C, however A and B most_recent_commit will expire on 10th day whereas for C it will expire on 11th day since it was inserted one day later.
Do another CAS read after 10days when most_recent_commit on A and B have expired and is treated as tombstones till compacted. In this CAS read, say A and C are involved in prepare phase. most_recent_commit will not match between them since it is expired in A and is still there on C. This will cause most_recent_commit to be applied to A with a timestamp of 0 and TTL of 10 days. If A has not compacted away the original most_recent_commit which has expired, this new write to most_recent_commit wont be visible on reads since there is a tombstone with same timestamp(Delete wins over data with same timestamp).
Another round of prepare will follow and again A would say it does not know about most_recent_write(covered by original write which is not a tombstone) and C will again try to send the write to A. This can keep going on till the request timeouts or only A and B are involved in the prepare phase.
When A’s original most_recent_commit which is now a tombstone is compacted, all the inserts which it was covering will come live. This will in turn again get played to another replica. This ping pong can keep going on for a long time.
The issue is that most_recent_commit is expiring at different times across replicas. When they get replayed to a replica to make it in sync, we again set the TTL from that point.
During the CAS read which timed out, most_recent_commit was being sent to another replica in a loop. Even in successful requests, it will try to loop for a couple of times if involving A and C and then when the replicas which respond are A and B, it will succeed. So this will have impact on latencies as well.
These timeouts gets worse when a machine is down as no progress can be made as the machine with unexpired commit is always involved in the CAS prepare round. Also with range movements, the new machine gaining range has empty most recent commit and gets the commit at a later time causing same issue.
1. Paxos TTL is max(3 hours, gc_grace) as defined in SystemKeyspace.paxosTtl(). Change this method to not put a minimum TTL of 3 hours.
Method SystemKeyspace.paxosTtl() will look like return metadata.getGcGraceSeconds(); instead of return Math.max(3 * 3600, metadata.getGcGraceSeconds());
We are doing this so that we dont need to wait for 3 hours.
Create a 3 node cluster with the code change suggested above with machines A,B and C
CREATE KEYSPACE test WITH REPLICATION =
CREATE TABLE users (a int PRIMARY KEY,b int);
alter table users WITH gc_grace_seconds=120;
bring down machine C
INSERT INTO users (user_name, password ) VALUES ( 1,1) IF NOT EXISTS;
Nodetool flush on machine A and B
Bring up the down machine B
wait 80 seconds
Bring up machine C
select * from users where user_name = 1;
Wait 40 seconds
select * from users where user_name = 1; //All queries from this point forward will timeout.
One of the potential fixes could be to set the TTL based on the remaining time left on another replicas. This will be TTL-timestamp of write. This timestamp is calculated from ballot which uses server time.