Details
Bug
Status: Triage Needed
Normal
Resolution: Unresolved
Description
In the Paxos protocol, two queries that attempt to update the same key from different coordinators start two independent Paxos rounds. Each round is assigned a timestamp (its ballot), and the coordinator with the highest timestamp wins.
If the rounds are started at different nodes, the coordinator with the lower ballot sleeps for a random interval and retries.
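The contention rule above can be illustrated with a toy model. This is only a sketch, not Cassandra's implementation: the `Ballot` class, the coordinator-name tie-breaker, and the 100 ms backoff ceiling are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(order=True)
class Ballot:
    """Toy ballot: ordered by timestamp, then by coordinator name.

    The tie-breaker on coordinator name is an assumption of this sketch.
    """
    timestamp: int
    coordinator: str


def contend(a: Ballot, b: Ballot) -> tuple:
    """Return (winner, loser): the higher ballot wins the round."""
    return (a, b) if a > b else (b, a)


def retry_delay_ms(rng: random.Random, max_ms: float = 100.0) -> float:
    """Random sleep before the losing coordinator retries.

    The 100 ms ceiling is a hypothetical value, not a Cassandra default.
    """
    return rng.uniform(0.0, max_ms)


if __name__ == "__main__":
    winner, loser = contend(Ballot(10, "node1"), Ballot(12, "node2"))
    delay = retry_delay_ms(random.Random())
    print(f"{winner.coordinator} wins; {loser.coordinator} retries in {delay:.1f} ms")
```

Every round started on a different coordinator risks this lose-and-retry path, which is why spreading LWT requests across coordinators is costly for a hot key.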
If the key is contended, an update needs many retries to succeed, since most client drivers round-robin over different coordinators.
Instead, for LWT queries the driver should choose coordinators in a pre-defined order, so that in case of contention they will queue up at the coordinator, rather than compete: choose the primary replica first, then, if the primary is known to be down, the first secondary, then the second secondary, and so on.
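A minimal sketch of the proposed ordering, assuming the driver already knows the replica list for the key in ring order (primary first) and tracks which nodes it believes are down. The function name `lwt_query_plan` and the host addresses are hypothetical and do not correspond to any existing driver API.

```python
from typing import List, Set


def lwt_query_plan(replicas: List[str], down: Set[str]) -> List[str]:
    """Return coordinators to try, in a fixed order.

    `replicas` is assumed to be ordered primary first, then the first
    secondary, the second secondary, and so on. Nodes known to be down
    are skipped, so every client picks the same first live replica.
    """
    return [host for host in replicas if host not in down]


if __name__ == "__main__":
    replicas = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical addresses
    print(lwt_query_plan(replicas, down=set()))         # primary comes first
    print(lwt_query_plan(replicas, down={"10.0.0.1"}))  # first secondary comes first
```

Because the order depends only on the replica list and on node liveness, clients that agree on both send their LWT requests for a key to the same coordinator, where the requests queue up instead of competing.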
This will reduce contention over hot keys and thus increase LWT performance.
Unfortunately, the driver does not know when it is executing an LWT statement. Identifying such a statement purely on the client is also difficult, because it requires parsing the statement text. LWT statements also need a flag set so that result set metadata is always included, which leaves most applications with the extra burden of detecting LWT and setting this flag.
To let the driver choose replicas in a pre-defined order and avoid contention, and to spare clients from parsing CQL in order to set the DisableSkipMetadata flag on the query, Cassandra should return an LWT flag in the resultMetadata flags for LWT statements. This change is backward compatible and can be made in any version of the server, since drivers already ignore unknown flags.
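On the driver side, reading such a flag could look roughly like the sketch below. The first three flag values are the ones defined for result metadata in the native protocol; the `LWT_FLAG` bit value is purely hypothetical, since this ticket does not specify one, and the helper names are illustrative only.

```python
import struct

# Result metadata flags defined by the native protocol.
GLOBAL_TABLES_SPEC = 0x0001
HAS_MORE_PAGES = 0x0002
NO_METADATA = 0x0004

# Hypothetical bit for the proposed LWT flag; the real value is not
# specified in this ticket.
LWT_FLAG = 0x0100


def read_flags(buf: bytes) -> int:
    """Read the 4-byte big-endian flags field at the start of the metadata."""
    (flags,) = struct.unpack_from(">I", buf, 0)
    return flags


def is_lwt(flags: int) -> bool:
    """A driver that knows about the new flag tests its bit."""
    return bool(flags & LWT_FLAG)


def legacy_view(flags: int) -> dict:
    """A driver that predates the flag tests only the bits it knows,
    so an extra bit set by the server changes nothing for it."""
    return {
        "global_tables_spec": bool(flags & GLOBAL_TABLES_SPEC),
        "has_more_pages": bool(flags & HAS_MORE_PAGES),
        "no_metadata": bool(flags & NO_METADATA),
    }


if __name__ == "__main__":
    flags = read_flags(struct.pack(">I", GLOBAL_TABLES_SPEC | LWT_FLAG))
    print(is_lwt(flags), legacy_view(flags))
```

The legacy view is identical with or without the extra bit, which is the sense in which the change is backward compatible.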