I don't think PaxosState introduces artificial contention in the sense that we do need to serialize paxos operations, so this is definitely better than One Big Lock
Sorry that wasn't clear, I was thinking of contention between proposers, not contention local to a replica. Because we bucket the paxos states, we will reuse the same state for different keys. In other words, the current algorithm does not do Paxos at the row level, but rather Paxos at the level of all rows whose key's hashcode modulo 1024 is equal, so 2 proposers on 2 different row keys may compete with each other.
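To make the collision concrete, here is a small sketch of the bucketing as I understand it; `PaxosBuckets`, `bucketFor` and `findCollidingKey` are illustrative stand-ins, not the patch's actual API:

```java
// Hypothetical sketch: two unrelated row keys can map to the same
// PaxosState bucket, so their proposers contend with each other.
public class PaxosBuckets {
    static final int STATE_BUCKETS = 1024; // the value mentioned above

    // Bucket index for a key: hashCode modulo STATE_BUCKETS.
    // Math.floorMod avoids negative indexes for negative hash codes.
    static int bucketFor(String key) {
        return Math.floorMod(key.hashCode(), STATE_BUCKETS);
    }

    // Brute-force a second key landing in the same bucket (illustration only;
    // with only 1024 buckets, a collision is found almost immediately).
    static String findCollidingKey(String key) {
        int target = bucketFor(key);
        for (int i = 0; ; i++) {
            String candidate = "user:" + i;
            if (!candidate.equals(key) && bucketFor(candidate) == target)
                return candidate;
        }
    }

    public static void main(String[] args) {
        String a = "user:42";
        String b = findCollidingKey(a);
        // CAS operations on these two unrelated keys serialize against each other.
        System.out.println(a + " -> bucket " + bucketFor(a));
        System.out.println(b + " -> bucket " + bucketFor(b));
    }
}
```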
And since proposer contention is so damn expensive, I wonder if that artificial contention won't trump any gain we get from keeping the paxos state in memory (in cases where people have a non-trivial load of CAS operations, even though they are on different row keys). Sure, we could play with the STATE_BUCKETS value, but I'm still of the opinion that it's premature optimization at this point.
On the part about serializing Paxos operations on a given replica, I agree with you that it's necessary, and for that we could absolutely use lock buckets (like we used to do for 2ndary indexes) to optimize. But we can do that without bucketing the whole paxos state.
As a side note, I think that we may want to TTL the Paxos state (we do store the MRC update after all; if someone has done a CAS on a row and doesn't do any CAS anymore, it would be sad to keep that MRC forever). And we can use a setting like gc_grace, since provided you guarantee that everything older than gc_grace is consistent, you can forget about states older than gc_grace. Of course, in practice a setting separate from gc_grace would make sense.
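Just to sketch the expiry rule I have in mind (names and the 10-day default are illustrative only, and as said above the window would be a setting of its own in practice):

```java
// Hypothetical sketch of TTL'ing Paxos state: a most-recent commit (MRC)
// older than a gc_grace-like window is guaranteed consistent and can be
// forgotten. Not the patch's actual API.
public class PaxosStateTtl {
    // Example window only; gc_grace's default is 10 days.
    static final long TTL_SECONDS = 10 * 24 * 3600;

    // True when the stored MRC is old enough to expire.
    static boolean isExpired(long mrcCommitMillis, long nowMillis, long ttlSeconds) {
        return nowMillis - mrcCommitMillis > ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long committedAt = 0L;
        System.out.println(isExpired(committedAt, TTL_SECONDS * 1000L, TTL_SECONDS));     // false: exactly at the boundary
        System.out.println(isExpired(committedAt, TTL_SECONDS * 1000L + 1, TTL_SECONDS)); // true: past the window
    }
}
```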
Would rather make them follow-on tickets
I'm fine with that. Just wanted to mention it because I don't think we should release CAS without those tickets done. Maybe we can create a top-level ticket to group all those tickets, making sure we don't forget something important?
We wait for as many nodes as were alive at the start of the paxos request.
Fair enough, I misread that part. But this in turn means that we'll timeout as soon as any node that we thought was alive happens to be dead, even if we have a QUORUM of responses. And since the failure detector generally takes at least a handful of seconds to detect a dead node, it means that each time a node dies, all CAS operations will timeout until at least one other node detects it dead (and while it's true that currently some reads may timeout when a node dies, this is at least not all reads; besides, it's known to be bad for latency and is one of the reasons why CASSANDRA-4705 is useful). Overall I think we should make progress as soon as we have QUORUM answers.
Furthermore, another problem is that we throw an exception if we don't have a quorum on the last MRC. But that in turn means that if, amongst all live nodes, we don't have a QUORUM of nodes that have learned that last MRC, we don't do anything to fix the issue (we throw an exception in fact). So if we have really lost messages, until something external magically repairs the nodes that are missing the last MRC, we'll be in a state where no CAS operation on the row can proceed at all. Even believing that hinted handoff will be that external repair (which would assume that hinted handoffs are bullet-proof, and they are not), there could still be a reasonably long window of unavailability just because HH delivery is not prioritized/timely.
Lastly, I note that it's not unlikely to miss a QUORUM of nodes on the last MRC due to timing, because the last MRC commit may have reached some replicas but not all of them (this will even likely be common when proposers contend, because the one that gets to be 2nd will do its prepare phase more or less at the same time the previous one does its commit). And in that case, note that nothing is wrong with the cluster per se; all replicas may be alive. Throwing an UnavailableException in that case would imo be a very unfriendly user experience.
So to recap, what I suggest instead is to 1) only wait for QUORUM responses on prepare, then 2) if not everyone is on the same last MRC, "repair" those that aren't by committing said last MRC (with the MRC ballot in particular, not our ballot; this is not part of the Paxos algorithm per se, this is just us re-sending the learn message of the previous round to make sure a quorum has it), and then 3) restart from the beginning (i.e. re-prepare).
I believe this does not suffer from the problems above: we prepare, and if the first quorum we get hasn't got the last MRC yet (which can be due to timing or to some lost message; we don't know and shouldn't assume either one), we send it to them and start over again.
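To show the shape of the loop I'm suggesting, here is a toy simulation (replicas reduced to the highest MRC ballot they've learned; all names are hypothetical, not the patch's API):

```java
import java.util.Arrays;

// Hypothetical simulation of the suggested flow:
// 1) look at a quorum of prepare responses,
// 2) if responders disagree on the last MRC, re-send that MRC (with the
//    MRC's own ballot, not ours) to the laggards,
// 3) loop back and re-prepare.
public class PrepareRepairRetry {
    // Returns how many prepare rounds were needed before everyone in the
    // quorum agreed on the last MRC.
    static int repairRounds(int[] replicaMrc) {
        int rounds = 0;
        while (true) {
            rounds++;
            // Step 1: inspect the responses (here: all of them, for simplicity).
            int max = Arrays.stream(replicaMrc).max().getAsInt();
            boolean inSync = Arrays.stream(replicaMrc).allMatch(m -> m == max);
            if (inSync)
                return rounds; // everyone has the last MRC: proceed with our proposal
            // Step 2: "repair" the laggards by re-sending the last MRC's learn message.
            for (int i = 0; i < replicaMrc.length; i++)
                if (replicaMrc[i] < max)
                    replicaMrc[i] = max;
            // Step 3: fall through and re-prepare.
        }
    }

    public static void main(String[] args) {
        // One replica missed the last commit: one repair round, then re-prepare succeeds.
        System.out.println(repairRounds(new int[]{3, 3, 1})); // prints 2
    }
}
```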
Then we can probably optimize a bit: since we know, when we "repair" a replica, that we're going to re-prepare just afterwards, we could do both in one message. This does complicate the logic a bit though, and since it's just an optimization, it could clearly be moved to a follow-up ticket.
We don't erase the promise, and we shouldn't, because that would allow us to accept an out-of-date propose
I don't think that matters, because we have the MRC to avoid accepting out-of-date proposals (and the logic already does it). In all fairness, keeping the promise is not incorrect either; it's just that I think it's useless to keep around.
Agreed, but the logic to do this gets kind of ugly
Can't we just say in PrepareCallback that if promised != true for a response, then we do a 'while (latch.getCount() > 0) latch.countDown()'? Without pretending this is the most elegant code ever, this seems simple enough to be worth it to me.
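Spelled out, what I have in mind is something like the following; `PrepareCallbackSketch` and its `response` method are simplified stand-ins for the real PrepareCallback:

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the suggestion above: a single refused promise drains the latch
// so the waiting coordinator wakes immediately instead of waiting for the
// remaining responses. Simplified stand-in for the real PrepareCallback.
public class PrepareCallbackSketch {
    final CountDownLatch latch;

    PrepareCallbackSketch(int targets) {
        latch = new CountDownLatch(targets);
    }

    // Called once per replica response.
    void response(boolean promised) {
        if (!promised) {
            // Not the most elegant code ever, but simple: wake the waiter now.
            while (latch.getCount() > 0)
                latch.countDown();
            return;
        }
        latch.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        PrepareCallbackSketch cb = new PrepareCallbackSketch(3);
        cb.response(true);   // one promise received
        cb.response(false);  // a refusal drains the latch immediately
        cb.latch.await();    // returns without waiting for the third response
        System.out.println("latch count: " + cb.latch.getCount()); // 0
    }
}
```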
What does this buy us?
I'd say correctness. If we bother with Paxos to serialize our commits, I think we want to guarantee that the column timestamp resolution won't break that order. If I CAS more or less simultaneously 0->1 and 1->2, then if the 2nd one returns true, it means it has been serialized after the first one, so I should be guaranteed not to get 1 when I read; but I could, if due to clock skew or otherwise the value 2 gets a timestamp lower than the value 1.
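A toy illustration of the hazard (`resolve` is a stand-in for last-write-wins timestamp reconciliation, not Cassandra's actual resolver):

```java
// Illustration: if column timestamps come from each proposer's wall clock,
// a Paxos-later CAS can carry an older timestamp under clock skew, and
// higher-timestamp-wins reconciliation keeps the stale value.
public class TimestampOrdering {
    // Last-write-wins on timestamps (stand-in for cell reconciliation).
    static int resolve(int value1, long ts1, int value2, long ts2) {
        return ts2 >= ts1 ? value2 : value1;
    }

    public static void main(String[] args) {
        // CAS 0->1 serialized first, CAS 1->2 second (Paxos order),
        // but the second proposer's clock is behind.
        System.out.println(resolve(1, 1000L, 2, 900L)); // prints 1: the read breaks serialization
        // Deriving timestamps from the monotonically increasing ballot
        // makes timestamp order match Paxos order.
        System.out.println(resolve(1, 1L, 2, 2L));      // prints 2: correct
    }
}
```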
I prefer to keep ballot/proposal separate except for the MRC since promise/accept are done at different times
My reasoning is that the high-level view of the algorithm is: you prepare with some ballot to see the state of the algorithm, then you propose a commit, and commit it if it's accepted. So I still think using Commit for both propose and commit would be cleaner, and would let us merge Commit and ProposeRequest, which are really the same. Anyway, it's a detail I suppose.