There's a problem with my previous example. Here it is again, slightly improved:
T&S X +1 , timestamp=5
T&S X +2, timestamp=4
The two T&S are routed to A , B, C:
@A T&S X + 1
@B T&S X+2
@C T&S X+2
@A T&S X+2 <------Fail, A already updated X
@B T&S X+1 <---- fail
@C T&S X+2 <---- fail
The node that coordinated the T&S for Process1 reports a failure because 2 out of 3 T&S attempts failed.
Process2 reports success because 1 out of 3 failed, 2 succeeded.
The correct value at the node is now X+2 (from Process2)
Now, if node C dies, here's what we have on A & B:
@A: X+1, ts=5
@B: X+2, ts=4
We can't resolve this using majority (we don't do that currently anyway). The latest timestamp rule will give us the wrong answer, and we'll lose the update from Process2. Bad.
If the value at A was fixed before C went down, we wouldn't be in this trouble:
The node coordinating the T&S should always "cancel" the successful (minority) writes from a failed attempt. A simple way to implement cancelled writes would be to add a bit for that column indicating that the column is "cancelled". The cancelling write should also use T&S to avoid trampling on a newer value if the old invalid value was replaced in the meantime. If the cancelling T&S fails, we have nothing to worry about. When a node sees a cancelled column, it should replace that value with the value from a peer even if the peer node sends a value with an older timestamp. This would involve 1) a simple change to read repair logic, 2) change to the read path to make sure cancelled columns are not read (if a read results in a cancelled column, then you trigger quorum read), and 3) modifying ColumnFamily to understand cancelled writes.
The cancel-writes approach is cleaner than a forced rollback to an older value by removing this bad value from the memtable. At first glance it seems simpler, but what if the memtable got flushed in the meantime? You can't really remove this bad value from the SSTable.
What happens with Hinted-handoff:
If you don't hear back from any of the intended peers, the T&S is assumed failed. Handing off T&S doesn't make sense because it requires a success quorum. The failed node can come back up and get the correct values using read repair – business as usual.
Of course, this still leaves us with a window of vulnerability. If C goes down (in the example above) and the node running Process1 dies before it has a chance to fix A, we still end up with an invalid update. But I'm guessing this is a good tradeoff to make in a system that relies on eventual consistency. This becomes a problem if C happens to be the node running Process1. The alternative would be a protocol based on tentative writes and confirmations – that brings in a whole slew of other new semantics and other complications.
I'm going to scan the literature to see if there are alternative protocols in this consistency class – any thoughts on this approach? Any alternatives?