Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0 beta 1
    • Component/s: API, Core
    • Labels:
      None

      Description

      "Strong" consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create.

      Attachments

      1. half-baked commit 3.jpg (600 kB) - Jonathan Ellis
      2. half-baked commit 2.jpg (767 kB) - Jonathan Ellis
      3. half-baked commit 1.jpg (585 kB) - Jonathan Ellis

        Activity

        Jonathan Ellis added a comment -

        Possible solutions include

        1. External locking, e.g. via ZooKeeper as in Cages (http://code.google.com/p/cages/)
        2. Native locking, e.g. the wait chain algorithm now implemented in Hector (http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf)
        3. Leader election as in Spinnaker (http://arxiv.org/pdf/1103.2408.pdf)

        Jonathan Ellis added a comment -

        My preferred option is the second:

        • Zookeeper is a nasty dependency to inflict on ourselves. It breaks the "every node is equal" design and is not a picnic to operate. (Best case, you have to manually manage compaction. Worst case, nasty corner case failure mode stories are legion.)
        • Spinnaker-style leader election is a lot of complexity, even if we roll our own leader election to avoid also requiring ZK. It's basically a completely separate read and write path to implement.

        Native locking is relatively straightforward to implement and I think it reaches the bar of "performant enough for now."

        But I prefer to expose the functionality as CAS rather than as locking, precisely because it will allow us to switch to something more performant like a Spinnaker design down the road, if necessary.
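
        As a hypothetical illustration of the difference in API surface (the names and signatures below are invented for this sketch, not a committed interface): a CAS call hides whatever mechanism sits underneath, whereas a lock-style API bakes acquisition, lease handling and release into every client.

        import java.nio.ByteBuffer;

        /** Illustrative only; neither of these is an actual Cassandra API. */
        interface CasStyle {
            /**
             * Atomically set key -> newValue iff the current value equals expected
             * (a null expected meaning "must not exist"). Returns true on success.
             * The implementation underneath (locks, a leader, Paxos) can change freely.
             */
            boolean compareAndSet(ByteBuffer key, ByteBuffer expected, ByteBuffer newValue);
        }

        interface LockStyle {
            /** Clients must orchestrate acquire / read / write / release themselves. */
            AutoCloseable acquire(ByteBuffer key, long leaseMillis) throws InterruptedException;
            ByteBuffer read(ByteBuffer key);
            void write(ByteBuffer key, ByteBuffer value);
        }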

        Jonathan Ellis added a comment -

        One part I don't have a good answer to yet: how do you deal with timeouts in lock-based CAS after the lock succeeds? That is: I acquire the lock, write my new user record, but that write times out. Remember that this means "write is in progress," not "write has failed," but an in-progress write might well not yet be visible for another client that also tries to CAS.

        Todd Nine added a comment -

        Can you elaborate on your timeout question with an example? I think we're on the same page with this, but wanted to be sure. There's currently a small window in which a client can think it has a lock when it actually doesn't (timeout/2). This is because the lock has no way to receive a notification when its column reaches its TTL and is removed after the lock heartbeat fails.

        Jonathan Ellis added a comment -

        I'm actually talking about the CAS operation assuming the locks work perfectly:

        1. acquire lock
        2. check for expected value
        3. proceed with update
        4. release lock

        The problem is if the update in step 3 times out: the next client to come along may not see the update. In my account creation example, SELECT * FROM users WHERE username = X might come back empty, but still have an INSERT winding through the system from a previous operation.

        The obvious answer is, "don't release the lock until we know for sure the update succeeds," but that only works if we assume the client never fails once the lock is acquired.
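
        A sketch of those four steps and the problem window, against hypothetical Lock and Store interfaces (nothing here is existing Cassandra code):

        import java.nio.ByteBuffer;
        import java.util.Objects;

        /** Illustrative lock-based CAS; LockService and Store are hypothetical. */
        public class LockBasedCas {
            interface Lock extends AutoCloseable { void release(); default void close() { release(); } }
            interface LockService { Lock acquire(ByteBuffer key) throws InterruptedException; }
            interface Store {
                ByteBuffer read(ByteBuffer key);               // e.g. a quorum read
                void write(ByteBuffer key, ByteBuffer value);  // may time out while still in flight
            }

            private final LockService locks;
            private final Store store;

            LockBasedCas(LockService locks, Store store) { this.locks = locks; this.store = store; }

            boolean compareAndSet(ByteBuffer key, ByteBuffer expected, ByteBuffer newValue) throws InterruptedException {
                try (Lock lock = locks.acquire(key)) {         // 1. acquire lock
                    ByteBuffer current = store.read(key);      // 2. check for expected value
                    if (!Objects.equals(current, expected))
                        return false;
                    store.write(key, newValue);                // 3. proceed with update
                    // PROBLEM: if this write times out it is still "in progress", not "failed",
                    // but the lock is released either way (normally or via the exception path),
                    // so the next CAS may read stale state and think it has a green light.
                    return true;
                }                                              // 4. release lock
            }
        }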

        T Jake Luciani added a comment -

        I implemented leased locks in Solandra a while ago similar to the wait chain stuff and found that predicting the upper bound on clock skew was hard.

        Vijay added a comment -

        I implemented Lock long ago, a better version is currently part of A6x Lock recipe (Write read and write)
        https://github.com/Netflix/astyanax/wiki/Distributed-Row-Lock, when in server it should be much simpler possibly without SSTWrites

        Jonathan Ellis added a comment -

        The problem is if the update in step 3 times out: the next client to come along may not see the update.

        It's actually worse than that, because even if the update succeeds at CL.ONE, the next client still may not see the update unless we read at CL.ALL. (We can't just say "well, the write has to succeed at QUORUM to call it a success," since even a success on a single replica will eventually be hinted/repaired to the others.)

        If we're willing to require all replicas to be up (or we fail the CAS with UAE, UnavailableException), then I think we can address this as follows:

        1. acquire lock (lease)
        2. check for expected value
        3. until update succeeds, retry update and renew lease
        4. release lease

        If we require the lease period to be over 2x the write request timeout (1x for the lease renewal itself, 1x for the client's update), then if the coordinator fails and its lease expires, the next CAS to acquire the lease will either:

        • See the update (reading at CL.ALL)
        • Or if not, it can assume that the update was dropped by the replicas dealing with overloaded queues

        One wrinkle: we added auto-retry to local writes in CASSANDRA-4753 to prevent dropping of hints.

        Anyway, I don't think requiring all replicas to be up is reasonable, but I include this in case it inspires something more useful.
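
        A sketch of the lease-based variant above (hypothetical Lease/LeaseService/Store interfaces; constants are illustrative). The timing constraint in code form: the lease must outlive at least two write-request timeouts, one for the lease renewal itself and one for the client's update, so an expired lease implies any in-flight update has either landed or been dropped.

        import java.util.Objects;

        /** Illustrative lease-based CAS that requires all replicas (CL.ALL reads). */
        public class LeaseBasedCas {
            interface Lease { void renew(); void release(); }
            interface LeaseService { Lease acquire(Object key, long leaseMillis); }
            interface Store {
                Object readAll(Object key);                  // CL.ALL read, or fail with unavailable
                boolean tryWrite(Object key, Object value);  // false on write timeout
            }

            static final long WRITE_REQUEST_TIMEOUT_MS = 2_000;          // illustrative value
            // Lease must cover at least 2x the write timeout: one renewal + one client update.
            static final long LEASE_MS = 2 * WRITE_REQUEST_TIMEOUT_MS + 500;

            private final LeaseService leases;
            private final Store store;

            LeaseBasedCas(LeaseService leases, Store store) { this.leases = leases; this.store = store; }

            boolean compareAndSet(Object key, Object expected, Object newValue) {
                Lease lease = leases.acquire(key, LEASE_MS);             // 1. acquire lock (lease)
                try {
                    Object current = store.readAll(key);                 // 2. check expected value at CL.ALL
                    if (!Objects.equals(current, expected))
                        return false;
                    while (!store.tryWrite(key, newValue))               // 3. retry update until it succeeds...
                        lease.renew();                                   //    ...renewing the lease each time
                    return true;
                } finally {
                    lease.release();                                     // 4. release lease
                }
            }
        }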

        Jonathan Ellis added a comment -

        It seems to me that the most promising approach is Spinnaker-style leader election, but we can use an algorithm like the one for native locking above to choose leaders instead of requiring ZK.

        Jonathan Ellis added a comment -

        Another approach is 2PC:

        1. Coordinator proposes a new value to replicas
        2. Live replicas reply that they have accepted the value and are ready to commit
        3. If a majority of nodes accept, Coordinator commits the new value to replicas
        4. Replicas acknowledge the commit; the operation is successful if a majority of replicas acknowledge it

        The problem here is dealing with partitions between steps 2 and 3. Suppose we have three replicas, and have arrived at the beginning of step 3. The coordinator issues the commit message, but only replica A receives it. Replicas B and C are partitioned away, and can potentially participate in a new quorum and commit a different value.

        This is okay as far as it goes – the original coordinator, which committed to A, never saw an ack quorum for the commit, and so would have had to return a TOE (TimedOutException) to the client. So we haven't broken our contract that exactly one client will see a successful CAS.

        But, what if the value committed to A has a higher timestamp than the one committed to B and C? Then when the partition heals, repair will treat A's as the correct value, and the write to B and C will be lost.
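
        A compressed sketch of that 2PC flow, with the dangerous gap between the accept quorum and the commit marked; Replica is a hypothetical remote interface and all error handling is omitted:

        import java.util.List;

        /** Illustrative 2PC coordinator for a CAS'd value. */
        public class TwoPhaseCommitCas {
            interface Replica {
                boolean propose(long ballot, Object value);   // steps 1-2: accept and be ready to commit
                boolean commit(long ballot);                  // steps 3-4: apply and ack the commit
            }

            boolean cas(List<Replica> replicas, long ballot, Object value) {
                int quorum = replicas.size() / 2 + 1;

                int accepted = 0;
                for (Replica r : replicas)
                    if (r.propose(ballot, value)) accepted++;
                if (accepted < quorum)
                    return false;                             // no accept quorum: reject

                // DANGER ZONE: a partition here can strand the accepted proposal. If only
                // replica A receives the commit below, B and C may later form a new quorum
                // and commit a different value; repair then keeps whichever value has the
                // higher timestamp, silently losing the other write.

                int committed = 0;
                for (Replica r : replicas)
                    if (r.commit(ballot)) committed++;
                return committed >= quorum;                   // else the client gets a timeout
            }
        }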

        Jonathan Ellis added a comment - - edited

        Paxos actually addresses this problem nicely. The key thing that Paxos does that 2PC/3PC do not is, after accepting a proposal, participants will reply to future proposals from any coordinator with the highest-numbered (i.e., most recent) accepted proposal, which the coordinator will then resume progress on. Thus, partitions cannot cause multiple operations to succeed independently.

        In our example, A and at least one of {B, C} have accepted the original proposal. On partition, a new coordinator must have both B and C accept a new proposal for it to succeed, and thus will get a reply containing A's value. The new coordinator must then commit A's value.
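
        A sketch of the acceptor-side rule being described, for a single row and with persistence and networking left out; the point is that a promise always carries back the highest-numbered proposal already accepted, so a new coordinator must finish that proposal rather than invent its own value:

        /** Illustrative single-decree Paxos acceptor state. */
        public class PaxosAcceptor {
            static final class Accepted {
                final long ballot; final Object value;
                Accepted(long ballot, Object value) { this.ballot = ballot; this.value = value; }
            }
            static final class Promise {
                final boolean ok; final Accepted mostRecentAccepted;   // null if nothing accepted yet
                Promise(boolean ok, Accepted a) { this.ok = ok; this.mostRecentAccepted = a; }
            }

            private long promisedBallot = -1;
            private Accepted accepted = null;

            /** Phase 1: promise to ignore lower ballots, returning the latest accepted proposal. */
            synchronized Promise prepare(long ballot) {
                if (ballot <= promisedBallot)
                    return new Promise(false, accepted);
                promisedBallot = ballot;
                return new Promise(true, accepted);   // coordinator must re-propose this value if non-null
            }

            /** Phase 2: accept the proposal unless a higher ballot has been promised since. */
            synchronized boolean accept(long ballot, Object value) {
                if (ballot < promisedBallot)
                    return false;
                promisedBallot = ballot;
                accepted = new Accepted(ballot, value);
                return true;
            }
        }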

        More on Paxos:

        http://the-paper-trail.org/blog/consensus-protocols-paxos/
        http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

        Piotr Kołaczkowski added a comment - - edited

        +1 for Paxos.

        I particularly like the idea it doesn't rely on having one distinguished coordinator (proposer) to guarantee correctness.

        Cristian Opris added a comment - - edited

        This shouldn't be too complicated with Paxos leader election, very similar to Spinnaker.

        I don't think it requires changing the read/write paths at the lower level, at least not significantly.

        Assume for the sake of simplicity that we use a column prefix to encode the version

        The leader elected should always be the one that has the latest version.

        This allows the leader to perform read-modify-write (conditional update) locally and do a simple quorum write to propagate that if successful.

        The leader can also increment the version sequentially.

        Conflicting writes from other replicas cannot succeed because any node that wants to write needs to get itself elected leader first.

        Since we do quorum writes not all replicas will have the full sequence of versions but regular anti-entropy (read-repair) on quorum reads should take care of that.

        If the leader fails the newly elected leader necessarily will be the one that has the latest write so it can continue to do cas locally.

        Anti-entropy should also take care of recovery and catch-up of a replica just like now.

        I believe this can all be done on top of existing functionality without major changes to read/write paths

        You could also reuse the Zab algorithm from ZK for expediency, without depending on the entire ZK codebase.
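
        For what it's worth, a minimal sketch of the leader-local conditional update plus versioned quorum write described above; QuorumWriter is hypothetical, and leader election, version encoding and anti-entropy are all assumed to exist elsewhere:

        import java.util.Objects;

        /** Illustrative: only the elected leader (holding the latest version) runs this. */
        public class LeaderLocalCas {
            interface QuorumWriter { boolean write(Object key, long version, Object value); }

            private long version;       // leader is elected such that it has the latest version
            private Object current;
            private final QuorumWriter writer;

            LeaderLocalCas(long version, Object current, QuorumWriter writer) {
                this.version = version; this.current = current; this.writer = writer;
            }

            /** Conflicting writers must win the election first, so this check is race-free. */
            synchronized boolean compareAndSet(Object key, Object expected, Object newValue) {
                if (!Objects.equals(current, expected))
                    return false;                            // local read-modify-write check
                long next = version + 1;                     // leader increments the version
                if (!writer.write(key, next, newValue))      // propagate with a simple quorum write
                    return false;                            // (anti-entropy fills in the stragglers)
                version = next;
                current = newValue;
                return true;
            }
        }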

        Jonathan Ellis added a comment -

        I don't think leader-election CAS masters are quite that simple:

        1. Spinnaker requires ZK for state besides leader elections (section 4.1). Leader elections alone are easy to do without full ZK; the rest of the Spinnaker cluster state does not look so simple.
        2. Spinnaker-style (Kafka-style? I think they are basically the same) replication requires a bunch of storage engine changes as well (section 4.2).

        Sergio Bossa added a comment -

        Not sure Paxos is a good idea.

        Complexity of the implementation aside (no wonder it is very hard to find a good one around), I think we cannot use it to efficiently handle CAS: Paxos rounds get expensive with several parallel proposers (i.e., several parallel user creations), which is (one of) the reasons why many end up using Paxos only for leader election (i.e., ZK uses a Paxos-like protocol in recovery mode, and then simple 2PC for broadcasting values).
        So, if we use Paxos only for leader election, why not use a simpler protocol, possibly lock-based (I know, old fashioned, but let's remember CAS is not the main Cassandra thing)?

        I may be missing something, or maybe making things too complex, or maybe I'm still jetlagged, so feel free to object

        Jonathan Ellis added a comment -

        Paxos rounds get expensive with several parallel proposers (i.e., several parallel user creations)

        Of course. But what I'm proposing is basically paxos-per-replica-set, not a single global paxos ensemble. So you get linearizability-per-partition (which is the same as you'd get with a master-per-replica-set approach). As I said above, I'm happy to accept "adequate" here.

        Complexity of the implementation aside

        However, most of that complexity is because of the two "operation modes" – as you point out, leader election/recovery, and 2PC. If we just stick with "raw" Paxos, where any node may make a proposal, it's much simpler.

        why not use a simpler protocol, possibly lock-based

        I'd be delighted, but as outlined above, I don't see how locks help avoid the main problem around lost acks.
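
        To make the "paxos-per-replica-set" scoping concrete, a tiny illustrative sketch (names invented): consensus state is keyed by partition, so CAS operations on unrelated partitions never contend, and linearizability is per partition rather than global.

        import java.nio.ByteBuffer;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        /** Illustrative: one independent consensus instance per partition, not one global ensemble. */
        public class PerPartitionConsensus {
            /** Placeholder for whatever per-partition state the protocol keeps (ballots, accepted proposal, ...). */
            static final class ConsensusState { }

            private final ConcurrentMap<ByteBuffer, ConsensusState> instances = new ConcurrentHashMap<>();

            ConsensusState instanceFor(ByteBuffer partitionKey) {
                // Different partitions map to different instances, so they never block each other.
                return instances.computeIfAbsent(partitionKey, k -> new ConsensusState());
            }
        }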

        Piotr Kołaczkowski added a comment - - edited

        I was thinking of using Paxos to do the CAS thing, not for leader election. With Paxos, avoiding several parallel proposers is advised only for performance reasons, not for correctness. So it should actually be easier to implement than guaranteeing only one leader, as you'd have to do for 2PC or 3PC.

        I don't think a lock-based protocol, assuming writes at a level < CL.ALL, is so much simpler than Paxos. I have a feeling that if you start from the assumption that you only need to lock a quorum of replicas instead of all of them to perform an update, you'll immediately hit all the same problems that led to inventing Paxos.

        Piotr Kołaczkowski added a comment -

        Jonathan Ellis +1, didn't see your post when writing mine

        Cristian Opris added a comment - - edited

        Afaict from the Spinnaker paper, they only require ZK for fault-tolerant leader election, failure detection and possibly cluster membership (the lower-right box in the diagram in 4.1). The rest of it is their actual data storage engine.

        A few more comments:

        1. Paxos can be made very efficient, particularly in stable operation scenarios.
        I believe Zab effectively devolves into atomic broadcast (not even 2PC) with a stable leader, so you can normally do writes with a single roundtrip, just like now.
        Edit: Zab actually requires 4 message delays (2 roundtrips).

        2. There is a difference between what I described above and what Spinnaker does. I believe they elect a leader for the entire replica group while my description assumes 1 full paxos instance per row write. I'm not fully clear atm how this would work but I believe even that can be optimized to single roundtrips per write in normal operation (I believe it's in one of Google's papers that they piggyback the commit on the next proposal for example)

        Off the top of my head: the coordinator assumes one of the replicas is the most up-to-date and attempts to use it as leader. The replica starts a Paxos round attaching the write payload. If accepted by a majority, the replica can send the commit, and it opportunistically attaches further proposals to it. If a Paxos round fails (or a number of rounds fail), it's likely the replica is behind on many rows, so the coordinator switches to another replica.

        Now this is all preliminary, as I haven't fully thought it through, but I think it's definitely worth investigating. While it may be a complicated protocol, it has significant performance advantages over locks. Just count how many roundtrips you'd need in the "wait chain" algorithm, not to mention handling expired/orphaned locks.

        Cristian Opris added a comment -

        See Multi-Paxos in the wikipedia article: http://en.wikipedia.org/wiki/Paxos_%28computer_science%29#Multi-Paxos

        Cristian Opris added a comment - - edited

        So I guess what I'm proposing is similar to what Piotr said above: each CAS is a round of Paxos.
        With some cleverness this can be collapsed to Multi-Paxos.

        Spinnaker does leader election with ZK precisely because they did not want to implement Paxos themselves.

        From the paper, section 5: "The replication protocol has two phases: a leader election phase, followed by a quorum phase where the leader proposes a write and the followers accept it."

        That is Multi-Paxos, with first phase (leader election) handled by ZK and second phase being the steady state (propose/accept) with the actual write/commit
        Edit: it's not, see below

        Jonathan Ellis added a comment -

        The thing that is stopping me from hacking together a Paxos prototype is, where do we store the accepted-but-not-committed proposals? Sticking them on the "end" of the row data in an sstable would be ugly. Ignoring it and saying "paxos proposals have to fit in memory" is attractive, but then we can't remove commitlog segments until each proposal is committed or overridden.

        (Timing out accepted proposals and discarding them would put us back in the "lost ack" problem space; we could commit multiple, conflicting proposals during a network partition at the "right" time.)

        Sergio Bossa added a comment -

        So many comments, I'll try to address some:

        1) Paxos complexity.

        I believe Paxos is complex in its original form, and ZK two-phase protocol is actually a simplification, because election is only employed to establish a single proposer and then 2PC-like atomic broadcast is used; if multiple nodes can propose at the same time, it means running several Paxos instances with several proposers.
        Here's also an interesting paper about engineering problems and tradeoffs in implementing Paxos: http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/paper2-1.pdf

        2) Paxos election VS Paxos CAS

        Doing direct CAS rounds would be pretty nice, but even with partitioning, my doubts about the protocol liveness still stand.

        3) Lock-based approaches.

        The word "lock" is probably misplaced here, my fault: as it would just be a simple leader election based on placing a "lock value" on a column (kind-of file locks), and use that to declare a leader; then, monotonically increasing numbered CAS rounds would go through the leader, and could be accomplished via a simplified 2PC protocol (to avoid lost acks).
        If the leader fails, a new one will be elected among those ones with the highest committed CAS round number (to overcome partially committed rounds).
        Note: now that I write of it, this is basically Zab on top of C*, but I still believe it is much cheaper and easier than Paxos. May be wrong obviously

        Cristian Opris added a comment -

        Rereading the papers, Spinnaker, Zab and Sergio's option 3) are pretty much the same thing.

        The alternative is to do a true Paxos instance for each CAS round but not clear how that can be done efficiently (and simply)

        Jonathan Ellis added a comment -

        One more data point: there is an interesting description of Hibari, which uses chain replication, at http://www.snookles.com/scott/publications/erlang2010-slf.pdf. It does not use ZK; instead, it relies on a SPOF "admin server" to coordinate chain reorganization when a block fails.

        Jonathan Ellis added a comment -

        The thing that is stopping me from hacking together a Paxos prototype is, where do we store the accepted-but-not-committed proposals?

        Thinking that just dumping them to a separate file (like the pre-2.0 manifest) will be adequate to start with. We can also trade off how fine-grained the paxos ensemble is (one per row, vs one per vnode) for storage space.

        Jonathan Ellis added a comment -

        because election is only employed to establish a single proposer and then 2PC-like atomic broadcast is used

        This is not correct for Paxos. (Not sufficiently familiar with ZAB to comment there.) Leader election is an optimization to keep conflicts between multiple proposers from reducing performance; the leader still follows the full Paxos algorithm when proposing, to prevent the kind of lost-ack problem I described above if the leader fails and is replaced. (See section 3 of "Paxos Made Simple.")

        use that to declare a leader; then, monotonically increasing numbered CAS rounds would go through the leader, and could be accomplished via a simplified 2PC protocol (to avoid lost acks).

        What does this 2PC-that-avoids-lost-acks look like? I'm quite convinced that 2PC itself has this problem, so it's far from clear to me how you get there by "simplifying" it.

        Sergio Bossa added a comment -

        Jonathan Ellis

        This is not correct for Paxos. (Not sufficiently familiar with ZAB to comment there)

        Right, I was talking about Zab, which does that exactly for improving liveness and performance.

        What does this 2PC-that-avoids-lost-acks look like?

        Well, given my lack of familiarity with Cassandra internals, I may be missing something here, so let's be clear about the lost-ack problem: my understanding of lost-ack is about what happens when the coordinator node sends a QUORUM request and fails before getting the ack back, causing uncertainty about the request status. So please correct me if I'm wrong here.
        But stated this way, this problem can be overcome with Zab-like 2PC: once the coordinator gets the acks from the prepare phase, it can commit without having to wait for all acks, because only committed values with the highest "commit id" will be (QUORUM) read. Then:
        1) If the coordinator fails during the prepare phase (lost ack), nothing will be committed, hence the previously committed value will be read; if the proposed value does get hinted/repaired, it will just be a tentative value.
        2) If the coordinator fails after sending commits, the coordinator with the highest commit id will take over and "realign" followers.
        3) If a partition happens, the coordinator with the minority of followers will refuse to operate CAS (Paxos would behave exactly the same here).

        Does it make sense to you?

        Obviously I may be missing some corner case, and above all, I'm not sure about how comfortably this could be implemented in Cassandra (lack of knowledge again), so take my comments just as food for thoughts.

        Cristian Opris added a comment -

        Zab is not Paxos; it just vaguely resembles it. The Zab leader replicates a totally ordered log of idempotent operations to ALL followers. It requires a quorum of followers to acknowledge a write before committing on the leader, and then commits on the followers. When the leader fails, the new leader is the one that is most up to date with the writes (highest log sequence number), so it will necessarily have all the committed writes. (If it does not have the commit for a particular write, I believe it can assume it's been committed; I'm a bit unclear on this point.)

        The new leader needs to fully synchronize all the replicas and establish a quorum before writes can resume. That may introduce a small period of unavailability.

        At least in ZK, I believe clients connect to a single replica and may be behind the leader with reads, but they will always see all the writes (including their own, since they're forwarded to the leader and replicated back) in a consistent order.
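
        A toy illustration of the recovery rule described above (names invented): the new leader is whichever live replica has the highest log sequence number, and because every committed write was acked by a quorum, any quorum of live replicas contains at least one copy of it.

        import java.util.Comparator;
        import java.util.List;

        /** Illustrative Zab-style leader choice: the most up-to-date log wins. */
        public class HighestSeqElection {
            static final class ReplicaState {
                final String id; final long lastLoggedSeq;
                ReplicaState(String id, long lastLoggedSeq) { this.id = id; this.lastLoggedSeq = lastLoggedSeq; }
            }

            static ReplicaState electLeader(List<ReplicaState> live) {
                // Highest sequence number wins, so no quorum-acked write can be left behind.
                return live.stream()
                           .max(Comparator.comparingLong(r -> r.lastLoggedSeq))
                           .orElseThrow(() -> new IllegalStateException("no live replicas"));
            }
        }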

        Cristian Opris added a comment -

        On the other hand Paxos for each CAS would be quite different.

        The basic approach would be to have each CAS be a full Paxos round (Phase 1: prepare/promise, Phase 2: propose/accept). In this case each round is independent and writes can happen concurrently (as opposed to Zab where all writes are applied serially cluster-wide).

        There doesn't even need to be a leader; that is an optimisation to ensure liveness (avoiding duelling proposers).

        Now since full Paxos is quite expensive in terms of roundtrips, there are optimisations to reduce that (see Fast Paxos in the wikipedia article) but I have yet to study the details of that.

        There is also the question of how the actual CAS op would be integrated with Paxos (who does the CAS? Presumably the proposer needs to be able to do the CAS verification locally, or maybe acceptors can NACK if the CAS is rejected locally? Would that be a valid NACK in Paxos terms?), but that can be sorted out.
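
        One possible integration, sketched purely as a strawman: the proposer does the CAS verification between the two phases, and only proposes its own value when no earlier accepted-but-uncommitted proposal has to be finished first. Acceptor and Store are hypothetical, and ballot rejection / highest-ballot bookkeeping are omitted for brevity.

        import java.util.List;
        import java.util.Objects;

        /** Illustrative proposer-side CAS check folded into a much simplified Paxos round. */
        public class CasProposer {
            interface Acceptor {
                Object prepare(long ballot);                 // returns a previously accepted value, or null
                boolean accept(long ballot, Object value);
            }
            interface Store { Object readCurrent(Object key); }

            boolean cas(List<Acceptor> acceptors, Store store, long ballot,
                        Object key, Object expected, Object newValue) {
                int quorum = acceptors.size() / 2 + 1;

                // Phase 1: prepare. If any acceptor already accepted a value, that round
                // must be finished before our own CAS can be attempted.
                Object inProgress = null;
                for (Acceptor a : acceptors) {
                    Object previouslyAccepted = a.prepare(ballot);
                    if (previouslyAccepted != null)
                        inProgress = previouslyAccepted;     // simplified: should keep the highest-ballot one
                }

                // The CAS verification itself, done on the proposer between the two phases.
                Object toPropose;
                if (inProgress != null)
                    toPropose = inProgress;                  // finish the stranded proposal instead
                else if (Objects.equals(store.readCurrent(key), expected))
                    toPropose = newValue;
                else
                    return false;                            // precondition failed: CAS rejected up front

                // Phase 2: propose/accept.
                int accepts = 0;
                for (Acceptor a : acceptors)
                    if (a.accept(ballot, toPropose)) accepts++;

                // Report success only if our own value is the one a quorum accepted.
                return accepts >= quorum && toPropose == newValue;
            }
        }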

        Cristian Opris added a comment -

        Re: which storage to use for metadata, why not use a meta column family, like for secondary indexes, or like the locks would have required?

        For Zab a persistent log will be necessary, and for Paxos a way to persist the paxos round state for each row.

        Jason Brown added a comment -

        Cristian Opris Do you have a reference about ZAB? Thanks.

        Sylvain Lebresne added a comment -

        once the coordinator gets the acks from the prepare phase, it can commit without having to wait for all acks

        What Jonathan means by lost ack is: what happens when the coordinator sends the commits to the replicas, but only a minority of replicas get that commit (say 1 of 3 replicas gets it and persists it, while the two others die between the prepare and commit phase). And later on, the 2 replicas come back up while the 3rd one now dies, and we do a new CAS (which would have a majority and so should work).

        In that case, the coordinator should probably hint something when it doesn't get the commit-ack from the 2 replicas that died. It could try to revert the commit on the 1 replica that did receive it, but said replica could now be dead too (playing devil's advocate here), so that would have to be a hint as well. Or we could rely on the commit being hinted to the 2 dead replicas when they come back up. But in any case, replicas would need to check and wait for those hints when they come back up, before joining the ring, and that's a problem (any other node could have the hint; we don't want to wait for everyone).

        To be clear, I'm not saying Zab doesn't work or anything like that, just explaining the lost-ack problem we have identified with a naive 2PC implementation.

        Sergio Bossa added a comment -

        Thanks for clarifying, Sylvain Lebresne.

        what happens when the coordinator sends the commits to the replicas, but only a minority of replicas get that commit (say 1 of 3 replicas gets it and persists it, while the two others die between the prepare and commit phase). And later on, the 2 replicas come back up while the 3rd one now dies, and we do a new CAS (which would have a majority and so should work).

        The Zab deviation from standard 2PC here is that the coordinator doesn't need to wait for the ack from replicas in the commit phase.
        If a replica fails during the prepare phase, it will just be out of the quorum.
        If a replica fails after the prepare but before completing the commit, it will recover later from the leader: so in your example, when 2 and 3 come up, they will join the leader, which may hint them the correct values.
        If the third replica that died in your example was actually the coordinator, a new coordinator will be elected among the ones that have seen either the last commit or the latest proposed commit, which will then become committed.
        So there's no lost-ack problem, as there's actually no ack at all in the commit phase: it will be "eventually" committed or recovered.

        By the way, I'm not saying this is better than Paxos for sure: I just think this is easier and more practical (which yes doesn't mean can be implemented easily on top of Cassandra).

        Jonathan Ellis added a comment - - edited

        the coordinator should probably hint something when it doesn't get the commit-ack from the 2 replicas that died

        This is racy, though; if the coordinator also dies, then we still lose.

        FWIW, Spinnaker's solution is actually pretty dicey here too: the leader does 2PC, and if the leader does not get a majority of acks back to its proposal, it will fail the op. But it doesn't actually abort or revert the proposal on the followers. (And if it tried, it would still be open to a race, where it fails before aborting, leaving some proposals extant.)

        Then, when a new leader is elected, the new leader replays the proposals it has not yet committed. So a proposal that originally failed, and was returned as such to the client, could end up committed after failover. Which is, at best, unexpected, and in the CAS case I'm pretty sure is outright broken.

        I think Sergio's proposal has a similar problem: if the leader reports success to the client after local commit, but before it has been committed to the followers, we could either (1) lose the commit on failover if followers are pessimistic, or (2) commit data that we originally reported failed as in Spinnaker if we are optimistic. On the other hand if the leader tries to wait for commit ack from followers before reporting to the client it could block indefinitely during a partition, so that is no solution either.

        Sergio Bossa added a comment -

        Jonathan Ellis:

        if the leader reports success to the client after local commit, but before it has been committed to the followers, we could either (1) lose the commit on failover if followers are pessimistic, or (2) commit data that we originally reported failed

        Nope, once the commit is reported as successful, even if still not acked, it will always be seen by clients, because:
        1) If the leader fails, the failed-over one is guaranteed to see up to the latest (proposed or committed) value, because that's the way it is elected.
        2) The only case where a commit can fail is when the prepare phase doesn't get a quorum, but in this case the leader will have to retry until it gets it, times out or fails; the last two options do not mean to the client that the commit has failed, just that it has to retry (I think the same would be true with Paxos, as the client is never part of the consensus).

        This is really Zab/ZK by the way, I'm not adding much to it

        Cristian Opris added a comment - - edited

        The Zab paper: http://research.yahoo.com/files/ladis08.pdf

        Cristian Opris added a comment - - edited

        In the Zab paper in 4.1 it says "We are able to simplify the two-phase commit protocol because we do not have aborts; followers either acknowledge the leader’s proposal or they abandon the leader. The lack of aborts also mean that we can commit once a quorum of servers ack the proposal rather than waiting for all servers to respond. This simplified two- phase commit by itself cannot handle leader failures, so we will add recovery mode to handle leader failures."

        So basically once a proposal is acked by a quorum there is no going back (no abort). The leader has to succeed in committing that or else it will lose its leadership.

        If the client times out in the meantime it has to retry and find out what the result was. Presumably this can happen with regular ACID databases as well, where a client sends COMMIT TX and times out immediately after that.
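        As a small, hedged illustration of the retry-after-timeout pattern described above (hypothetical CasClient interface, not a real driver API): a timeout means "outcome unknown", so the client re-reads to find out whether its write landed before retrying.

            import java.util.Optional;
            import java.util.concurrent.TimeoutException;

            // Hypothetical client-side interface, for illustration only.
            interface CasClient
            {
                boolean cas(String key, String expected, String update) throws TimeoutException;
                Optional<String> read(String key);
            }

            final class RetryOnTimeoutSketch
            {
                static boolean casWithRetry(CasClient client, String key, String expected, String update)
                    throws InterruptedException
                {
                    while (true)
                    {
                        try
                        {
                            return client.cas(key, expected, update);
                        }
                        catch (TimeoutException e)
                        {
                            // outcome unknown: check whether our value is already visible
                            if (client.read(key).map(update::equals).orElse(false))
                                return true;
                            Thread.sleep(100); // back off before retrying the CAS
                        }
                    }
                }
            }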

        Cristian Opris added a comment - - edited

        Note that a proposal may eventually succeed on recovery even if less than a quorum has managed to ack it before the leader fails (and the client timed out). The need for quorum writes is to be able to survive F failures out of 2F+1 replicas. Reads are not quorum reads, just replica-local reads.

        Let's say we have 5 replicas with F1 as leader; F4 and F5 are ignored here as they don't matter.

        1a F1 -> proposal -> F2
        1b F1 <-  ack     <- F2
        2a F1 -> proposal -> F3
        2b F1 <-  ack     <- F3
        3a F1 ->  OK      -> client
        3b F1 -> COMMIT   -> F2,F3
        

        If F1 fails immediately after step 1b, F2 would become the leader since it has the latest seq number. Now only F2 has the proposal, but it can continue and commit it to the other followers.
        If it can't get a quorum (maybe it's partitioned into a minority) then it gives up leadership. When it rejoins the majority, it runs another recovery procedure that uses epoch numbers to determine whether it needs to throw away that proposal. This is fine since no client has actually been told that the proposal was committed. This is detailed in the paper.

        Sylvain Lebresne added a comment -

        Let me first say that as far as I'm concerned, the priorities for this ticket should be:

        1. correctness (as in, it would be nice not to spend the next 4 years fixing corner cases).
        2. an implementation that doesn't make Cassandra's code some Frankenstein monster.

        You'll note that being fast is not on this list. That doesn't mean I don't care about fast; all other things being equal, faster is better. But I don't think we're targeting heavy use of conflicting CAS, not at first at least. The typical use case we're targeting initially is the one described by Jonathan, the unique user account creation problem. In that use case, 1) your application probably doesn't do many user creation checks (compared to the overall cluster load at least) and 2) it's fairly easy to partition the problem so that your CAS operations will rarely conflict (so Paxos failing at liveness is not a huge concern).

        All that to say that I'd rather start with something as simple as possible, and consider optimisations later (as Jonathan said, if we expose this as a CAS, we can switch the implementation later if need be).

        Now, for our use case, I do believe Paxos without optimisation (that does mean one complete Paxos round per CAS) is simpler than Zab (to be honest, I haven't looked that much at Zab, but what I've seen does seem more complicated in practice). I'm aware of papers like http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/paper2-1.pdf but their main difficulties are due to 1) multi-Paxos (not what I'm suggesting) and 2) replicated log growth (it's difficult for them for reasons that don't apply to us: we fully control the data to which the replicated log applies). Could be that I'm over-optimistic but...

        Anyway, all that to say that I'm going to take a shot at prototyping a version with a simple Paxos. I've tried to detail a concrete implementation a bit on paper and it doesn't look too bad; we'll see. But feel free to take a shot at Zab and see how that goes.

        That does lead me to a more mundane problem: how do we expose a CAS in CQL3 (Thrift is simpler, as it happens)?

        Jun Rao added a comment -

        To support things like CAS, the easiest way is for all writes to go to a leader replica that orders all incoming writes. There are different approaches for the leader to commit data. One approach is the quorum-based one used in Paxos, ZK and Spinnaker. The advantage of this approach is that it can hide the latency of a slow replica. The disadvantage is that for 2f+1 replicas, it only tolerates f failures (instead of 2f failures). While this is ok for ZK since it only stores state info, it's probably not ideal for systems that store real data. For that reason, in Kafka, we designed a slightly different approach for maintaining strongly consistent replicas. The details can be found in the ApacheCon presentation that I gave yesterday (http://www.slideshare.net/junrao/kafka-replication-apachecon2013). The Kafka design doesn't do paxos, but depends on ZK for leader election. So, the implementation is a bit simpler than that used in Spinnaker.

        Jonathan Ellis added a comment -

        Thanks, Jun!

        Jonathan Ellis added a comment -

        I think I can summarize things this way:

        1. Embedding or requiring ZK is a non-starter. So anything that starts with that as a dependency (Spinnaker or Kafka-style replication) means "implement ZAB [or Paxos] first."
        2. A global master to designate cohort leaders as in HBase or Hibari is also a non-starter.
        3. ZAB recovery after leader failure is a mess. (I recommend the 2008 paper over the 2011 one; it's far more readable.)
        4. All of the options require some kind of "proposal storage" separate from committed rows.

        So I keep coming back to, "first implement raw Paxos; then we can use that as a building block to optimize later if CAS performance becomes more of a priority." (Proposal storage may in fact be the initial bottleneck, not the replication strategy.)

        Jonathan Ellis added a comment -

        I'm working on a sketch of this approach.

            public static boolean cas(ByteBuffer key, ColumnFamily expected, ColumnFamily updates)
            {
                // read existing row
        
                // compare current value with expected
                if (expected does not match current)
                    return false;
        
                // apply the update!
        
            }
        

        I'm adding an optional "[time]UUID paxosBallot" to RowMutation for the proposal IDs. The updates to RMVH and callbacks should be pretty straightforward. I think Sylvain talked me into using a system CF instead of a flat file to hold uncommitted proposals.

        The "read" here must be paxos-aware to prevent our lost-ack problem: it must return both committed value and the highest accepted proposal. Making read/resolve paxos-aware is going to be the messy part.

        Since we need it for CAS anyway, I think we should also expose this read path as ConsistencyLevel.SERIAL or similar.

        Starting with Thrift because "boolean cas(...)" is straightforward. What syntax do we use to expose this to CQL? A WHERE clause to UPDATE means something different in SQL and would IMO be too confusing.
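        As a concrete illustration of the paxos-aware read described above, the response would need to carry something like the following (a sketch with hypothetical names; Object stands in for ColumnFamily): both the committed row and the highest accepted-but-uncommitted proposal, so the coordinator can detect and finish a proposal whose commit ack was lost.

            import java.util.UUID;

            // Hypothetical shape of a paxos-aware read response.
            record PaxosAwareReadResponse(Object committedRow,
                                          UUID highestAcceptedBallot,   // null if nothing is in flight
                                          Object highestAcceptedUpdates) {}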

        Aleksey Yeschenko added a comment -

        What syntax do we use to expose this to CQL? A WHERE clause to UPDATE means something different in SQL and would IMO be too confusing.

        IF?

        Sylvain Lebresne added a comment -

        The problem I have relating to CQL is not so much the syntax, but the fact that a CAS is supposed to return a value (i.e. its success), so it doesn't really fit into UPDATE in that sense.

        Aleksey Yeschenko added a comment -

        We could throw an exception if the update is not a success. Ugly-ish, but what other options are there?

        Sylvain Lebresne added a comment -

        We could throw an exception if the update is not a success.

        The problem is, it only works if you only allow one CAS per update and nothing else. And if we're going to have that kind of restriction, maybe it's just simpler to have a specific CAS statement.

        Ugly-ish, but what other options are there?

        Well, we could perfectly well return a result, something looking like:

        k | c | cas(c, 3, 4)
        --------------------
        0 | 1 | true
        

        Though returning that from an UPDATE is probably not the best idea. But if we add a specific statement for CAS, then it's perfectly OK to return that kind of result.

        Overall, I'm kind of leaning towards having a specific statement. Which could look like say:

        ATOMICALLY UPDATE foo IF c = 3 SET c = 4 WHERE ...
        

        Though for the same reasons as above, we can't really allow that in a batch.

        Those are random ideas though; I just don't have a better one yet.

        Aleksey Yeschenko added a comment -

        +1 for a new statement.

        Cristian Opris added a comment -

        Jonathan, how do you plan on fitting CAS into Paxos? Paxos would give consensus, but what would the consensus be on? The value to write?

        Is the CAS run at each replica or just the proposer? How do you make sure that when you run the CAS locally you have actually learned the previous consensus value (to compare the expected value with)?

        Cristian Opris added a comment -

        I believe UPDATE statements in SQL return the number of rows affected. You could do the same here (for however you define "row" in CQL)

        Jonathan Ellis added a comment -

        A more-complete sketch of the coordinator side:

            /**
             * Apply @param updates if and only if the current values in the row for @param key
             * match the ones given by @param old.  The algorithm is "raw" Paxos: that is, Paxos
             * minus leader election -- any node in the cluster may propose changes for any row,
             * which (that is, the row) is the unit of values being proposed, not single columns.
             *
             * The Paxos cohort is only the replicas for the given key, not the entire cluster.
             * So we expect performance to be reasonable, but CAS is still intended to be used
             * "when you really need it," not for all your updates.
             *
             * There are three phases to Paxos:
             *  1. Prepare: the coordinator generates a ballot (timeUUID in our case) and asks replicas to (a) promise
             *     not to accept updates from older ballots and (b) tell us about the most recent update it has already
             *     accepted.
             *  2. Accept: if a majority of replicas reply, the coordinator asks replicas to accept the value of the
             *     highest proposal ballot it heard about, or a new value if no in-progress proposals were reported.
             *  3. Commit (Learn): if a majority of replicas acknowledge the accept request, we can commit the new
             *     value.
             *
             *  Commit procedure is not covered in "Paxos Made Simple," and only briefly mentioned in "Paxos Made Live,"
             *  so here is our approach:
             *   3a. The coordinator sends a commit message to all replicas with the ballot and value.
             *   3b. Because of 1-2, this will be the highest-seen commit ballot.  The replicas will note that,
             *       and send it with subsequent promise replies.  This allows us to discard acceptance records
             *       for successfully committed replicas, without allowing incomplete proposals to commit erroneously
             *       later on.
             *
             *  Note that since we are performing a CAS rather than a simple update, we perform a read (of committed
             *  values) between the prepare and accept phases.  This gives us a slightly longer window for another
             *  coordinator to come along and trump our own promise with a newer one but is otherwise safe.
             *
             * @return true if the operation succeeds in updating the row
             */
            public static boolean cas(String table, ByteBuffer key, ColumnFamily expected, ColumnFamily updates)
            {
                // begin a paxos round
                UUID paxosBallot = UUIDGen.getTimeUUID();
                Token tk = StorageService.getPartitioner().getToken(key);
                List<InetAddress> naturalEndpoints = StorageService.instance.getNaturalEndpoints(table, tk);
                Collection<InetAddress> pendingEndpoints = StorageService.instance.getTokenMetadata().pendingEndpointsFor(tk, table);
                Iterable<InetAddress> allEndpoints = Iterables.concat(naturalEndpoints, pendingEndpoints);
                Map<InetAddress, PaxosPrepareResponse> inProgressProposals = preparePaxos(paxosBallot, allEndpoints);
        
                // if we didn't hear from a majority of replicas, we have to bail for now
                int quorum = 1 + Iterables.size(allEndpoints) / 2;
                if (inProgressProposals.size() < quorum)
                    throw new UnavailableException(ConsistencyLevel.SERIAL, quorum, inProgressProposals.size());
        
                // summarize the responses
                UUID mostRecentCommitted = UUIDGen.minTimeUUID(0);
                UUID inProgressBallot = UUIDGen.minTimeUUID(0);
                ColumnFamily inProgressUpdates = null; // null means no in-progress proposal was reported
                for (PaxosPrepareResponse response : inProgressProposals.values())
                {
                    if (!response.promised)
                    {
                        // technically, we can proceed if we get a majority of successful promises,
                        // but if a higher proposal number already exists even on one replica,
                        // chances are it will be on more by the time we send a value to accept
                        return false;
                    }
        
                    if (timeComparator.compare(response.mostRecentCommitted, mostRecentCommitted) > 0)
                        mostRecentCommitted = response.mostRecentCommitted;
                    if (timeComparator.compare(response.inProgressBallot, inProgressBallot) > 0)
                    {
                        inProgressBallot = response.inProgressBallot;
                        inProgressUpdates = response.inProgressUpdates;
                    }
                }
        
                // complete earlier, in-progress rounds if necessary
                if (inProgressUpdates != null && timeComparator.compare(inProgressBallot, mostRecentCommitted) >= 0)
                {
                    RowMutation rm = new RowMutation(table, key, inProgressUpdates);
                    if (acceptPaxos(paxosBallot, allEndpoints, rm))
                        commitPaxos(paxosBallot, allEndpoints, rm);
                    return false;
                }
        
                // read the current value
                ReadCommand readCommand = new SliceByNamesReadCommand(filter for expected);
                List<Row> rows = read(readCommand, ConsistencyLevel.QUORUM);
        
                // compare current value with expected
                if (expected does not match current)
                    return false;
        
                // finish the paxos round w/ the desired updates
                RowMutation rm = new RowMutation(table, key, updates);
                if (acceptPaxos(paxosBallot, allEndpoints, rm))
                {
                    commitPaxos(paxosBallot, allEndpoints, rm);
                    return true;
                }
                return false;
            }
        
            private static class PaxosPrepareResponse
            {
                public final boolean promised;
                public final UUID mostRecentCommitted;
                public final UUID inProgressBallot;
                public final ColumnFamily inProgressUpdates;
            }
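        To complement the coordinator sketch above, here is a minimal, self-contained sketch (hypothetical names, plain JDK types; Object stands in for ColumnFamily) of the per-row acceptor state the prepare/accept/commit messages would act on. In the actual implementation this would be persisted in the system CF mentioned earlier rather than held in memory, and ballots are assumed to be version-1 (time-based) UUIDs so UUID.timestamp() is well defined.

            import java.util.Comparator;
            import java.util.UUID;

            final class PaxosRowState
            {
                // orders time-based UUID ballots by their embedded timestamp
                private static final Comparator<UUID> BY_TIME = Comparator.comparingLong(UUID::timestamp);

                private UUID promisedBallot;       // highest ballot we have promised
                private UUID acceptedBallot;       // ballot of the proposal we have accepted, if any
                private Object acceptedUpdates;    // value of that proposal
                private UUID mostRecentCommitted;  // highest ballot we have seen committed (step 3b)

                record PrepareResponse(boolean promised, UUID mostRecentCommitted,
                                       UUID inProgressBallot, Object inProgressUpdates) {}

                // Phase 1: promise not to accept older ballots, and report what we already have.
                synchronized PrepareResponse prepare(UUID ballot)
                {
                    boolean promised = promisedBallot == null || BY_TIME.compare(ballot, promisedBallot) > 0;
                    if (promised)
                        promisedBallot = ballot;
                    return new PrepareResponse(promised, mostRecentCommitted, acceptedBallot, acceptedUpdates);
                }

                // Phase 2: accept the proposal unless we have already promised a newer ballot.
                synchronized boolean accept(UUID ballot, Object updates)
                {
                    if (promisedBallot != null && BY_TIME.compare(ballot, promisedBallot) < 0)
                        return false;
                    acceptedBallot = ballot;
                    acceptedUpdates = updates;
                    return true;
                }

                // Phase 3: note the commit and discard the acceptance record it came from,
                // so later prepares cannot resurrect an already-committed proposal.
                synchronized void commit(UUID ballot)
                {
                    if (mostRecentCommitted == null || BY_TIME.compare(ballot, mostRecentCommitted) > 0)
                        mostRecentCommitted = ballot;
                    acceptedBallot = null;
                    acceptedUpdates = null;
                    // applying the committed value to the "main" CF is elided here
                }
            }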
        
        Sylvain Lebresne added a comment -

        Paxos is used here to have replicas agree on, basically, the value post-next-CAS. I.e. a proposer proposes its value as if the CAS were ok. If the replicas tell us that our proposal (so our CAS) doesn't win anyway, we bail out (which I don't think is the right thing to do, but that's a detail for now); otherwise we check that our CAS does indeed apply and we send accepts. Now, Paxos basically guarantees that once a majority of replicas have accepted one value, only this value can ever be accepted, even by another proposer.

        So far, raw Paxos only allows us to do one CAS (per row key) ever. I.e. Paxos lets us agree on what the first CAS to go in is, but that's it. To be able to continue, we need to restart a completely new instance of Paxos. How does that happen exactly in your sketch? I think you allude to that in your comment on the commit phase with 3b, where you "discard acceptance records for successfully committed replicas", which I understand more or less as "once we've committed the first Paxos result, we forget all state and start a new Paxos instance afresh". But this is not obviously correct to me. At the very least, it seems to me that you shouldn't start forgetting about the previous instance of Paxos before you've made sure a majority of replicas have learned about the commit result.

        So basically I don't understand how the commit part works, and more generally how you go from one round of Paxos to the next (where by round, I mean a full instance of the Paxos algorithm), since you don't seem to formalize the notion of round.

        Jonathan Ellis added a comment -

        To be able to continue, we need to restart a completely new instance of paxos.

        Yes, that's the tricky part, and none of the papers go into detail here. I think the outline here will work, but I'm open to better ideas.

        At the very least, it seems to me that you shouldn't start forgetting about the previous instance of Paxos before you've made sure a majority of replicas have learned about the commit result.

        "Commit" means "write it to the 'main' CF, update mostRecentCommitted, and discard the proposal records." (Prepare/Accept/Commit are all commitlog'd; normal caveats apply if you don't run with Batch mode.)

        I think you're correct that we need to wait for a majority to agree on mostRecentCommitted before proceeding.
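        A minimal sketch of "wait for a majority to agree on mostRecentCommitted before proceeding", with a hypothetical Replica RPC interface (not Cassandra's actual messaging API): the commit is sent to every replica, but the call only returns once a quorum has acknowledged it.

            import java.util.List;
            import java.util.UUID;
            import java.util.concurrent.*;

            final class CommitQuorumSketch
            {
                interface Replica { boolean commit(UUID ballot, Object updates); } // hypothetical RPC

                static boolean commitAndWaitForQuorum(List<Replica> replicas, UUID ballot, Object updates)
                    throws InterruptedException
                {
                    int quorum = replicas.size() / 2 + 1;
                    ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
                    try
                    {
                        CompletionService<Boolean> acks = new ExecutorCompletionService<>(pool);
                        for (Replica replica : replicas)
                            acks.submit(() -> replica.commit(ballot, updates));

                        int acked = 0;
                        for (int i = 0; i < replicas.size() && acked < quorum; i++)
                        {
                            try
                            {
                                if (acks.take().get())
                                    acked++;
                            }
                            catch (ExecutionException ignored)
                            {
                                // a failed replica simply doesn't count towards the quorum
                            }
                        }
                        return acked >= quorum;
                    }
                    finally
                    {
                        pool.shutdownNow();
                    }
                }
            }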

        Sylvain Lebresne added a comment -

        I think you're correct that we need to wait for a majority to agree on mostRecentCommitted before proceeding.

        I certainly don't fully understand your idea, but Paxos doesn't guarantee that only one proposer will commit. So while you wait for a majority to agree on mostRecentCommitted (to erase the state and start the next round of Paxos), any concurrent CAS may end up committing (the guarantee of Paxos being that the committed value will be the same as that of any other commit for that round), which will change the "mostRecentCommitted".

        Yes, that's the tricky part, and none of the papers go into detail here.

        I agree on both counts.

        Let me try to describe a bit my own current reasoning on this issue (which mainly corresponds to my understanding of what "Paxos made live" describes, even though they don't go into very much detail). Maybe that way you can tell me how your idea differs.

        So, what we're going to do is successive rounds of Paxos, and we make the round explicit. Concretely, we have a round number that is shipped with every message, and each round is a completely separate instance of the Paxos algorithm. And so each round will yield a value. The idea is that this way we build a consistent and replicated log of which operation to apply in which order (operations that we will apply to the data in said order; that's the listening part, but Paxos is here to ensure consistency of the log of operations). I.e. round N is about "let's agree on what's the Nth operation we'll apply on the data".

        Concretely, the first thing a proposer does is fetch its last known committed round N (meaning that, during commit, on top of actually applying the value/operation agreed on, we also basically write that round N is committed). Then it starts a normal Paxos for round N+1 (with whatever value/operation it wants to get in).

        You do the full round of Paxos and one operation wins. As said above, the "listening" part just consists of recording that this is the value for round N and applying it to the actual CF.

        Note that you don't have to wait for anything during commit in that case. It's ok for a proposer to start a proposal on round N even though that round has been committed already (but the proposer is not yet aware of it). It's ok because Paxos guarantees said proposer will learn the value committed for this round N (and will in fact commit it itself). After which it will know N is committed and will be free to start a new proposal for round N+1 with the actual value/operation it wanted to get in in the first place.

        That's actually what happens to a node that fails. When it's back up, it might not be up to date on the most recent round. But that's ok: it will try some old round, learn the value committed for said round (not its own, since it's behind on what's committed), commit it (which mainly has the effect of learning the value locally; it will be old news for the other replicas) and continue like that until it catches up.

        That's the basic idea. Now there are 2 things that are clearly not very practical:

        1. this supposes we keep the replicated log of operations (i.e. the list of which operation has been agreed on for each round) forever. And that's ever-growing.
        2. the part above about back-to-life replicas catching up by redoing all missed rounds of Paxos one by one is obviously a bit lame.

        I think both problems are linked, and really are the issue discussed in "Paxos made live" in section 5.5, Snapshots. Now, the good news is that they have been able to optimize both of those in practice without breaking correctness, so it's possible. That being said, their snapshot idea sounds fairly heavyweight. But I think it should be possible to do something simpler, as we control the data (which they don't, and explicitly cite as a difficulty). I have a few ideas, but honestly they are not fully thought through yet (though let's say that 1) is the hard part imo; for 2) you can have other replicas "repair" a proposer that starts a proposal on an old round so it catches up right away).

        Now, if I ignore 1) and 2) for a minute, I think I understand the principle well enough to be convinced that it is correct. Also, this is really just using Paxos to build a consistent and replicated log of the operations applied, but that log is separate from the data itself. So while this does mean we would basically have to have paxos columns for which all updates go through Paxos, it also means that there is nothing to do on the read path itself.

        Obviously, 1) and 2) need to be optimized for the idea to be realistic, and without breaking correctness. And while I have ideas, I admit I haven't thought all the details through. Still, I wanted to dump my current reasoning on this out.

        I note that it could be that Jonathan's proposal is in fact just the same thing, but where 1) and 2) are optimized so well that there is no real history to keep at all and so no real need to keep track of rounds per se. Which would be great, but if that's the case I think we really need to understand how these optimizations work and convince ourselves that they do preserve correctness. I'd personally need more detail for that.
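        A minimal sketch (hypothetical names, Object standing in for the operation/ColumnFamily) of the per-row state in the explicit-rounds scheme described above: each round is an independent Paxos instance, and commit just means "record the chosen operation for round N and apply it". Problem 1 is visible directly: the log grows forever unless it is truncated.

            import java.util.TreeMap;

            final class ReplicatedRowLog
            {
                private final TreeMap<Long, Object> chosenOperations = new TreeMap<>(); // round -> operation
                private long lastCommittedRound = 0;

                // The round a proposer should target: one past the last committed round it knows of.
                synchronized long nextRound()
                {
                    return lastCommittedRound + 1;
                }

                // Commit = record the agreed operation for this round and apply it to the data.
                synchronized void commit(long round, Object operation)
                {
                    chosenOperations.put(round, operation);
                    // applying the operation to the actual CF is elided here
                    while (chosenOperations.containsKey(lastCommittedRound + 1))
                        lastCommittedRound++; // advance past contiguous committed rounds (catch-up)
                }

                synchronized int logSize()
                {
                    return chosenOperations.size(); // ever-growing unless snapshotted/truncated
                }
            }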

        Cristian Opris added a comment -

        There is this paper that might be of interest, Consensus on Transaction Commit:
        http://research.microsoft.com/apps/pubs/default.aspx?id=64636

        I haven't yet studied it in detail, but it may give some ideas.

        "Paxos made live" seems centered on the idea of having a replicated log. Not sure this applies to what we want to do. There are details on how to do that however in the papers cited, the more relevant I think:

        Lampson, B. W. How to build a highly available system using consensus.
        Schneider, F. B. Implementing fault-tolerant services using the state machine approach: A tutorial.

        Google has links to the papers

        Jonathan Ellis added a comment -

        Attached some diagrams for what I am talking about.

        1. shows a no-failures round. (N+1 looks exactly the same, so omitted.)
        2. shows how mostRecentCommitted saves us from "resurrecting" an aborted write
        3. seems to imply that we still need to keep a full log of all operations the way Sylvain suggests, which makes me sad
        Sergio Bossa added a comment -

        I think we are deviating from the raw Paxos protocol in a number of ways.

        Paxos is a consensus protocol for having distributed processes agree on any value, meaning that a proposer is not bound to its own value; but our use case is a bit different: the proposer is actually bound to its own value and to the latest accepted value, because the CAS operation depends on both. It is true that in your sketch you eventually commit other, higher proposals and return false, but that's not enough, as there may be races between the accept/commit phases.

        In other words, Paxos has no commit phase, as everything happens during accept; the proposer sends the accept request, and the acceptors reply with ok/not ok, and that's it: if it is not ok (meaning the accepted value came from another proposer), and the proposer cares about its own value being accepted (our case!), another round will have to start.

        Reading/Learning the value happens at read time: a quorum read is issued and the highest accepted proposal is taken (and if there's no quorum, it errors out).

        This implies the following to me:
        1) Liveness is sacrificed, but we don't care (fair enough).
        2) No need for the commit phase: everything happens during the accept phase.
        3) No need for the op log: during fail-recovery, older proposals will be discarded and the newest ones can just be "learnt" via a quorum read.

        I just think we should track some kind of epoch for Paxos rounds, to avoid old in-flight (hinted?) rounds "leaking into" newer ones (or maybe the proposal uuid is enough?).

        This actually sounds simpler than Zab (our use case is pretty limited for now), so I'm agreeing with the Paxos choice.

        Or am I missing something?
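        As a sketch of the read-time "learn" step described above (hypothetical names; ballots assumed to be time-based UUIDs), resolving a quorum of replica responses just means taking the value carried by the highest accepted proposal:

            import java.util.Comparator;
            import java.util.List;
            import java.util.UUID;

            final class QuorumReadSketch
            {
                record AcceptedProposal(UUID ballot, Object value) {}

                static Object resolve(List<AcceptedProposal> responses, int replicationFactor)
                {
                    int quorum = replicationFactor / 2 + 1;
                    if (responses.size() < quorum)
                        throw new IllegalStateException("not enough replicas responded");

                    // the row's value is the one with the highest accepted ballot
                    return responses.stream()
                                    .max(Comparator.comparingLong(p -> p.ballot().timestamp()))
                                    .map(AcceptedProposal::value)
                                    .orElse(null);
                }
            }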

        Cristian Opris added a comment -

        Luis Sergio Faria Carneiro: What may happen with this is that you read a value from the most advanced replica and then you try a CAS at a stale replica, which will deny it even if it's legit, because it does not match its stale value.

        I think something like this may work where you track a version counter for each row and you make sure you advance paxos rounds (and version counter) one at a time per quorum.

        Basically the invariant is that a replica initiates or participates in paxos round V only after
        it has committed V-1 locally, which can happen when:

        • it learns a majority has accepted a value at V-1 so it can commit V-1 locally (i.e. paxos round V-1 is settled)
        • it learns that any replica has committed V-1

        I am still fuzzy on how this can be accomplished exactly, but the invariants seem good.

        Cristian Opris added a comment -

        Note that the version counter is per row, and this would only require keeping the last committed and the last accepted values for each row (no log necessary)
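        A minimal sketch (hypothetical names) of the invariant described above: a replica keeps only the last committed and last accepted version and value per row, and refuses to take part in round V until V-1 is committed locally.

            final class VersionedRowSketch
            {
                private long lastCommittedVersion = 0;
                private long lastAcceptedVersion = 0;
                private Object lastAcceptedValue;
                private Object committedValue;

                // Invariant: only initiate or participate in round V once V-1 is committed locally.
                synchronized boolean mayParticipateIn(long version)
                {
                    return version == lastCommittedVersion + 1;
                }

                synchronized boolean accept(long version, Object value)
                {
                    if (!mayParticipateIn(version))
                        return false;
                    lastAcceptedVersion = version;
                    lastAcceptedValue = value;
                    return true;
                }

                // Called when we learn a majority accepted V, or that some replica committed V.
                synchronized void commit(long version, Object value)
                {
                    if (version == lastCommittedVersion + 1)
                    {
                        committedValue = value;
                        lastCommittedVersion = version;
                    }
                }
            }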

        Jonathan Ellis added a comment -

        I think we can fix the "partial commit" problem in my diagram 3. The key is forcing CAS updates to occur with strictly increasing column (cell) timestamps. Then, we can rely on standard "use the newest value" read-repair. Specifically:

        1. Instead of replicas checking raw timeuuid order for propose/promise, they will require that a new ballot be larger in the time component than an accepted ballot or mostRecentCommitted. (We do still want to use a timeuuid value though instead of a raw timestamp to guarantee uniqueness across proposals.)
        2. Coordinators will generate ballots from the min timestamp in the new columns being proposed. Thus, any committed proposal will have a higher timestamp than any previously committed one.

        The good:

        1. No need for a hairy mess of CAS ballot order trumping timestamp order during HH/AES/RR. Much easier if ballot/timestamp order are the same.
        2. Non-CAS ops can be mixed in with CAS ones with sane results (as long as potentially concurrent ones are all CAS, of course).

        The bad:

        1. Just the obvious (big) one: we're rate-limited by both clock resolution and clock skew. But, this is reasonable for our goals for 2.0. And I'm not actually sure it's even possible to avoid, if we want to allow mixing CAS and non-CAS ops in the same CF (see "hairy mess" above).

        Notes:

        1. Rejecting "newer" proposals w/ equal time components slows us down (we return false and client has to try again w/ a newer ballot) but does not compromise correctness.
        2. Different CAS ops may operate on different sets of columns, so the fact that we RR for one CAS op does not mean that we've caught up the affected replicas entirely, but it does mean we've caught them up for the columns being checked this time, which is what we care about.
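        A minimal sketch (hypothetical names; ballots assumed to be version-1 time UUIDs) of the replica-side ballot check described in point 1 above: a new ballot is only honored if its time component is strictly greater than both the in-progress ballot and mostRecentCommitted, which is what forces committed CAS updates to carry strictly increasing timestamps.

            import java.util.UUID;

            final class BallotOrderSketch
            {
                static boolean acceptable(UUID newBallot, UUID inProgressBallot, UUID mostRecentCommitted)
                {
                    return newerInTime(newBallot, inProgressBallot) && newerInTime(newBallot, mostRecentCommitted);
                }

                private static boolean newerInTime(UUID candidate, UUID existing)
                {
                    // equal time components are rejected too: the client must retry with a newer
                    // ballot, which slows things down but does not compromise correctness
                    return existing == null || candidate.timestamp() > existing.timestamp();
                }
            }
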
        Jonathan Ellis added a comment -

        Sergio, the reason to go beyond pure Paxos by turning Learn into Commit is twofold:

        1. To allow a fresh start for a new round of CAS; otherwise, each new proposal is stuck re-issuing the first accepted value
        2. To move the data from the Paxos "maybe it's accepted but we don't know until we get a majority" limbo into a normal sstable cell
        Sergio Bossa added a comment -

        Sergio, the reason to go beyond pure Paxos by turning Learn into Commit is twofold:
        To allow a fresh start for a new round of CAS; otherwise, each new proposal is stuck re-issuing the first accepted value
        To move the data from the Paxos "maybe it's accepted but we don't know until we get a majority" limbo into a normal sstable cell

        Yes, I get why it's there; I'm just not sure it's actually needed (which may well be due to my limited knowledge of hardcore C* internals, in particular around HH, RR and AE).

        By the way, I don't have enough time to make any more meaningful contributions to this ticket, so I'm backing off; it is in good hands anyway.

        Cristian Opris added a comment -

        Jonathan, even if you could rely on monotonically increasing timestamps (which is a big assumption), I don't think this will work because it does not clearly demarcate between paxos rounds.

        So you could have a scenario where you end up with different values committed at each replica:

             R1        R2       R3 
        1.     C0        C0       C0  //initial state ts=0
        2.             P1 <     P1<   //R3 initiates proposal ts=1 
        3.             A1 <     A1<   //accept ts=1
        4.                        C1  //R2 has majority, commits ts=1
        5.  >P2       >P2             //R1 initiates proposal ts=2
        6.  >A2       >A2             //accept ts=2; note this breaks Paxos since R1 should have chosen A1
        7.     C2                     //R1 commits C2 
        

        After step 7, R1=C2, R2=C0, R3=C1

        If a read comes in at this point, what would it resolve to? You could say "use the highest timestamp", but that would require a read at ALL.

        More importantly, if a CAS request comes in, its validation depends on which replica it executes on (unless, again, we do a read at ALL first).

        The reason I suggested version counters is that they allow a replica to detect that it has missed paxos rounds and needs to sync up before proceeding.

        The example above modified:

             R1        R2       R3 
        1.     C0        C0       C0  //initial state v=0
        2.             P1 <     P1<   //R3 initiates proposal v=1 
        3.             A1 <     A1<   //accept ts=1
        4.                        C1  //R2 has majority, commits ts=1
        5a.  >P1'                     //R1 wants to initiate its own P1
        5b.  << nack P1'              //but rejected since already committed
        5c.  << read C1               //read and commit C1 (finish round 1)
        5d.  >P2                      //restarts proposal with v=2
        5e.            >P2 & C1       //R2 receives P2 and notices it's missing C1 which it needs to commit first
        6.   >A2        >A2           //accept v=2; this is ok for Paxos as it's truly a new round 
        7.     C2                     //R1 commits C2 
        

        After step 7
        R1=(C2,A2) R2=(C1,A2) R3=(C1,A1)

        The most ambiguous quorum is R2,R3. Let's even assume that R1 has failed.
        The ambiguity can still be solved by initiating a new paxos round at version v=2, which will necessarily accept and commit A2 (this follows from Paxos).

        So to have a consistent read, the read might perform a paxos round to commit A2.

        This is a sketch of a proof that this is correct:

        • if no replica can participate in a paxos round for version V, as acceptor or proposer, until it learns and commits locally the previous version V-1
        • then for Paxos to achieve a quorum of accept at V, a quorum of replicas must have committed V-1
        • once a quorum has accepted the same value for V, all replicas can eventually learn and commit V by simply rerunning a paxos round at V with value Nil (this can be triggered by an attempt to write V+1, or a read as shown above)
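
        (To make the gating rule in the first bullet concrete, here is a minimal sketch with invented names — it assumes a per-row version counter and is not meant as actual Cassandra code:)

        final class VersionGateSketch
        {
            long committedVersion; // highest version this replica has locally committed (learned)

            enum Response { PROMISE, ACCEPT, NACK_STALE_PROPOSER, NACK_MUST_CATCH_UP }

            // Called for both propose and accept messages carrying version v.
            Response onPaxosMessage(long v, boolean isAccept)
            {
                if (v <= committedVersion)
                    return Response.NACK_STALE_PROPOSER;  // proposer is behind: reply with our committed value
                if (v > committedVersion + 1)
                    return Response.NACK_MUST_CATCH_UP;   // we are behind: must learn v-1 before participating
                return isAccept ? Response.ACCEPT : Response.PROMISE; // v == committedVersion + 1: normal Paxos
            }
        }
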
        Jonathan Ellis added a comment -

        I think you missed this part:

        Instead of replicas checking raw timeuuid order for propose/promise, they will require that a new ballot be larger in the time component than an accepted ballot or mostRecentCommitted. (We do still want to use a timeuuid value though instead of a raw timestamp to guarantee uniqueness across proposals.)

        and this:

        Rejecting "newer" proposals w/ equal time components slows us down (we return false and client has to try again w/ a newer ballot) but does not compromise correctness.

        (So we do not "rely" on monotonically increasing timestamps; we reject coordinators until they propose an acceptable one.)

        Cristian Opris added a comment -

        Ok, but I think my point was that even if you can assume that (monotonic time), then it's still not correct, because when a proposal with a new value and a higher timestamp than the last committed one comes in, accepting it over a previously accepted value would violate paxos. That is step 6 in my first example there. This at least breaks CAS and cannot give a consistent read.

        However, I confess I don't fully understand your solution. Could you summarize or formalize it a bit?

        Cristian Opris added a comment -

        One more example of how mostRecentCommit is ambiguous:

        R1           R2      R3
          Ct0          Ct0     Ct0        //initial state at t0
                     Atn<     Atn <       //accept at Tn > t0
                              Atn -> Ctn  //R3 commits Ctn, mostRecentCommit = tn, Accept is cleared !
        >Atn+m                >Atn+m      //R3 accepts new value at tn+m > tn, this is valid since accept has been cleared
        Atn+m -> Ctn+m                    //ambiguous state with R1=Ctn+m, R2=Ct0, R3=Ctn, needs read ALL to resolve
        
        Cristian Opris added a comment - - edited

        OK, I believe what you're proposing is very close to what I am thinking.

        Essentially you're using mostRecentCommit timestamp (mrc) to track the paxos instance, while I am proposing to use a sequence value that is incremented on local commit.

        I expect that in your case as well, this epoch number (let's call it that) is different from the proposal number, which can indeed be a timestamp (timeuuid).

        It seems this epoch doesn't have to be sequential, so a timestamp could work. (I would still go with a sequence, just to not depend on the clock at all, but it's not necessary.)

        I reworked the example above with more detail, and it seems correct:

        
        R1           R2        R3
          Ct0          Ct0       Ct0       //initial state at t0
                    Ptn(mrc=t0) <-         //R3 makes a proposal numbered tn with most recent commited t0
                     --   ok   -->         //R2 promises 
                     Atn<     Atn <        //accept at Tn > t0
                              Atn -> Ctn   //R3 commits Ctn, mrc=tn, accept is cleared
            ---> Ptn+m(mrc=t0) >           //R1 makes a proposal tn+m with mRC=t0, last it knows of
            <--- nack (Ctn)                //R3 rejects since stale mRC; send Ctn directly for R1 to learn
         Ctn
            ---> Ptn+m(mrc=tn)             //propose again at mrc=tn                       
            <- ok ----------------         //R3 promises since mrc up to date
        >Atn+m                >Atn+m       //R3 accepts new value at tn+m > tn
        >Ctn+m                    
        

        State:
        R1=(Ctn+m), R2=(Ct0,Atn), R3=(Ctn,Atn+m)

        Now I think this is pretty much like the variant with version counter above.

        To do a consistent read, the read may have to complete the paxos round for Atn+m, but it's guaranteed to resolve to Ctn+m whatever quorum it reads.
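
        (A purely illustrative sketch of the mrc handshake in the diagram above, with invented names: a replica only promises when the proposer's mostRecentCommit matches its own, and otherwise nacks and ships its latest commit so the proposer can learn it and re-propose.)

        final class MrcPromiseSketch
        {
            static final class Reply
            {
                final boolean promised; final long mrc; final Object commitToLearn;
                Reply(boolean promised, long mrc, Object commitToLearn) { this.promised = promised; this.mrc = mrc; this.commitToLearn = commitToLearn; }
            }

            long mostRecentCommit = 0;    // timestamp of the last value this replica committed (learned)
            Object committedValue = null; // the value committed at mostRecentCommit
            long promisedProposal = Long.MIN_VALUE;

            Reply onPropose(long proposalNumber, long proposerMrc)
            {
                if (proposerMrc < mostRecentCommit)
                    return new Reply(false, mostRecentCommit, committedValue); // nack: stale mrc; ship our commit for the proposer to learn
                if (proposalNumber <= promisedProposal)
                    return new Reply(false, mostRecentCommit, null);           // usual Paxos rule: already promised a newer proposal
                promisedProposal = proposalNumber;
                return new Reply(true, mostRecentCommit, null);                // promise
            }
        }
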

        Sylvain Lebresne added a comment -

        Jonathan Ellis Still a bit confused by your diagram, really. Mainly, it seems from what's above that your mostRecentCommitted is a Paxos proposal number, but then you seem to use it to decide whether to move from one round to the other. So does that mean that what you call a round in those diagrams is just a proposal number (in which case, I'm still confused about how you actually use Paxos to do CAS), or does it mean a full instance of the Paxos algorithm (in which case, I haven't fully understood how you decide it's ok to start a new round without breaking correctness)?

        All that to say, if you have some pseudo-code of the whole thing, I'd be interested.

        In the meantime, I've pushed my own thinking a bit further too, fixing the impracticality I mentioned earlier (of an ever-growing log, basically). I wrote the thing down to convince myself this was working and to get rid of the foggy parts. It is largely a brain dump, though it does contain a fairly precise algorithm that, as far as I can tell, works (but I could definitely have missed something). I put that brain dump at http://goo.gl/pnq4Z, in case that's of interest (it's not short, but I think it's fairly precise). That proposal definitely shares a number of ideas/similarities with Jonathan's, but it's definitely not the same since it doesn't provide the same tradeoffs (it doesn't allow mixing CAS and non-CAS operations for one, and is not "rate limited by clock resolution and clock skew", though I'm not sure what that means tbh). It's fairly simple to implement too. Anyway, just wanted to share my own reflection.

        Cristian Opris added a comment -

        FWIW, just to clarify my own examples (I can't speak for Jonathan): the version counter or most recent commit is NOT the paxos proposal number. I've omitted the Paxos proposal number in most of my examples except for the last, more detailed one. A timeuuid is fine for the proposal number.

        Also, with regard to logging/no logging: I believe you only need to keep a log if you plan to replicate operations rather than state. Transferring state (as we discussed so far) does not require a log but makes it impractical to replicate large values, so that is the main trade-off; I don't believe it's got anything to do with paxos.

        Cristian Opris added a comment - - edited

        Sylvain Lebresne I have read your pseudo-code; it seems pretty much what I was trying to describe with the version counter that counts paxos rounds (except I was thinking at row level rather than column level).

        I noticed however that while the leader's proposal is aborted if it has a stale round, the acceptor algorithm does not handle the case when the
        acceptor replica is behind.

        Basically, in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() < R-1.

        Edit: C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal.

        One way to do that is to nack the proposal, indicating it needs to catch up, and either expect to receive a "snapshot" from the leader or do a read.

        Also note you don't need to send the column values with the proposal. If you get quorum for the proposal you can perform the CAS locally and just send the new column value with the accept.

        Essentially, consensus is on the next column value to write, not on the CAS. Since the proposer is guaranteed to be up to date before sending the accept, it can do the CAS locally.

        Sylvain Lebresne added a comment -

        Basically in the acceptor algorithm you don't seem to handle the case where C_current.timestamp() < R

        Not sure I follow. On the acceptor, if C_current.timestamp() < R, then it means the message is for a round that hasn't been decided yet (more precisely, one we haven't learned the outcome of). That's the case where we actually do some work (as the pseudo-code does, really).

        Also note you don't need to send the column values with proposal

        Not true. Even if a proposer has its value "accepted", there is no guarantee that it will be the one learning it, or even the only one learning it for that matter. For a given round, the value is decided as soon as a quorum of acceptors accepts it, basically, irrespective of whether the original proposer of that value gets the commit acks or not.

        Essentially consensus is on the next column value to write

        Yes. As I discuss later in my brain dump, we'd likely want to support normal inserts at least. The actual operation is largely irrelevant (that's why it's a blob in the paxos_state table).

        Cristian Opris added a comment - - edited

        Sorry, I've probably edited my comment after your reply.

        C_current.timestamp needs to be exactly R-1 if you increment the counter on sending the proposal.

        If it's less, then the acceptor hasn't learned the previously committed value (R-1), so it can't participate in round R; otherwise we're mixing up rounds.

        If it's more, then the proposer is behind so you already handle that.

        Regarding " If you get quorum for the proposal you can perform the CAS locally and just
        send the new column value with the accept"

        By that I meant you can do the validation part of the CAS locally, not actually write the CAS.

        Basically any operation (not just CAS) can be evaluated (in memory) by the proposer after it gets quorum for the proposal (which guarantees it has the latest committed value), so it obtains the value to send for acceptance. This is more of an optimization where you exchange and agree on values rather than operations (state transfer replication). It also solves the problem of where to validate the CAS.
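
        (To illustrate what is being suggested here — a hypothetical API, sketching the idea rather than claiming it is the right design for C*: once a quorum has promised, the proposer holds the latest committed value, so it can evaluate the CAS condition locally and send only the resulting value in the accept round, i.e. replicate state rather than operations.)

        final class LocalCasEvaluationSketch
        {
            interface Quorum
            {
                boolean prepare(long proposal);                 // true if a quorum promised
                Object latestCommittedValue();                  // up to date once prepare() has succeeded
                boolean accept(long proposal, Object newValue); // only the computed value is sent
            }

            static boolean cas(Quorum quorum, long proposal, Object expected, Object newValue)
            {
                if (!quorum.prepare(proposal))
                    return false;                                             // retry with a higher proposal
                if (!java.util.Objects.equals(quorum.latestCommittedValue(), expected))
                    return false;                                             // CAS condition validated locally, and it failed
                return quorum.accept(proposal, newValue);                     // consensus is on the value, not the operation
            }
        }
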

        Sylvain Lebresne added a comment -

        C_current.timestamp needs to be exactly R-1

        I don't think that's necessary. We never mix rounds, because all operations and paxos state include the round number. If you're an acceptor that is behind and some proposer that is also behind proposes some value, then so be it (we still guarantee the proposer will learn the value that was decided). Paxos is fine with an acceptor failing and recovering at any time, which is exactly what this is about.

        Cristian Opris added a comment -

        Say you have this:

        Proposer has committed R-1, starts round R, proposal timestamp Tn

        Acceptor recovers with committed R-n < R-1, and has accepted value A at R-n+1 < R-1 at Tm in the paxos state log.

        When the acceptor receives the proposal, if it doesn't check R and Tm > Tn (clock mismatch), then according to paxos it needs to send its old accepted value and the proposer will have to use it to commit. It will end up committing an old value.

        It's an edge case but not impossible. Paxos holds within the same round, but not across rounds.

        This makes sense because a Paxos round just means agreeing on a value which, once accepted by a quorum, can never change.

        Which is why you can't have an out of date replica participate in a round.

        The idea is to move from a quorum that committed (learned) R, to a quorum that accepts R+1, to a quorum that commits R+1, and so on. Note the quorums don't need to be made up of the same nodes.

        To ensure this you maintain the invariant that you can't propose or accept R+1 locally if you haven't committed R.

        So a replica can die and recover, but to recover and participate in paxos it needs to learn the latest value.

        This also gives you a consistent read (at the possible cost of an extra read paxos proposal to ensure that the last paxos round is committed if it was left ambiguous).

        Cristian Opris added a comment -

        So I think what you're doing at the moment is effectively using the (R,P) tuple as the proposal number within a single continuous Paxos instance, which sometimes agrees on things and in which replicas sometimes learn the agreed value.

        Sylvain Lebresne added a comment -

        When Acceptor receives proposal, if it doesn't check R, if Tm > Tn (clock mismatch) according to paxos it needs to send it's old accepted value

        In my solution/proposal, a round is a fully separate Paxos instance (it's not just a proposal round, for instance). Meaning that comparing a proposal timestamp Tn for round R with a proposal timestamp Tm for another round R-1 (which is the case you are talking about) makes no sense whatsoever and is thus never done. And it doesn't break Paxos exactly because we don't mix things from two completely separate instances of the raw Paxos algorithm.

        Cristian Opris added a comment -

        Sylvain Lebresne I see you intend to separate it but I'm not sure the separation is correct.

        In receive(PROPOSE) you have:

        if (P_max != null && P <= P_max)
        {
            // We've already seen a more recent proposal
            send(REJECT);
        }
        

        If this doesn't hold you have the tm > tn thing I was talking about and then:

        // Read the most recent *accepted* proposal we have for round R (simple slice query).
        P', CAS(u', v') = read_max(paxos_state, C, R, 'accepted')
        

        If I read that correctly as read_max(R) this could be a stale accept from a previously unfinished round that the acceptor never committed.

        Cristian Opris added a comment -

        Or maybe I misread that. Does it mean slice on R, or slice on the latest proposal for R?

        Sylvain Lebresne added a comment -

        If a node receives a PROPOSE for round R there are 2 possible cases:

        • either R is a round for which the node has already "learned" the value. In that case, we will take the "if (C_current.timestamp() >= R)" branch, i.e. the "repair the proposer that we know is behind" one.
        • or R is a round for which the node hasn't already learned the value. In that case, we read whatever paxos state we have for round R. Whether round R has been "learned" by another replica doesn't matter at all. That is, say the last round the node has "learned" about is R_last; then it's absolutely possible for the R received in the propose to be R_last+1 or R_last+2 or R_last+whatever. The point being, every read of the paxos state is parametrized by the round itself. So a propose for round R_last+2 will look at what the replica has (or doesn't have) for round R_last+2 only (the state for round R_last+1 is not queried). Note that a node can absolutely have state for more than one round at the same time, but it will never mix those by construction.
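
        (Restating the two cases above as a minimal sketch with invented names — every lookup of paxos state is keyed by the round, so rounds are never mixed even if the node is missing earlier ones:)

        import java.util.HashMap;
        import java.util.Map;

        final class PerRoundStateSketch
        {
            static final class RoundState { long promisedProposal = -1; long acceptedProposal = -1; Object acceptedValue; }

            final Map<Long, RoundState> stateByRound = new HashMap<>();
            long lastLearnedRound = 0; // C_current.timestamp() in the document's notation

            String onPropose(long round, long proposal)
            {
                if (round <= lastLearnedRound)
                    return "ABORT + the already-learned value";  // case 1: repair the proposer, it is behind
                RoundState s = stateByRound.computeIfAbsent(round, r -> new RoundState()); // case 2: state for round R only
                if (s.promisedProposal >= proposal)
                    return "REJECT";                             // we already promised a more recent proposal for this round
                s.promisedProposal = proposal;
                return "PROMISE (+ any value already accepted for this round)";
            }
        }
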
        Cristian Opris added a comment -

        I see, it makes sense now. I thought read_max reads the last accepted round instead of the last proposal for the current round.

        That would be correct strictly from a paxos point of view, but it still would not give us the consistency in terms of quorum commit that we desire.

        Consider the following case (note I'm omitting proposals and proposal numbers as they're not relevant):

             X        Y         Z 
        1.     C0        C0       C0       //initial state committed R=0
        2.             A(R=1)<  A(R=1)<    //Z leads R=1
        3.                       C1        //Z has majority with Y, commits locally but fails to commit on Y and X
        4.  A(R=2)<             A(R=2)<    //Z leads R=2
        5.     C2                C2        //Z has majority with X, commits locally and on X, but not on Y
        6.            A(R=3)<   A(R=3)<    //Z leads R=3
        7.                       C3        //Z has majority with Y, commits locally but fails to commit on X and Y
        
        

        Now state is X=C2, Y=C0, Z=C3

        There's no commit quorum so to do a consistent read you'd probably have to resolve all unfinished rounds.

        To prevent that I think we should ensure a quorum is committed before we proceed with the next round. Having the acceptor check it has the latest
        round before accepting would achieve that.

        You might say "do not commit locally until we commit to a quorum" but I fear this can still lead to a similar situation

        Sylvain Lebresne added a comment -

        I'm sorry but you are not reading what I have written. Your example is not a possible execution of the algorithm in my document. Namely, no proposer will start proposing on round 2 until it has "learned" the value for round 1. And values are learned only once the full paxos algorithm has unfolded. This is not at all what your example does: you're incrementing the round as soon as the agree phase is done, basically, which is not what I'm suggesting. Maybe what confuses you is that what I call "commit" is the act of a proposer asking the acceptors to accept a given proposal. This is different from what Jonathan calls commit, which I call "learn". I'm sorry if that's confusing, but at the same time, I think my naming makes sense in my proposal for a number of reasons (but I'm not saying Jonathan is wrong in using his own terminology, because 1) I don't understand Jonathan's proposal at this point, so I can't judge if the terminology is good or not, and 2) this is only terminology; it shouldn't matter too much as long as the context makes it clear what is what (and I've written 500 lines of context!)).

        Cristian Opris added a comment -

        I understand what you mean by terminology but this is not where the confusion is coming from.

        My commit C1, C2, etc. is your learn, agreed. My accept is your commit.

        It may be a bit confusing because I'm not detailing everything in the diagram.

        So when Z goes into C1, that implies: it receives an accept from Y, it commits (i.e. writes) the value locally, and then it sends a learn message to X and Y, which might fail without Z having any record of that.

        I know this is not the exact behaviour in your algorithm. I'm not sure how the leader commits (learns) the value locally; is it because it ends up calling receive(LEARN) locally (i.e. acting as an acceptor as well)?

        But this doesn't change my point.

        *My point is that the learn can fail without the leader being aware, which leads to a state where each replica is at a different stage of learning. Even if the paxos round states are correct in terms of accepted values (what you call commit), they are not finished in terms of learning.*

        Cristian Opris added a comment -

        Let me rewrite the example with your terminology:

             X        Y         Z 
        1.     L0        L0       L0           //initial state committed R=0
        2.             P/C(R=1)<  P/C(R=1)<    //Z sends propose/commit (i.e. paxos round) R=1
        3.                       L1            //Z has majority with Y, learns locally L1 but X and Y fail to learn
        4.   P/C(R=2)<             P/C(R=2)<   //Z leads R=2
        5.     L2                L2            //Z has majority with X, both learn L2 but not Y
        6.           P/C(R=3)<    P/C(R=3)<  //Z leads R=3
        7.                       L3            //Z has majority with Y, learns locally L3 but X and Y fail to learn
        
        Cristian Opris added a comment -

        So looking at your proposer algorithm where you do:

        send(LEARN(C_new));

        This can succeed on the lead replica (locally, assuming you're using one of the replicas as the lead) but fail on all or some of the others.

        The replica can happily continue with the next round even if a quorum has not yet learned the previous round, leading to the situation I'm describing.

        Cristian Opris added a comment -

        BTW, last time I looked into it, Cassandra still didn't guarantee monotonic reads or read-after-write even with quorum read/write consistency levels.

        There is still a window (even if small) where a quorum read catches the replicas in a partial write state (A,A,B) with B's timestamp > A's timestamp, and the result depends on which majority I read, correct? So we shouldn't rely on that for paxos, considering paxos is meant to give truly consistent reads (hopefully).

        Sylvain Lebresne added a comment -

        So looking at your proposer algorithm where you do:
        send(LEARN(C_new));
        This can succeed on the lead replica (locally assuming you're using one of the replicas as lead) but fail on all or some of the others.

        Sure it can. But it doesn't matter. It's indeed possible for one proposer to accept/commit/learn multiple rounds while being the only one actually getting the learn. That's not a problem in itself. The point being, if another replica (say X at the end of your example) starts proposing, it will not do it in round 3, because it has not learned about that round yet. It will try round 1. At which point, either Z is part of the quorum that agrees on its message, and X will get an abort with the round 3 value, which it will apply and then propose on round 4; or Z is not part of said quorum, but then paxos guarantees that X will end up learning the exact same value Z learned for round 1 (because that quorum of replicas hasn't erased its paxos state for round 1, since erasure is done on reception of learn, and they haven't learned (if they have learned, we're in the Z case where ABORT is sent)).
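
        (A minimal sketch, with invented names, of the catch-up behaviour described above: a proposer that is behind proposes at the round after the last one it learned, and every ABORT carries an already-learned round/value that it applies before retrying.)

        final class ProposerCatchUpSketch
        {
            interface ProposeResult { boolean aborted(); long learnedRound(); Object learnedValue(); }
            interface Paxos
            {
                ProposeResult propose(long round, Object value);
                void applyLearned(long round, Object value);
            }

            static void proposeUntilDone(Paxos paxos, long lastLearnedRound, Object value)
            {
                long round = lastLearnedRound + 1;
                while (true)
                {
                    ProposeResult r = paxos.propose(round, value);
                    if (!r.aborted())
                        return;                                             // the round completed normally
                    paxos.applyLearned(r.learnedRound(), r.learnedValue()); // ABORT carried a value learned by others: apply it
                    round = r.learnedRound() + 1;                           // and retry on the following round
                }
            }
        }
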

        Cristian Opris added a comment -

        Agreed; as I said, the paxos itself is correct, and all replicas will eventually learn the outcome.

        However, in the example above, which you agree can happen, how do you do a consistent read when each replica is on its own learned value?

        And by consistent read I mean one that is truly monotonic and truly read-after-write, so that CAS doesn't fail spuriously.

        If the state is X=L2, Y=L0, Z=L3, with no concurrency and a single client, that client needs to be able to read a value, then send back a CAS that replaces that value (single client, remember) and expect that CAS to succeed.

        With a proper CAS, clients can implement more interesting distributed concurrency control, as they can with ZK, so it's worth providing true serial consistency at least at the column level.

        Sylvain Lebresne added a comment -

        how do you do a consistent read when each replica is on it's learned value ?

        The short answer is, you propose the read through Paxos, as a proposer can't get its value committed unless it is up to date.
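
        (A minimal sketch of that answer, with hypothetical helpers: the consistent read is just a Paxos round proposing a no-op, which forces the coordinator to catch up on any undecided or unlearned rounds before it reads its local state.)

        final class SerialReadSketch
        {
            interface Paxos
            {
                void runPaxosRound(Object noOpValue); // assumed to block until the coordinator has learned everything up to the decided round
                Object readLocal(String key);         // read the locally committed/learned columns
            }

            static Object serialRead(Paxos paxos, String key)
            {
                paxos.runPaxosRound(null); // propose a no-op: catching up is a side effect of getting it decided
                return paxos.readLocal(key);
            }
        }
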

        Cristian Opris added a comment -

        Yeah but that doesn't work in this case.

        Suppose the read goes to X and X coordinates with Y being completely ignored, they will learn L2 together when they should learn L3

        Cristian Opris added a comment -

        I mean X and Y coordinate, Z is ignored (or failed)

        Cristian Opris added a comment -

        And I meant the leader is Y

        Cristian Opris added a comment -

        And I see how it could work recursively, where Y learns L2 and retries the read, but this leads to my previous comment that to read consistently you'd have to resolve all previous paxos rounds.

        Cristian Opris added a comment -

        More obvious example:

             X        Y         Z 
        1.     L0        L0       L0           
        2.             P/C(R=1)<  P/C(R=1)<    
        3.                       L1            
        4.             P/C(R=2)<  P/C(R=2)<   
        5.                       L2            
        6.             P/C(R=3)<  P/C(R=3)<  
        7.                       L3
        

        Z dies; that's fine, the values are not lost, they survive in Y. However, a read with just Z and Y would need to resolve the last 3 paxos rounds to get the correct value.

        That could work, but what I'm suggesting is to do that at accept (your commit) time, ensuring you always have a quorum that has learned the latest value.

        Cristian Opris added a comment -

        Blimey I meant a read with X and Y obviously since Z dies...

        Sylvain Lebresne added a comment -

        That could work

        Not only could it work, but that's exactly what the pseudo-code in my document does (I'm sorry to be rude, but I took the pains of writing 500 lines of explanation/pseudo-code in a separate document to avoid "spamming" the JIRA with my whole train of thought. It would have been nice of you to make sure you understand what's there before shooting off a comment here every other minute, because this will make it very hard to follow for other people interested in the issue).

        And I don't pretend my proposal can't be optimized, only that it's correct. But truth be told, in practice, learn messages won't be lost randomly as in your example, so I'm not even sure it's worth optimizing in practice (or that it would be an optimization in the first place; "ensuring you always have a quorum that has learned" during accept won't have 0 cost).

        Cristian Opris added a comment -

        The read path is not detailed in your document. Also even with 500 lines of code and explanations, there can be ambiguities.

        It's not obvious to me that you intended read to repair N rounds of paxos before responding. Yes, could work but I don't feel it's
        the best way to do it.

        Sorry if you feel that I'm spamming this Jira but:

        • It's the only forum I have available. I can't post to the mailing list for objective reasons.
        • you guys can be very selective in what comments you actually care to respond to. I've been going on about version counters and separating rounds for a long time here without getting any interest but then you come back and propose something very similar. Since you posted it I thought you're looking for some actual review and feedback.

        in practice, learn messages won't be lost randomly as in your example

        This is really not an assumption I would make. Why even bother with Paxos if nothing can go wrong anyway?

        Jonathan Ellis added a comment -

        I've posted a draft of a CAS method to https://github.com/jbellis/cassandra/commits/5062 following the outline I posted (way, way) above.

        Stuff that is missing:

        • We probably want to support passing CL.SERIAL to reads, to allow asking "What happened to the CAS operation I just tried, but then the coordinator died before replying to me?" This would just do the prepare phase, and finish any in-progress updates, before performing a quorum read (see the sketch after this list).
        • May want to allow passing a CL to CAS to specify how many replicas you want to ack the commit before returning.
        • Any form of CQL support
        • Any testing
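
        For illustration, here is roughly what such a SERIAL read could look like; everything below is a hedged sketch with made-up names (Paxos, prepare, commit, quorumRead), not the code in the branch:

        // Hypothetical sketch: a SERIAL read runs the prepare phase first, finishes any
        // in-progress proposal it finds (e.g. from a coordinator that died mid-CAS),
        // and only then performs the usual quorum read.
        import java.util.UUID;

        final class SerialReadSketch
        {
            static final class PrepareResult
            {
                final UUID inProgressBallot;    // null if no round is in progress
                final Object inProgressUpdate;  // the value accepted but not yet committed
                PrepareResult(UUID ballot, Object update) { inProgressBallot = ballot; inProgressUpdate = update; }
            }

            interface Paxos
            {
                PrepareResult prepare(String key, UUID ballot);       // phase 1
                void commit(String key, UUID ballot, Object update);  // learn phase
                Object quorumRead(String key);                        // plain QUORUM read
            }

            static Object serialRead(Paxos paxos, String key, UUID ballot)
            {
                PrepareResult r = paxos.prepare(key, ballot);
                if (r.inProgressBallot != null)
                    paxos.commit(key, r.inProgressBallot, r.inProgressUpdate); // finish the stalled round
                return paxos.quorumRead(key); // the read can no longer miss a CAS that merely timed out
            }
        }
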
        Jonathan Ellis added a comment -

        Pushed a new version to https://github.com/jbellis/cassandra/commits/5062-2. Now passes a basic test.

        Sylvain Lebresne added a comment -

        There's an unclear part of the algorithm itself for which I'm not sure what the intent is. During prepare, if a replica "promises", the ballot it sends back in the response is the one it just got (the one the proposer sent). So in the proposer (in SP.cas()), summary.inProgressBallot is necessarily our own ballot (unless it's a reject, but then we don't care anymore). Meaning that in SP.cas(), timeComparator.compare(summary.inProgressBallot, summary.mostRecentCommitted) >= 0 could be simplified to timeComparator.compare(ballot, summary.mostRecentCommitted) >= 0. But it also means that in PrepareCallback, the inProgressUpdate we keep is pretty much chosen randomly (since all inProgressBallot will in fact be equal to ballot). Was that the intent? Or was the intent that during prepare, when the replica promises, it returns the previous inProgressBallot, i.e. the one before setting the new ballot? (I think there might be a problem with both choices, but before getting to that I want to make sure what the initial intent is.)

        Some other remarks while I'm at it:

        • PaxosState.propose always returns true as the first argument of PrepareResponse (it always "promised").
        • mostRecentCommitted never seems to be set.
        • I don't think the commit business works. Commit segments can be deleted at any time due to flush, so I don't see how we can guarantee the persistence of the paxos state. Furthermore, when we replay the commit log paxos entries, we don't re-append them to the commit log, so if a node restarts, plays its log and shuts down right away, it'll lose its paxos state too. Why not just use a System table for the Paxos state? (I don't even think performance would be a big issue, because we can do queries by names, which are relatively cheap; besides, most of the paxos state is deleted by commit, so the only part that will end up in sstables is the mostRecentCommitted, and that's small and very cacheable.)
        • I'm confused by FBUtilities.timeComparator. I'm not sure what the intent of using/comparing the clockSequence first is, but I'm pretty sure this is broken. Shouldn't it compare the timestamps of the UUIDs? (The clock sequence is not the timestamp.) Furthermore, wasn't the goal to have a comparator that only compares the timestamps (and thus doesn't break ties on equal timestamps)? Lastly, wasn't the goal to reuse the ballot timestamp as the timestamp of the columns in the update we finally commit (so that the column timestamps are coherent with the order decided by Paxos)?
        • Currently, the value returned by the cas method doesn't mean what it means for CAS in general. Namely, a false might just mean that we've had one refusal amongst the first quorum of received responses, or that we've had to "replay" a previous round first, and this is independent of whether our CAS applies or not. I strongly believe we should return false only if the CAS doesn't apply, but otherwise we should just restart a new proposal (probably after some small random delay) until we are allowed to propose our value. Because otherwise:
          • the behavior will be unintuitive since it differs from the usual behavior
          • in almost all use cases I can come up with, it will basically force users to do a read every time the cas method returns false, because they have to decide whether their CAS indeed doesn't apply or something else happened.
          • this leaks implementation details.
        • It would be nice to add a comment on what problem the inProgressBallot.equals(ballot) check in PaxosState.commit fixes.
        • Can't we avoid FQRow? We can get back the keyspace from cf contained in the row for instance (this wouldn't work if said cf was null, but we don't have that case since it makes no sense to provide a null cf to SP.cas).

        And two very small nits:

        • In MessagingService, the comment on using padding should be moved before UNUSED_1.
        • In ProposeCallback, successful.addAndGet(1) -> successful.incrementAndGet().
        Jonathan Ellis added a comment -

        During prepare, if a replica "promise", the ballot it sends back in the response is the one it just got

        This is a bug, it's supposed to reply with the ballot of a previously accepted value (so the coordinator can pick the highest such to re-propose).
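
        To make the intended behaviour concrete, here is a minimal, self-contained sketch of the replica side (illustrative names and long ballots, not the code in the branch): on prepare the replica promises the new ballot but reports the proposal it previously accepted, and on propose it only accepts ballots it has promised and that are newer than its most recent commit.

        // Illustrative sketch only; ballots are longs here just to keep the example self-contained.
        final class PaxosReplicaSketch
        {
            static final class PrepareResponse
            {
                final boolean promised;
                final long inProgressBallot;    // previously accepted ballot on promise, promised ballot on reject
                final String inProgressUpdate;  // previously accepted update, or null
                PrepareResponse(boolean promised, long ballot, String update)
                {
                    this.promised = promised;
                    this.inProgressBallot = ballot;
                    this.inProgressUpdate = update;
                }
            }

            private long promisedBallot;
            private long acceptedBallot;
            private String acceptedUpdate;
            private long mostRecentCommittedBallot;

            // Phase 1: promise the new ballot, but report what was previously *accepted*,
            // so the coordinator can re-propose the highest such value.
            synchronized PrepareResponse prepare(long ballot)
            {
                if (ballot <= promisedBallot)
                    return new PrepareResponse(false, promisedBallot, null); // reject: already promised a higher ballot
                promisedBallot = ballot;
                return new PrepareResponse(true, acceptedBallot, acceptedUpdate);
            }

            // Phase 2: accept only proposals whose ballot we promised and that are not
            // older than the most recently committed round.
            synchronized boolean propose(long ballot, String update)
            {
                if (ballot < promisedBallot || ballot <= mostRecentCommittedBallot)
                    return false;
                acceptedBallot = ballot;
                acceptedUpdate = update;
                return true;
            }

            // Learn phase: remember the round; applying the update to storage is elided here.
            synchronized void commit(long ballot, String update)
            {
                if (ballot >= mostRecentCommittedBallot)
                    mostRecentCommittedBallot = ballot;
            }
        }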

        Why not just use a System table for the Paxos state?

        I was hoping to avoid that, but you're right.

        Should that compare the timestamps of the UUIDs (The clock sequence is not the timestamp). Furthermore, wasn't the goal to have a comparator to have it only compare and timestamps (and thus not break tie on same timestamp)?

        Yes, and yes.
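
        For concreteness, the kind of comparator being discussed could look like this; a sketch assuming ballots are version-1 (time-based) UUIDs, not the actual FBUtilities code:

        import java.util.Comparator;
        import java.util.UUID;

        // Compares version-1 UUID ballots by their time component only, deliberately
        // treating equal timestamps as equal (no tie-breaking), as discussed above.
        final class BallotTimeComparator implements Comparator<UUID>
        {
            @Override
            public int compare(UUID a, UUID b)
            {
                // UUID.timestamp() is only defined for time-based UUIDs, which is fine for ballots.
                return Long.compare(a.timestamp(), b.timestamp());
            }
        }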

        wasn't the goal to reuse the ballot timestamp as the timestamp of the columns in the update we finally commit

        I'm on board with that, but since we're generating ballots server-side that means we need to swallow CASSANDRA-5293...

        I strongly believe we should return false only if the CAS doesn't apply, but otherwise we should just restart a new proposal

        I agree, but on the other hand "just retry automatically" is something that we avoid elsewhere for good reason (arbitrarily high latency, with no visibility to the client into what is going on). I also considered introducing a new exception type explaining the state, but throwing an exception when everything is working fine is also not intuitive. So I think all the options kind of suck, but this one sucks least IMO.

        Fixes pushed to the same branch.

        Sylvain Lebresne added a comment -

        it's supposed to reply with the ballot of a previously accepted value

        Ok. But then I think we have the following problem (at least I don't see what prevents it from happening): say you have 3 replicas A, B, C. I'll use MRC for the 'most-recently committed' value, IPB for 'in-progress ballot' and IPU for 'in-progress update'. I'll also use t0, t1, ... for ballots (where t0 < t1 < t2 ...), P0, P1, ... for proposers (any coordinator) and c0, c1, c2 for different column names (with, say, int values) inside the same row.

        • At t0, P0 wants to do cas(c0, null, 3). He prepares, gets all the acks and proposes c0=3, which all nodes accept. However, C dies before receiving the commit. At this point, we have the following states:
          Node | MRC | IPB | IPU  | In storage
          A    | t0  | -   | -    | c0=3
          B    | t0  | -   | -    | c0=3
          C    | -   | t0  | c0=3 | -
        • At t1, P1 wants to do cas(c1, null, 5). He prepares and proposes with A and B answering (so on prepare he gets no IPB). However, during commit, B fails and only A gets it. At that point, C comes back up (but doesn't get the commit). The state is now:
          Node | MRC | IPB | IPU  | In storage
          A    | t1  | -   | -    | c0=3,c1=5
          B    | t0  | t1  | c1=5 | c0=3
          C    | -   | t0  | c0=3 | -
        • At t2, P2 wants to do cas(c2, null, 7). He gets answers back from A and C. From A, he gets MRC=t1 and from C, IPB=t0. So IPB < MRC, and the proposer decides he can proceed with his own value, because C's in-progress state is old and irrelevant. He thus proposes; both A and C acknowledge and both get the final commit. The state is now:
          Node | MRC | IPB | IPU  | In storage
          A    | t2  | -   | -    | c0=3,c1=5,c2=7
          B    | t0  | t1  | c1=5 | c0=3
          C    | t2  | -   | -    | c2=7
        • Now B comes back. Then, at t3, P3 wants to do cas(c2, 7, 9). Every replica gets every message, the cas applies (whether B answers or not, since its IPB is now "too old") and we end up in the following state:
          Node | MRC | IPB | IPU  | In storage
          A    | t3  | -   | -    | c0=3,c1=5,c2=9
          B    | t3  | -   | -    | c0=3,c2=9
          C    | t3  | -   | -    | c2=9
        • Now, at t4, P4 wants to do cas(c1, null, 3). The prepare will work in any case, and so P4 will read the value of c1 at QUORUM before proposing. However, if B and C answer first, then we don't have the correct value: we get c1 = null, making the CAS apply, while in reality c1 = 5, so the CAS shouldn't apply.

        In plain English, I think the problem is that we use successful rounds of Paxos to serialize the CAS operations (which is standard). However, we delete the state of the previous round pretty much as soon as one node has learned the value. We do replay the previous round if some of the replicas that answer the prepare are on that previous round, but if replicas are on an older round (than the previous one), we move on, and thus might erase state while only one replica has learned it, which is not ok.

        One option could be to ensure that a quorum of nodes has learned before starting a new round (i.e. before accepting a new value during the prepare phase). But I think this requires keeping the most-recently-committed 'value' too.

        Jonathan Ellis added a comment -

        One option could be to ensure that a quorum of node has learned before starting a new round

        Right. But I don't think keeping the last commit value fixes this for the general case, since a replica can miss arbitrarily many updates while the other two nodes are the quorum.

        We could hint the commit, but that doesn't help since the next CAS request could go to a different coordinator.

        So I think we'd have to hint to one of the other replicas, and have the replica include whether it has un-replayed hints for its peers in the promise (and disqualify the target from the quorum until it does get replayed).

        But that still leaves a window for all the replicas to become unavailable to the coordinator between (partial) learn, and hint generation.

        We could use a batchlog-like approach, where each peer generates a hint precursor that it will turn into an actual hint if it doesn't get the all-clear from the coordinator in time. (I think we might even be able to use the actual batchlog code here.) Performance might even be acceptable, judging from the batchlog precedent...

        This is definitely messy though. Better ideas?

        Sylvain Lebresne added a comment -

        But I don't think keeping the last commit value fixes this for the general case

        What I had in mind was that on the result of the prepare, we would wait for a quorum of replicas agreeing on the last MRC (i.e. a quorum has learned the last value) before allowing ourselves to proceed with our own value. Provided we always ensure that, we can ensure that for every column in a CAS, in any quorum at least one node has seen the last update for that column. However, if we don't do anything more, we may not be able to make progress, because if, for a quorum of 2, the answers we get from a prepare are say MRC=t2 from node A and some very old state from node B, then we can't proceed because we don't have a quorum agreeing on the t2 MRC, but we can't make progress either because we don't know how to bring B to MRC=t2. But if A had sent the most-recently-committed value corresponding to t2, then we could repair B (making it learn, basically) and then restart the algorithm (we can even optimize a bit by sending the repair in the same message as the new prepare). Note that B might have missed more than the last commit, but as long as we make sure a QUORUM learns the value of every round before proposing, we're good.

        Jonathan Ellis added a comment -

        Oh.

        Yeah, that's a lot simpler.

        Jonathan Ellis added a comment -

        Added code to catch up replicas missing the most recent commit, rebased against trunk, and pushed to https://github.com/jbellis/cassandra/commits/5062-3. (Refer to -2 if you need the earlier fixes broken into separate commits.)

        Sylvain Lebresne added a comment -

        I haven't finished reviewing yet, but since I'll have to stop for today, here's a bunch of early remarks.

        In SP.preparePaxos, since we throw if !mostRecentCommitHasQuorum and PrepareCallback only waits for a quorum of responses, I'm not sure replicasMissingMostRecentCommit can ever return anything other than empty later, and so I don't think we'll make progress. Furthermore, this means that we'll throw as soon as one node among the first quorum of responders is not up to date, which feels overly pessimistic.

        So I think we should remove the mostRecentCommitHasQuorum check in preparePaxos. But then, once we've received the prepare responses, we must still ensure that a quorum of nodes are on the MRC before moving on. So if we do have missingMRC > 0, we would do the commit as is now, but then we would restart the prepare phase, re-asking for promises. I think it is necessary to re-prepare after the commits anyway, because those commits will erase the paxos state on the replicas, so some of the promises we got are not valid anymore. Another observation is that for the "repair commit" of the missingMRC, we can (and should in fact) use the MRC as ballot. So we can optimize this commit-and-re-prepare by sending just one special message (and only to the missingMRC nodes).

        I agree, but on the other hand "just retry automatically" is something that we avoid elsewhere for good reason (arbitrarily high latency with no visibility to the client what is going on). I also considered introducing a new exception type explaining the state, but throwing an exception when everything is working fine is also not intuitive. So I think all the options kind of suck, but this one sucks least IMO.

        I disagree; I think that is the worst option, tbh. The fact is, in almost all cases, when a user wants to CAS something, he will need a "real" true or false, i.e. whether his CAS applies or not. Until then, an answer that says "sorry, but due to implementation details we haven't tested your CAS yet" is useless and the user will retry 100% of the time. So retrying server side seems to me like the right and most efficient option. Throwing a specific exception could work, I suppose, but in the end it's just pushing the retry to the client, and since there is nothing better to do than retry, this will increase latency and expose more implementation details with no benefit that I can see. But returning false in those cases means the user can't do anything with the false value. That's problematic because it means users will have to read to know if a false is a real false, or just "the algorithm hasn't won on the first round" (in which case they still have to retry).

        I do want to note that "the algorithm hasn't won on the first round" can happen for reasons completely orthogonal to the user's CAS, if only because the CAS is at the row level, and thus a CAS on a column c1 will "conflict" with one on column c2 (and in fact, due to the implementation of PaxosState, there are even some random cross-key conflicts). I.e. a false currently does not even guarantee that there was a real conflict between the user's CAS and some other CAS.

        Also, I don't think this is at all comparable to the other places where we don't retry automatically, so I don't buy that argument. Besides, CASSANDRA-4705 does implement some form of automatic retry, so there is prior art. But to be clear, I do not suggest that we should retry indefinitely; we should absolutely take the rpc timeout into account. I also think that "your CAS will take longer if there is contention" is something pretty intuitive to understand (even if, as said above, the contention can be slightly artificial in practice, due to implementation details).
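
        To make that concrete, a hedged sketch of the kind of retry loop being suggested; the names, the timeout handling and the backoff bounds are all illustrative, not what the branch does:

        import java.util.concurrent.ThreadLocalRandom;
        import java.util.concurrent.TimeUnit;

        // Server-side retry on contention: keep re-running Paxos rounds until the CAS is
        // actually decided (applied or not), giving up only when the rpc timeout expires.
        final class CasRetrySketch
        {
            enum RoundOutcome { APPLIED, NOT_APPLIED, LOST_ROUND } // LOST_ROUND = contention, not an answer

            interface CasRound { RoundOutcome run() throws Exception; }

            static boolean cas(CasRound round, long rpcTimeoutMillis) throws Exception
            {
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(rpcTimeoutMillis);
                while (System.nanoTime() < deadline)
                {
                    RoundOutcome outcome = round.run();
                    if (outcome == RoundOutcome.APPLIED)
                        return true;
                    if (outcome == RoundOutcome.NOT_APPLIED)
                        return false;
                    // We lost the round to another proposer: back off briefly and retry.
                    Thread.sleep(ThreadLocalRandom.current().nextLong(10, 100));
                }
                throw new Exception("CAS did not complete within the rpc timeout"); // stand-in for a timeout exception
            }
        }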

        Also:

        • Nit: In CFMetaData PaxosCF definition: "propsal" -> "proposal"
        • When getting the prepare response, I think that as soon as we get a response that is not promised, we can stop waiting for the other responses. I don't think we win anything by waiting.
        Sylvain Lebresne added a comment -

        A few other remarks on v3:

        • In SP.cas(), after having read the CASed values, I don't think Objects.equals() will work because of the timestamps. Which makes me remark that, in theory, having the API take a ColumnFamily as 'expected' is slightly weird, because we will ignore the timestamps (I don't suppose the intent was to take them into account). Not a huge deal (and not one we'll have on the CQL side), so probably not worth complicating said API, but it may be worth a comment.
        • Also, the timestamp of the columns we write during the commit should be the ballot timestamp. I don't see it done by the patch.
        • SystemTable.savePaxosCommit doesn't seem to delete in_progress_ballot (it deletes 'proposal' only).
        • In SystemTable.loadPaxosState, we must be careful that UntypedResultSet.Row doesn't handle null well (and Row.fromBytes doesn't either).
        • It's more of a nit, but I think there are a bunch of places where we could reuse the new Commit class. In particular, ProposeRequest is kind of a duplicate. But there are a few other places too (the propose and commit methods in PaxosState could take a Commit directly; in PrepareCallback, the inProgress ballot and update could be grouped; ...).
        • Also, the paxos state CF basically holds 2 pairs of ballot/update, the in_progress ones and the MRC ones. Could be nice to name them to reflect that symmetry (something like in_progress_ballot/in_progress_upd/most_recent_commit_ballot/most_recent_commit_upd)?
        • The patch has minor updates to the cassandra.in.sh file that should probably be reverted.

        For the record, I'm not a fan of having PaxosState share "buckets" (versus directly reading the SystemTable). I kind of understand the reasoning of keeping things in memory, but I can't help thinking that it's premature optimisation (it creates artificial contention and makes the code slightly harder to reason about imo). Anyway, as far as I can tell it's not incorrect, but I'm not totally sold on it.

        Also, I think we're lacking something around reads and around CL.SERIAL. I think we at least need to support CL.SERIAL reads, which would have to go through paxos (so they replay inProgress values). But I would also be a relatively big fan of allowing a (non-SERIAL) CL in cas() that would be applied to the final commit of the cas write. That way, if you have a successful cas call at CL.QUORUM, you can do a normal QUORUM read and be guaranteed to see your value, which is nice (of course, if the cas write actually times out, you'd have to do a SERIAL read to check the up-to-date state, but at least in the normal non-failing case you don't need SERIAL reads).

        Jonathan Ellis added a comment -

        Working on the code, but I don't think PaxosState introduces artificial contention in the sense that we do need to serialize paxos operations, so this is definitely better than One Big Lock. (I also like bounding the number of in-progress paxos states we have.)

        Jonathan Ellis added a comment -

        Also, I think we're lacking something around reads and around CL.SERIAL. I think we at least need to support CL.SERIAL reads, that would have to go through paxos (so they play inProgress values). But I would a relatively big fan to also allow a (non-SERIAL) CL in cas() that would be applied to the final commit of the cas write

        Yes, I think I mentioned both of these as "stuff that is missing." Would rather make them follow-on tickets.

        Jonathan Ellis added a comment - - edited

        In SP.preparePaxos, if we throw if !mostRecentCommitHasQuorum and since PrepareCallback only waits for quorum response

        We wait for as many nodes as were alive at the start of the paxos request.

        I think it is necessary to re-prepare after the commits anyway because those commit will erase the paxos state on the replica, so some of the promise we got are not valid anymore.

        We don't erase the promise, and we shouldn't, because that would allow us to accept an out-of-date propose (that got delayed in the intertubes from another proposer, for example – does not require buggy proposers).

        So retrying server side seems to me like the right and the most efficient one.

        All right, although pushing the side effect back to the client also makes it much less likely to race with a proposer that has interrupted us. Added some sleep(0.100) to attempt to make this less problematic.

        When getting the prepare response, I think that as soon as we get a response that is not promised, we can stop waiting for the other responses.

        Agreed, but the logic to do this gets kind of ugly. Don't think it's worth addressing.

        the timestamp of the columns we write during the commit should be the ballot timestamp

        What does this buy us? I don't think we need it to allow mixing CAS/non-CAS on a row, and it would make using non-clock-timestamps impossible.

        SystemTable.savePaxosCommit doesn't seem to delete in_progress_ballot

        As above, this is deliberate.

        there is a bunch of places where we could reuse the new Commit class

        I prefer to keep ballot/proposal separate except for the MRC since promise/accept are done at different times.

        Jonathan Ellis added a comment -

        Pushed updates to https://github.com/jbellis/cassandra/commits/5062-4 and created subtasks. (Rebased again to get CASSANDRA-3783.)

        Sylvain Lebresne added a comment -

        I don't think PaxosState introduces artificial contention in the sense that we do need to serialize paxos operations, so this is definitely better than One Big Lock

        Sorry, that wasn't clear: I was thinking of contention between proposers, not contention local to a replica. Because we bucket the paxos states, we will reuse the same state for different keys. So in other words, the current algorithm does not do Paxos at the row level, but rather Paxos at the level of all rows whose key's hashcode modulo 1024 is equal, and 2 proposers on 2 different row keys may compete with each other.

        And since proposer contention is so damn expensive, I wonder if that artificial contention won't trump any gain we get from keeping the paxos state in memory (in cases where people have a non-trivial load of CAS operations, even though they are on different row keys). Sure, we could play with the STATE_BUCKETS value, but I'm still of the opinion that it's premature optimization at this point.

        On the part about serializing Paxos operations on a given replica, I agree with you that it's necessary, and for that we could absolutely use lock buckets (like we used to do for 2ndary indexes) to optimize. But we can do that without bucketing the whole paxos states.

        As a side note, I think that we may want to TTL the Paxos state (we do store the MRC update after all; if someone has done a CAS on a row and doesn't do any CAS anymore, it would be sad to keep that MRC forever). And we can use a setting like gc_grace, since provided you guarantee that everything older than gc_grace is consistent, you can forget about states older than gc_grace. Of course, in practice a setting separate from gc_grace would make sense.

        Would rather make them follow-on tickets

        I'm fine with that. Just wanted to mention it because I don't think we should release CAS without those tickets done. Maybe we can create a top-level ticket to group all those tickets, making sure we don't forget something important?

        We wait for as many nodes as were alive at the start of the paxos request.

        Fair enough, I misread that part. But this in turn means that we'll time out as soon as any node that we thought was alive happens to be dead, even if we have a QUORUM of responses. And since the failure detector in general takes at least a handful of seconds to detect a dead node, it means that each time a node dies, all CAS operations will time out until at least one other node detects it as dead (and while it's true that currently some reads may time out when a node dies, it's at least not all reads, and besides, that's known to be bad for latency and is one of the reasons why CASSANDRA-4705 is useful). Overall I think we should make progress as soon as we have QUORUM answers.

        Furthermore, another problem is that we throw an exception if we don't have a quorum on the last MRC. But that means in turn that if, amongst all live nodes, we don't have a QUORUM of nodes that have learned that last MRC, we don't do anything to fix the issue (we throw an exception, in fact). So if we have really lost messages, until something external magically repairs the nodes that are missing the last MRC, we'll be in a state where no CAS operation on the row can proceed at all. Even believing that hinted handoff will be that external repair, which would assume that hinted handoffs are bullet-proof (and they are not), there could still be a reasonably long window of unavailability just because HH delivery is not prioritized/timely.

        Lastly, I note that it's quite possible not to get a QUORUM of nodes on the last MRC purely due to timing, because the last MRC commit may have reached some replicas but not all of them (it will even likely be common when proposers contend, because the one that gets to go 2nd will do its prepare phase more or less at the same time the previous one does its commit). And in that case, note that nothing is wrong with the cluster per se; all replicas may be alive. And thus throwing an UnavailableException in that case would imo be a very unfriendly user experience.

        So to recap, what I suggest instead would be to 1) only wait for QUORUM responses on prepare, then 2) if not everyone is on the same last MRC, "repair" those that aren't by committing said last MRC (with the MRC ballot in particular, not our ballot; this is not part of the Paxos algorithm per se, this is just us re-sending the learn message of the previous round to make sure a quorum has it) and then 3) restart from the beginning (i.e. re-prepare).
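
        A rough sketch of steps 1) to 3), with purely made-up names (prepareQuorum, sendCommit); the actual flow in the branch may differ:

        import java.util.ArrayList;
        import java.util.List;

        // Coordinator loop for the suggestion above: wait only for a quorum of prepare
        // responses; if some of those responders miss the most recent commit, re-send
        // that commit (with the MRC's own ballot) and start the prepare phase again.
        final class PrepareRepairSketch
        {
            static final class Response
            {
                final String replica;
                final long mostRecentCommittedBallot;
                final String mostRecentCommittedValue;
                Response(String replica, long ballot, String value)
                {
                    this.replica = replica;
                    this.mostRecentCommittedBallot = ballot;
                    this.mostRecentCommittedValue = value;
                }
            }

            interface Messaging
            {
                List<Response> prepareQuorum(long ballot);                  // returns quorum-many responses
                void sendCommit(String replica, long ballot, String value); // re-send a learn message
            }

            static List<Response> prepareUntilQuorumHasLearned(Messaging ms, long ballot)
            {
                while (true)
                {
                    List<Response> responses = ms.prepareQuorum(ballot);

                    // Find the highest most-recently-committed round among the responders.
                    long maxMrc = 0;
                    String maxValue = null;
                    for (Response r : responses)
                    {
                        if (r.mostRecentCommittedBallot > maxMrc)
                        {
                            maxMrc = r.mostRecentCommittedBallot;
                            maxValue = r.mostRecentCommittedValue;
                        }
                    }

                    List<String> behind = new ArrayList<>();
                    for (Response r : responses)
                        if (r.mostRecentCommittedBallot < maxMrc)
                            behind.add(r.replica);

                    if (behind.isEmpty())
                        return responses; // a quorum has learned the last committed value: safe to propose

                    for (String replica : behind)
                        ms.sendCommit(replica, maxMrc, maxValue); // repair with the MRC ballot, not ours
                    // then re-prepare (loop), since the repaired replicas' state has changed
                }
            }
        }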

        I believe this does not suffer from the problems above: we prepare, and if the first quorum we get hasn't got the last MRC yet (which can be due to timing or to some lost message; we don't know and shouldn't assume either one), we send it to them and start over again.

        Then we can probably optimize a bit: since we know, when we "repair" a replica, that we're going to re-prepare just afterwards, we could do both in one message. This does complicate the logic a bit though, and since that's just an optimization, it could clearly be moved to a follow-up ticket.

        We don't erase the promise, and we shouldn't, because that would allow us to accept an out-of-date propose

        I don't think that matters, because we have the MRC to avoid accepting out-of-date proposals (and the logic already does it). In all fairness, keeping the promise is not incorrect either; it's just that I think it's useless to keep around.

        Agreed, but the logic to do this gets kind of ugly

        Can't we just say in PrepareCallback that if promised != true for a response, then we do a 'while (latch.getCount() > 0) latch.countDown()'? Without pretending this is the most elegant code ever, this seems simple enough to be worth it to me.
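
        In code, the idea would be something along these lines (a standalone sketch, not the actual PrepareCallback):

        import java.util.concurrent.CountDownLatch;

        // A prepare callback that normally counts down once per response, but drains the
        // latch entirely as soon as one response is not a promise, so the waiter stops blocking.
        final class EarlyExitPrepareCallback
        {
            private final CountDownLatch latch;
            private volatile boolean promised = true;

            EarlyExitPrepareCallback(int expectedResponses)
            {
                latch = new CountDownLatch(expectedResponses);
            }

            void onResponse(boolean responsePromised)
            {
                if (!responsePromised)
                {
                    promised = false;
                    while (latch.getCount() > 0)  // release the waiting thread immediately
                        latch.countDown();
                    return;
                }
                latch.countDown();
            }

            boolean awaitAndCheck() throws InterruptedException
            {
                latch.await();
                return promised;
            }
        }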

        What does this buy us?

        I'd say correctness. If we bother with Paxos to serialize our commits, I think we want to guarantee that column timestamp resolution won't break that order. If I CAS 0->1 and 1->2 more or less simultaneously, then if the 2nd one returns true, it means it has been serialized after the first one, so I should be guaranteed not to get 1 when I read; but I could, if due to clock skew or otherwise the value 2 gets a timestamp lower than 1's.
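
        As a sketch of what that would mean in practice, assuming ballots stay version-1 UUIDs (the offset constant below is the standard gap between the UUID epoch and the Unix epoch):

        import java.util.UUID;

        // Derive the column timestamp from the Paxos ballot, so that timestamp resolution
        // can never reorder two CAS commits differently from the order Paxos decided.
        final class BallotTimestampSketch
        {
            // Number of 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
            private static final long UUID_EPOCH_OFFSET_100NS = 0x01b21dd213814000L;

            // Microseconds since the Unix epoch encoded in a version-1 (time-based) UUID ballot.
            static long microsFromBallot(UUID ballot)
            {
                return (ballot.timestamp() - UUID_EPOCH_OFFSET_100NS) / 10;
            }
            // Every column written by the commit would then use this value as its timestamp.
        }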

        I prefer to keep ballot/proposal separate except for the MRC since promise/accept are done at different times

        My reasoning is that the high-level view of the algorithm is: you prepare with some commit ballot to see the state of the algorithm, then you propose a commit, and commit it if it's accepted. So I still think using Commit for propose and commit would be cleaner, and it would merge Commit and ProposeRequest, which are really the same thing. Anyway, it's a detail I suppose.

        Show
        Sylvain Lebresne added a comment - I don't think PaxosState introduces artificial contention in the sense that we do need to serialize paxos operations, so this is definitely better than One Big Lock Sorry that wasn't clear, I was think of contention between proposers, not contention local to a replica. Because we bucket the paxos states, we will reuse the same state for different keys. So in other words, the current algorithm does not do Paxos at the row level, but rather paxos at the level of all rows whose key's hashcode modulo 1024 is equal, and 2 proposers on 2 different row keys may compete with each other. And since proposer contention is so damn expensive, I wonder if that artificial contention won't trump any gain we have from keeping the paxos state in memory (in cases where people have a non-trivial load of CAS operation, even though it is on different row keys). Sure we could play with the STATE_BUCKETS value, but I'm still of the opinion that it's premature optimization at this point. On the part about serializing Paxos operations on a given replica, I agree with you that it's necessary, and for that we could absolutely use lock buckets (like we use to do for 2ndary indexes) to optimize. But we can do that without bucketing the whole paxos states. As a side note, I think that we may want to TTL the Paxos state (we do store the MRC update after all, if someone has done an CAS on a row and don't do any CAS anymore, it would be sad to keep that MRC forever). And we can use a setting like gc_grace, since provided you do guarantee that everything older than gc_grace is consistent, then you can forget about older than gc_grace states. Of course, in practice a setting separated from gc_grace would make sense. Would rather make them follow-on tickets I'm fine with that. Just wanted to mention it cause I don't think we should release CAS without those tickets done. Maybe we can create a top-level ticket to group all those tickets, making sure we don't forget something important? We wait for as many nodes as were alive at the start of the paxos request. Fair enough, I misread that part. But , this in turn means that we'll timeout as soon as any node that we though were alive happens to be dead, even if we have a QUORUM of responses. And since the failure detector takes in general at least a handful of seconds to detect a node, it means that each time a node dies, all CAS operations will timeout until at least one other node detects it dead (and while it's true that currently some reads may timeout when a node dies, this is at least not all reads and besides, it's known to be bad for latency and one of the reason why CASSANDRA-4705 is useful). Overall I think we should make progress as soon as we have QUORUM answers. Furthermore, another problem is that we throw an exception if we haven't a quorum on the last MRC. But that means in turns that if amongst all live nodes we don't have a QUORUM of nodes that have learn that last MRC, we don't do anything to fix the issue (we throw an exception in fact). So if we have really lost messages, until something external magically repair the nodes that are missing the last MRC, we'll be in a state when no CAS operation on the row can proceed at all. Even believing that hinted handoff will be that external repair, which would assume that hinted handoffs are bullet-proof (and they are not), there could still be a reasonably long window of unavailability just because HH delivery is not prioritized/timely. 
Lastly, I note that it's not unlikely not to get a QUORUM of node on the last MRC due to timing, because the last MRC commit may have reach some replica but not all of them (it will even likely be common when proposers contend, because the one that gets to be 2nd will do it's prepare phase more or less at the same time the previous one does his commit). And in that case, note that nothing is wrong with the cluster per se, all replica may be alive. And thus throwing an UnavailableException in that case would imo be a very unfriendly user experience. So to recall, what I suggest instead would be to 1) only wait for QUORUM responses on prepare then 2) if not everyone is on the same last MRC, then "repair" those are aren't by committing said last MRC (with the MRC ballot in particular, not our ballot. This is not part of the Paxos algorithm per-se, this is just us re-sending the learn message of the previous round to make sure a quorum has it) and then 3) restart from the beginning (i.e. re-prepare). I believe this does not suffer from the problems above: we prepare, and if the first quorum we get hasn't got the last MRC yet (which can be due to timing or because of some lost message, we don't know but shouldn't assume either one), we sent it to them and we start over again. Then we can probably optimize a bit, since we know when we "repair" the replica that we're going to re-prepare just afterwards, we could do both in one message. This does complicate the logic a bit though and since that's just an optimization, this could clearly be moved to a follow-up ticket. We don't erase the promise, and we shouldn't, because that would allow us to accept an out-of-date propose I don't think that matters because we have the MRC to avoid accepting out-of-date proposal (and the logic already does it). In all fairness, keeping the promise is not un-correct either, it's just that I think it's useless to keep around. Agreed, but the logic to do this gets kind of ugly Can't we just say in PrepareCallback that if promised != true for a response, then we do a 'while (latch.getCount() > 0) latch.countDown()'? Without pretending this is the most elegant code ever, this seems simple enough to be worth it to me. What does this buy us? I'd say correctness . I we bother with Paxos to serialize our commits, I think we want to guarantee that the column timestamp resolution won't break that order. If I CAS more or less simultaneously 0->1 and 1->2, then if the 2nd one return true, it means it has been serialized after the first one, so I should be guaranteed to not get 1 when I read, but I could if due to clock skew or other the value 2 gets a timestamp lower than 1. I prefer to keep ballot/proposal separate except for the MRC since promise/accept are done at different times My reasoning is that the high-level view of the algorithm is: you prepare with some commmit ballot to see the state of the algorithm, and then you propose a commit, and commit it if it's accepted. So I still think using Commit for propose and commit would be cleaner and merge Commit and ProposeRequest which are really the same. Anyway, it's a detail I suppose.
        Jonathan Ellis added a comment -

        in other words, the current algorithm does not do Paxos at the row level, but rather paxos at the level of all rows whose key's hashcode modulo 1024 is equal, and 2 proposers on 2 different row keys may compete with each other.

        Agreed, but what are you proposing as an alternative? Allocating a lock per row would be madness, and attempting to write it without locks looks very difficult.

        (I will note that the contention is probably not that bad, since the two rows would have to hash to the same lock across a majority of replicas, which in a vnode world is unlikely.)

        this in turn means that we'll timeout as soon as any node that we thought was alive happens to be dead, even if we have a QUORUM of responses

        Not so – we will wait for it, but when it does not reply we will be able to continue as long as the nodes that did reply constitute a quorum. (I.e., we check responseCount vs requiredParticipants, not vs endpoints.size().)
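
        A minimal sketch of that waiting discipline (hypothetical names, not the actual callback class): the latch is sized to every endpoint we messaged, but once it times out we only require that the replies we did get form a quorum, i.e. responseCount is compared against requiredParticipants rather than endpoints.size().

        import java.util.concurrent.CountDownLatch;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        class PrepareWaitSketch
        {
            private final CountDownLatch latch;
            private final AtomicInteger responses = new AtomicInteger();
            private final int requiredParticipants;

            PrepareWaitSketch(int endpointsMessaged, int requiredParticipants)
            {
                this.latch = new CountDownLatch(endpointsMessaged);
                this.requiredParticipants = requiredParticipants;
            }

            // invoked once per replica reply
            public void response()
            {
                responses.incrementAndGet();
                latch.countDown();
            }

            // wait for everyone we messaged, but after the timeout a quorum of replies is enough
            public boolean awaitQuorum(long timeout, TimeUnit unit) throws InterruptedException
            {
                latch.await(timeout, unit);
                return responses.get() >= requiredParticipants;
            }
        }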

        Can't we just say in PrepareCallback that if promised != true for a response, then we do a 'while (latch.getCount() > 0) latch.countDown()'?

        Sure, but then we have to add special cases back up to avoid throwing UAE, or move the promise check into PrepareCallback and have it throw a control-flow exception, neither of which appeals to me.

        I think we want to guarantee that the column timestamp resolution won't break that order

        You're right, this could get ugly with HH, RR, etc.

        Sylvain Lebresne added a comment -

        Allocating a lock per row would be madness

        Again, I'm not talking about locks here. We can use one true paxos state per row (which is the alternative I'm suggesting) but still use a 'hashcode % 1024' number of locks for actually protecting the read/write to the paxos system table.

        What I am suggesting is that we wouldn't keep paxos states in memory, i.e. we would always access the system table (and no, I don't think this would necessarily destroy performance; some of the reasons are in a comment above). More importantly, I think we shouldn't optimize this before having tested it, since it changes the level at which the algorithm applies, and that's not nothing.

        we will wait for it, but when it does not reply we will be able to continue as long as the nodes that did reply constitute a quorum

        Alright, I didn't see that we weren't throwing a timeout in that case. But we'll still wait rpcTimeout (10 seconds by default!), so I don't think it's much better. In particular because I really think that ultimately we should respect the overall write rpc timeout for CAS operations (not suggesting we should do it now, I'm fine with that being a follow-up).

        Besides, there are still the two other problems I've mentioned.

        then we have to add special cases back up to avoid throwing UAE

        Granted, but I'm (strongly) convinced that throwing this UAE is wrong in the first place. Provided checking 'callback.isPromised' is the first thing we do, which I believe it should be, there is no special casing to do.

        Jonathan Ellis added a comment -

        We can use one true paxos state per row (which is the alternative I'm suggesting) but still use a 'hashcode % 1024' number of locks for actually protecting the read/write to the paxos system table.

        That doesn't solve the problem I'm thinking of. Suppose Proposer1 (P1) asks for a promise of ballot 1 (B1). Then P2 asks for a promise of B2. Meanwhile, P1 proposes (B1, V1). If the PaxosState.propose is not locked vs prepare for that row, then it could promise B2 but subsequently accept (B1, V1) which is a violation of the protocol.
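
        For illustration, a minimal sketch of the per-row serialization this relies on (hypothetical names; ballots shown as plain longs rather than TimeUUIDs for brevity). With prepare and propose synchronized on the same per-row state, once B2 has been promised the stale proposal (B1, V1) is refused instead of accepted.

        class RowPaxosStateSketch
        {
            private long promised = Long.MIN_VALUE;       // highest ballot promised so far
            private long acceptedBallot = Long.MIN_VALUE; // ballot of the accepted proposal, if any
            private Object acceptedValue;

            public synchronized boolean prepare(long ballot)
            {
                if (ballot <= promised)
                    return false;                         // already promised a newer ballot
                promised = ballot;
                return true;
            }

            public synchronized boolean propose(long ballot, Object value)
            {
                if (ballot < promised)
                    return false;                         // e.g. (B1, V1) arriving after B2 was promised
                acceptedBallot = ballot;
                acceptedValue = value;
                return true;
            }
        }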

        we'll still wait rpcTimeout (10 seconds by default!)

        I'm okay with this since nodes going down mid-request before FD notices is a pretty rare case.

        I further note that anything that starts with "only wait for quorum" [even if there are other live nodes around] means that it's more likely that the replicas we don't wait for will stay out of date longer than they "should" for the MRC check.

        So I see this as two alternatives, each with drawbacks, rather than one strictly better than the other. And I prefer the alternative that already has code written.

        I'm (strongly) convinced that throwing this UAE is wrong in the first place

        Huh? If too many replicas are down and we don't get a promise, your logic says we should retry with a new (higher) ballot, in which case we will get UAE almost all the time, since few node failure modes will result in a recovery in under 100ms.

        Jonathan Ellis added a comment -

        (pushed a commit to make paxos proposals always use the timestamp from the ballot.)
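
        For illustration, a sketch of deriving the microsecond timestamp from a time-based (version 1) ballot UUID, so a proposal carries the ballot's time rather than the local clock; the constant is the offset, in 100ns units, between the UUID epoch (1582-10-15) and the Unix epoch. (Presumably the actual code goes through the existing UUID helpers; this is just the underlying arithmetic.)

        import java.util.UUID;

        final class BallotTimestamps
        {
            private static final long UUID_EPOCH_OFFSET_100NS = 0x01b21dd213814000L;

            // microseconds since the Unix epoch, taken from the ballot itself
            static long timestampMicros(UUID ballot)
            {
                long hundredNanosSinceUuidEpoch = ballot.timestamp(); // version 1 UUIDs only
                return (hundredNanosSinceUuidEpoch - UUID_EPOCH_OFFSET_100NS) / 10;
            }
        }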

        Sylvain Lebresne added a comment -

        If the PaxosState.propose is not locked vs prepare for that row

        I'm lost. I totally agree with you that we should lock propose vs prepare.

        But we could have an array of 1024 Lock objects somewhere, and both propose and prepare for key k would do:

        // assuming java.util.concurrent.locks.Lock; mask the hashCode so a
        // negative value still maps to a valid bucket
        Lock lock = locks[(k.hashCode() & Integer.MAX_VALUE) % 1024];
        lock.lock();
        try {
            // do whatever you should
        } finally {
            lock.unlock();
        }
        

        I.e. exactly what we used to do to protect 2ndary index writes.

        I'm not criticizing the fact that propose and prepare are synchronized together; they should be. And yes, having 1 lock per row would be crazy, so I'm fine having just a fixed number of locks, because I'm not worried about local contention on the PaxosState propose and prepare methods.

        The contention I'm more worried about is that because we may use the same PaxosState object for, say, rows k1 and k2, a proposer on k1 will contend with a proposer on k2, even though those are different rows. And that contention might mean one of the proposers will get pre-empted during propose and will have to restart from scratch, which is costly, because that's a bunch of round-trip messages wasted.

        To take a step back, what I'm saying is that a priori, the way I would have coded this would have been to just use the system table for paxos states, without trying to "cache" them in memory manually (which is the reason why we need to bucket them, so it doesn't get crazy), and just trust our storage engine initially. And we would use the locks to protect the propose and prepare phases as described above.

        Now I'm not saying that doing what I've just described would necessarily be more efficient, but at least it's not absolutely clear to me that the current patch is always a win. What it trades is slightly faster local reads (and since we do read-by-name, even if we were to read the system table directly we wouldn't hit disk most of the time, so it's not totally clear how much faster it is) against higher contention at the paxos level (which, as said above, is potentially orders of magnitude more costly). Again, I don't know what's best, but since it feels to me that having one paxos state per row is more "natural", I feel the current way of keeping paxos states in memory is a bit of a premature optimisation, and thus I would have preferred keeping that for later if we realise there is a need to optimise.

        it's more likely that the replicas we don't wait for will stay out of date longer than they "should" for the MRC check

        First, that wouldn't matter for the correctness of the overall algorithm, so at this point I would say that it's not a big deal, at best a performance argument. Second, we can absolutely do what we do for read repair, i.e. when we receive responses after the initial QUORUM, we repair in the background if needed. We have prior art and that's not very hard to do.

        So I see this as two alternatives, each with drawbacks, rather than one strictly better than the other

        I couldn't disagree more. Unless I misread the code (in which case I'm sorry), the current patch can get into a state where no CAS operation on a given row can be done at all (they will all throw UAE) until the user does a manual repair, because if less than a quorum of replicas have received the last MRC commit message, there is no mechanism to heal that situation (since we throw UAE before we could heal replicas if, say, only a quorum of nodes are alive but less than a quorum have the last MRC).

        Also, again unless I read this wrong, the current patch may throw UAE exceptions even when no replicas in the cluster are dead and there is no real reason to throw (again, in the case where less than a quorum of the replicas that respond have the MRC, which can happen even during normal operation with no errors).

        And I think that both of those (not combined, each of them) are big deals. Blockers in fact.

        And I prefer the alternative that already has code written.

        So to be clear, the "alternative" I'm suggesting is a few lines of changes. I'm happy to provide a commit with said change if it's not very clear what I'm talking about.

        If too many replicas are down and we don't get a promise, your logic says we should retry with a new (higher) ballot

        Absolutely not. I have no problem with throwing a UAE if too many replicas are down; what else could we do? I have a problem with throwing a UAE if enough replicas (i.e. a quorum) are up but not enough have the last MRC.

        Jonathan Ellis added a comment -

        it feels to me that having one paxos state per row is more "natural"

        All right. Done and pushed.

        I think that we may want to TTL the Paxos state

        Hmm, not sure where to fit this in. We store gcgs in the schema, but system tables are not user-modifiable (and even if they were, this one is node-local). I guess we could add a global config setting, which doesn't thrill me.

        the current patch can get into a state where no CAS operation on a given row can be done at all (they will all throw UAE) until the user does a manual repair, because if less than a quorum of replicas have received the last MRC commit message

        I must be missing something, because that's by design – this sounds exactly like what we discussed to preserve correctness. "On result of the prepare, we would wait for a quorum of people agreeing on the last MRC (i.e. a quorum has learn the last value) before allowing to proceed with our own value."

        Sylvain Lebresne added a comment -

        this sounds exactly like what we discussed to preserve correctness

        Not exactly. What we need to preserve correctness is to make sure we don't start a new round (i.e. we don't propose our own value) until we can guarantee that a quorum of replicas have learned about the last MRC. And that is correctly done by this patch; I'm not contesting that. However, the patch equates "not starting a new round" with "throwing a UAE", which is what I have a problem with.

        Let me put it another way: the reason why I suggested storing the mostRecentUpdate is so that if we get into a state where less than a quorum of replicas have learned the last commit due to some errors, then we can repair them by sending them that mostRecentUpdate (after which we'd be able to make progress again). But that's not what the current patch does. The only time the current patch uses the mostRecentUpdate is when we have already validated that a quorum of nodes have the MRC. In other words, it uses the mostRecentUpdate only as a slight optimisation, but not to avoid the algorithm getting stuck, which should be the main goal (the only goal, really, imo).

        Anyway, I probably haven't been clear on what the alternative I'm suggesting is, so to clear that up I've pushed the patch at https://github.com/pcmanus/cassandra/commits/5062-5 (on top of the last v4). There are 4 commits in fact, but the important one is only the first. As you'll see, this still respects the invariant that "we don't start a new round until we can guarantee that a quorum of replicas have learned about the last MRC", but if we can't prove that we have such a quorum, we repair nodes and move on with the algorithm instead of throwing an exception.
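
        A rough sketch of that control flow (hypothetical names, not the actual patch; retries are unbounded and back-off/timeout accounting is omitted for brevity): prepare, and if fewer than a quorum of the replies carry the most recent commit, re-send that commit to the replicas that missed it and start over instead of throwing.

        import java.nio.ByteBuffer;
        import java.util.List;
        import java.util.UUID;

        abstract class PaxosRoundSketch
        {
            static final class PrepareSummary
            {
                boolean promised;                  // no newer ballot was already promised
                boolean quorumHasMostRecentCommit; // a quorum of replies carried the last MRC
                Object mostRecentCommit;           // the last commit we learned about
                List<Object> replicasMissingMrc;   // repliers that had not learned it yet
            }

            abstract UUID createBallot();
            abstract PrepareSummary prepare(ByteBuffer key, UUID ballot); // waits for a quorum of replies
            abstract void resendCommit(Object mostRecentCommit, List<Object> to);

            UUID beginRound(ByteBuffer key)
            {
                while (true)
                {
                    UUID ballot = createBallot();
                    PrepareSummary summary = prepare(key, ballot);

                    if (!summary.promised)
                        continue;                  // pre-empted by a newer ballot: retry

                    if (!summary.quorumHasMostRecentCommit)
                    {
                        // Not Paxos proper: re-send the learn message of the previous
                        // round (with its own ballot, not ours) to the replicas that
                        // missed it, then start over with a fresh prepare.
                        resendCommit(summary.mostRecentCommit, summary.replicasMissingMrc);
                        continue;
                    }

                    return ballot;                 // safe to propose our own value
                }
            }
        }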

        The 2nd commit just does the "let's skip waiting on responses once we get a 'not promised'" part, since that's trivial on top of the preceding patch. As for the last two, I wanted to try my "use the Commit class more generally" idea. After having done it, I do think it makes the code slightly more readable, if only because it allows encapsulating a few behaviours better in the Commit class, because it avoids duplicating serialization code too much, and because it removes PrepareRequest and ProposeRequest, which are kind of uninteresting (and I don't find that grouping things even for prepare is really worth it). But I'll agree to disagree if you still think it's a bad idea.

        Hmm, not sure where to fit this in.

        Imo, the best option would be to have a 'paxos ttl' setting in the user CFs, like we have gc_grace. Then each write to the PaxosCF would use the value for the CF it is an update of (i.e. not all rows in the PaxosCF would have the same TTL if you don't use the same TTL in all of your CFs, but that's ok).

        One small concern might be that currently we share the same Paxos state for rows belonging to different CFs if they share the same key. But maybe that just means that we should stop doing that and include the cfId in the paxos state row key. After all, the API doesn't allow doing a CAS that spans multiple CFs anyway, and I'm not convinced twisting the API to allow it is worth the trouble.

        Sylvain Lebresne added a comment -

        Nit: Just learned about http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/util/concurrent/Striped.html . Could replace our array of locks.
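
        For comparison, a sketch of what the Striped variant would look like (assumes Guava 13+ on the classpath; the wrapper class is hypothetical). It is functionally equivalent to the hand-rolled array of locks above.

        import java.nio.ByteBuffer;
        import java.util.concurrent.locks.Lock;
        import com.google.common.util.concurrent.Striped;

        final class PaxosLocksSketch
        {
            private static final Striped<Lock> LOCKS = Striped.lock(1024);

            static void withLock(ByteBuffer key, Runnable body)
            {
                Lock lock = LOCKS.get(key); // the same key always maps to the same stripe
                lock.lock();
                try
                {
                    body.run();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
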
        Jonathan Ellis added a comment -

        If we're not going to wait for everyone we sent a message to, should we allow propose/commit to accept ballots newer than the one it has promised, on the theory that our proposer won't do anything nefarious, so a newer ballot must mean that we raced with a pending prepare?

        Jonathan Ellis added a comment -

        (Re Striped, in this simple case I'd rather have the syntactic sugar of synchronized and roll my own lock array. Why didn't java7 make locks auto-closeable?)

        Jonathan Ellis added a comment -

        I was going to write that I'm still not a fan of the "everything is a Commit" approach, because the duplicated information in PaxosState and PrepareResponse matches the semantics of what we are doing poorly: it appears to imply that different responses could be for different keys, or even have different keys for the two Commits in PaxosState (or PrepareResponse). It also obfuscates that prepare only cares about the ballot.

        But then I started adding cfid and we have the same problem at the ColumnFamily level. So screw it, I welcome my new Commit overlord. But I'm adding some asserts to make myself feel better.
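
        For illustration, the kind of sanity assert meant here (names are illustrative, not the actual classes): the two Commits carried by a single response must refer to the same partition key.

        import java.nio.ByteBuffer;
        import java.util.UUID;

        final class PrepareResponseSketch
        {
            static final class CommitSketch
            {
                final ByteBuffer key;
                final UUID ballot;
                CommitSketch(ByteBuffer key, UUID ballot) { this.key = key; this.ballot = ballot; }
            }

            final CommitSketch inProgress;
            final CommitSketch mostRecentCommit;

            PrepareResponseSketch(CommitSketch inProgress, CommitSketch mostRecentCommit)
            {
                // both halves of one response should never disagree on the key
                assert inProgress.key.equals(mostRecentCommit.key) : "mismatched keys in one response";
                this.inProgress = inProgress;
                this.mostRecentCommit = mostRecentCommit;
            }
        }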

        Pushed cfid incorporation and TTL code to https://github.com/pcmanus/cassandra/commits/5062-5.

        (I think to start with using gcgs for TTL is fine.)

        Sylvain Lebresne added a comment -

        should we allow propose/commit to accept ballots newer than the one it has promised

        That comment made me realize that at least for the commit part, my patch was assuming that we were pretty much always accepting commits, so it was slightly broken. So I've pushed one more commit on the branch to fix that. Hopefully that makes sense. I note that I just saw your last push and that last commit is not on top of that but rather on top of my earlier v5. Maybe you can just cherry-pick that last commit?

        For the propose, I think accepting a newer-than-promised ballot would be ok, yes. The promise made during prepare is not to accept anything older than the ballot we promised, but accepting something newer should be fine.

        Sylvain Lebresne added a comment -

        I think to start with using gcgs for TTL is fine

        I thought of that, but I'm slightly afraid of people using gcgs == 0 on a CF because they are not doing any deletes on that CF (but may have a fixed TTL on everything, which is a valid reason to use gcgs == 0). I'd be fine spawning a follow-up ticket to create a separate setting, however.

        Jonathan Ellis added a comment -

        Cherry-picked, and made eraseInProgressProposal actually generate different cql.

        Also I forgot to use -p so the proposal accept change is in the same commit.

        Created CASSANDRA-5451 for TTL config.

        Sylvain Lebresne added a comment -

        I guess I'm out of nitpicks for this part, +1.

        Well, maybe just one last nit for the road: the equals and hashCode in Commit are missing @Override.

        Sylvain Lebresne added a comment - edited

        Actually my bad, I should have run the tests before, but there are a few unit test failures. At least ColumnFamilyStoreTest is failing because it uses ColumnFamily.isEmpty, which has been changed by the patch to include the deletionInfo. I'm fine just updating the test to use getColumnCount() instead, but it would be nice to do a quick check of all the current uses of ColumnFamily.isEmpty to make sure this doesn't break anything else (and we should update the comment of the isEmpty method).

        I also get

            [junit] Testcase: testRowMutationRead(org.apache.cassandra.db.SerializationsTest):	Caused an ERROR
            [junit] addr is of illegal length
            [junit] java.net.UnknownHostException: addr is of illegal length
            [junit] 	at java.net.InetAddress.getByAddress(InetAddress.java:935)
            [junit] 	at java.net.InetAddress.getByAddress(InetAddress.java:1318)
            [junit] 	at org.apache.cassandra.net.CompactEndpointSerializationHelper.deserialize(CompactEndpointSerializationHelper.java:38)
            [junit] 	at org.apache.cassandra.net.MessageIn.read(MessageIn.java:62)
            [junit] 	at org.apache.cassandra.db.SerializationsTest.testRowMutationRead(SerializationsTest.java:263)
        

        My initial reaction was to say that this is unrelated to this ticket, but the test passes on trunk. Maybe the patch just needs to be rebased?

        Jonathan Ellis added a comment -

        I ended up inlining the old isEmpty and creating a new one for CAS to use. Also AbstractSSTableSimpleWriter should be using the "new" isEmpty. The rest I left alone.

        SerializationsTest was failing because I'd inadvertently removed the MUTATION entry for MS.verbSerializers. (So MessageIn said, oh, serializer == null? Here's an empty Message for you, and the next Message in the test started reading partway through the last one.)

        Fixed and committed!

        Jonathan Ellis added a comment -

        Subtasks complete; declaring victory.


          People

          • Assignee:
            Jonathan Ellis
          • Reporter:
            Jonathan Ellis
          • Votes:
            5
          • Watchers:
            33
