Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      An atomic test-and-set insert operation would be nice: "set value to X, but only if the current value is still Y." This allows a sort of optimistic consistency: perform a GET, then perform a test-and-set with the value of that GET as Y.

      I do not think that this requires strong consistency to be useful.
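
      A minimal client-side sketch of the optimistic loop described above (the get/testAndSet client methods are hypothetical, for illustration only; this is not an existing API):

      // Hypothetical client interface, for illustration only.
      interface KvClient {
          String get(String key);
          // Set key to newValue only if the stored value still equals expected.
          // Returns true if the write was applied.
          boolean testAndSet(String key, String expected, String newValue);
      }

      class OptimisticUpdate {
          // Retry the read-modify-write until the test-and-set is accepted.
          static void increment(KvClient client, String key) {
              while (true) {
                  String current = client.get(key);                    // GET
                  String next = String.valueOf(Long.parseLong(current) + 1);
                  if (client.testAndSet(key, current, next)) {         // "set to X only if still Y"
                      return;
                  }
                  // The value changed between the GET and the T&S; re-read and retry.
              }
          }
      }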

        Activity

        sandeep_tata Sandeep Tata added a comment -

        This'd be great, except we probably want to condition on the timestamp. i.e.,

        Get X
        Put X+1 if latest timestamp hasn't changed

        That makes sure Cassandra doesn't have to do a bit-by-bit compare on an arbitrarily large value; it just compares timestamps.
        This is the approach that PNUTS takes. See http://research.yahoo.com/files/pnuts.pdf
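
        A rough sketch of the replica-side check if we condition on the timestamp instead of the value (class and field names here are made up, not actual Cassandra internals):

        // Illustrative only: accept the put only if the column's timestamp is still the
        // one the client read, so the replica compares two longs instead of doing a
        // byte-by-byte compare of an arbitrarily large value.
        class TimestampConditionedStore {
            static final class Column {
                final byte[] value;
                final long timestamp;
                Column(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
            }

            private final java.util.Map<String, Column> columns = new java.util.HashMap<>();

            // Returns true if the put was applied.
            synchronized boolean putIfTimestampMatches(String key, byte[] newValue,
                                                       long expectedTimestamp, long newTimestamp) {
                Column current = columns.get(key);
                long currentTs = (current == null) ? -1L : current.timestamp;
                if (currentTs != expectedTimestamp) {
                    return false;   // a newer version was written since the client's GET
                }
                columns.put(key, new Column(newValue, newTimestamp));
                return true;
            }
        }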

        junrao Jun Rao added a comment -

        We have to be careful about the semantics of test-and-set insert. What happens if the condition is met at some, but not all, replicas? If we want to fail the insert in that case, we need to be able to roll back the insert at the replicas where it succeeded.

        jbellis Jonathan Ellis added a comment -

        Sandeep: you're right, timestamp is better here.

        Jun: I don't think we need to do anything special – we already have code to handle saying "I want to block for N replicas before calling the write a success." That could work here unchanged.

        junrao Jun Rao added a comment -

        So you block until all N replicas reply and return success as long as 1 reply says the insert is successful?

        jbellis Jonathan Ellis added a comment -

        No, you return failure unless N succeed. (So obviously this can reduce availability if you are setting N = replication count. Normally setting it to a simple majority is sufficient.)
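
        Roughly the rule being described, as coordinator-side pseudocode (the types here are stand-ins, not the real write path; a real implementation could return as soon as enough replies arrive):

        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutionException;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Future;

        // Sketch: send the conditional write to every replica, block for the replies,
        // and call it a success only if at least `required` replicas accepted it
        // (e.g. required = replicationFactor / 2 + 1 for a simple majority).
        class ConditionalWriteCoordinator {
            static boolean write(ExecutorService pool, List<Callable<Boolean>> replicaPuts, int required)
                    throws InterruptedException {
                int accepted = 0;
                for (Future<Boolean> reply : pool.invokeAll(replicaPuts)) {   // blocks for all replies
                    try {
                        if (reply.get()) {
                            accepted++;
                        }
                    } catch (ExecutionException e) {
                        // a replica that errored or rejected the condition doesn't count
                    }
                }
                return accepted >= required;   // otherwise report failure to the client
            }
        }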

        junrao Jun Rao added a comment -

        That's where the complexity comes in. Say N=3 and you insert to nodes A, B, and C. The insert succeeds at A but fails at B and C. You want to fail the insert, but it has already succeeded at A, so now you have to roll back the insert made at A, which means you need to remember the previous version at A. You end up doing some sort of 2-phase commit.

        jbellis Jonathan Ellis added a comment -

        No, that's up to the client. It may wish to retry the write, or try to write back the old value.

        junrao Jun Rao added a comment -

        A common case for a client is that if a test-and-set insert fails, the client will re-read the latest version and then take some action accordingly. The client could read an incorrect version in this case.

        jbellis Jonathan Ellis added a comment -

        The client should not do that. It knows the old version (or it couldn't be doing test-and-set), it knows the new version, and it knows that some but not all replicas were written. Doing a read cannot possibly tell it anything it doesn't already know.

        junrao Jun Rao added a comment -

        Here is a contrived but simple example. Suppose that some shopping cart info is stored in a column and multiple clients are updating it. A client may want to update the column value only if the value hasn't changed since it last read it. If the column value has indeed changed, the client wants to read the latest value back, MERGE it with its local value, and then reinsert. This will be hard to do with the test-and-set insert semantics described so far.
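
        For concreteness, the read-merge-reinsert loop this scenario implies (a hypothetical client API; merge() is application logic). The sticking point is whether the re-read is guaranteed to return the version that actually won:

        import java.util.function.BinaryOperator;

        // Sketch of the shopping-cart pattern: update only if unchanged since the last
        // read; on conflict, re-read, merge, and try again. VersionedClient is hypothetical.
        class MergeOnConflict {
            interface VersionedClient {
                final class Versioned {
                    final String value;
                    final long timestamp;
                    Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
                }
                Versioned get(String key);
                boolean testAndSet(String key, String newValue, long expectedTimestamp, long newTimestamp);
            }

            static void updateCart(VersionedClient client, String key, String localCart,
                                   BinaryOperator<String> merge) {
                while (true) {
                    VersionedClient.Versioned latest = client.get(key);   // must really be the latest version
                    String merged = merge.apply(latest.value, localCart);
                    if (client.testAndSet(key, merged, latest.timestamp, latest.timestamp + 1)) {
                        return;
                    }
                    // The column changed again in the meantime; loop, re-read, and re-merge.
                }
            }
        }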

        jbellis Jonathan Ellis added a comment -

        Like I said on the ML, in that case you would have to do a read that waits for all replicas, which will hurt availability, but I don't see any alternative.

        Again, this isn't intended to provide strong consistency. You can still come up with failure scenarios where inconsistent data could be read temporarily. I'm not sure it's going to be useful given that.

        sandeep_tata Sandeep Tata added a comment -

        Say we have 2 processes doing test-and-set.

        Process1: Read X, then T&S X+1
        Process2: Read X, then T&S X+2

        The two T&S are routed to A, B, C:

        @A T&S X+1
        @B T&S X+2
        @C T&S X+2
        @A T&S X+2 <------ fail, A already updated X
        @B T&S X+1 <------ fail
        @C T&S X+1 <------ fail

        X+1 fails, X+2 succeeds.
        A contains a bad value that can be repaired by majority read. Any holes?

        junrao Jun Rao added a comment -

        What are the version numbers set by the 2 updates? If the version numbers are the same, it's not clear if majority read helps since the conflict resolution logic is based on version number.

        sandeep_tata Sandeep Tata added a comment -

        There's a problem with my previous example. Here it is again, slightly improved:

        Process1:
        Read X
        T&S X+1, timestamp=5

        Process2:
        Read X
        T&S X+2, timestamp=4

        The two T&S are routed to A, B, C:

        @A T&S X+1
        @B T&S X+2
        @C T&S X+2
        @A T&S X+2 <------ fail, A already updated X
        @B T&S X+1 <------ fail
        @C T&S X+1 <------ fail

        The node that coordinated the T&S for Process1 reports a failure because 2 out of 3 T&S attempts failed.
        Process2 reports success because only 1 out of 3 failed and 2 succeeded.
        The correct value for the column is now X+2 (from Process2).

        Now, if node C dies, here's what we have on A & B:

        @A: X+1, ts=5
        @B: X+2, ts=4

        We can't resolve this using majority (we don't do that currently anyway). The latest timestamp rule will give us the wrong answer, and we'll lose the update from Process2. Bad.

        If the value at A had been fixed before C went down, we wouldn't be in this trouble:

        The node coordinating the T&S should always "cancel" the successful (minority) writes from a failed attempt. A simple way to implement cancelled writes would be to add a bit to that column indicating that it is "cancelled". The cancelling write should also use T&S to avoid trampling on a newer value in case the old invalid value was replaced in the meantime; if the cancelling T&S fails, we have nothing to worry about. When a node sees a cancelled column, it should replace that value with the value from a peer, even if the peer sends a value with an older timestamp. This would involve:
        1) a simple change to the read repair logic,
        2) a change to the read path to make sure cancelled columns are not read (if a read hits a cancelled column, trigger a quorum read), and
        3) modifying ColumnFamily to understand cancelled writes.
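
        A sketch of what the cancelled bit could look like and how resolution would treat it (field and method names are illustrative, not the real Column/ColumnFamily code):

        // Illustrative only. A per-column "cancelled" flag, set by the coordinator when it
        // cancels a minority write left over from a failed T&S attempt.
        class CancellableColumn {
            final byte[] value;
            final long timestamp;
            final boolean cancelled;

            CancellableColumn(byte[] value, long timestamp, boolean cancelled) {
                this.value = value;
                this.timestamp = timestamp;
                this.cancelled = cancelled;
            }

            // Read-repair resolution: a cancelled column loses to any non-cancelled peer column,
            // even one with an older timestamp; otherwise the usual latest-timestamp rule applies.
            static CancellableColumn resolve(CancellableColumn local, CancellableColumn peer) {
                if (local.cancelled && !peer.cancelled) return peer;
                if (peer.cancelled && !local.cancelled) return local;
                return (local.timestamp >= peer.timestamp) ? local : peer;
            }

            // Read path: a cancelled column should not be returned directly; seeing one
            // would trigger a quorum read instead.
            boolean directlyReadable() {
                return !cancelled;
            }
        }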

        The cancel-writes approach is cleaner than forcing a rollback to an older value by removing the bad value from the memtable. At first glance the rollback seems simpler, but what if the memtable has been flushed in the meantime? You can't really remove the bad value from an SSTable.

        What happens with hinted handoff:
        If you don't hear back from an intended peer, that T&S is assumed to have failed. Handing off a T&S doesn't make sense because it requires a success quorum. The failed node can come back up and get the correct values using read repair – business as usual.

        Of course, this still leaves us with a window of vulnerability. If C goes down (in the example above) and the node running Process1 dies before it has a chance to fix A, we still end up with an invalid update; this becomes a problem if C happens to be the node running Process1. But I'm guessing this is a good tradeoff to make in a system that relies on eventual consistency. The alternative would be a protocol based on tentative writes and confirmations – and that brings in a whole slew of new semantics and other complications.

        I'm going to scan the literature to see if there are alternative protocols in this consistency class – any thoughts on this approach? Any alternatives?

        sandeep_tata Sandeep Tata added a comment -

        Looks like it is much too tricky to guarantee convergence once we admit a test-and-set instruction without somewhat stronger consistency.

        I've started work on a simple implementation without these additional repair mechanisms – apps can use it safely if they either
        a) manage concurrency in the app (a bad assumption – I don't expect apps to be able to do this, and if they can, T&S doesn't really provide much value), or
        b) can dial up the consistency when they want to use this part of the API (very useful if they can do this without losing performance for the parts of the app that don't need consistency).

        jbellis Jonathan Ellis added a comment -

        I'd like to get the "dial up consistency" part in first rather than the other way around, since something like that (ZooKeeper?) would be useful in other situations as well, while T&S without consistency is not really useful, as you point out.

        jbellis Jonathan Ellis added a comment -

        ... actually, thinking about it a little more, if you are using ZK locks for consistency I don't really see the point of T&S at all. Read/write locks are described under shared locks here: http://zookeeper.wiki.sourceforge.net/ZooKeeperRecipes.

        sandeep_tata Sandeep Tata added a comment -

        Yup, I'm planning to submit a "dial up the consistency" patch first (modulo some failure scenarios, it's not super hard). I'm testing it on a local cluster today; I'll open a ticket and describe the design. Just wanted to give you guys a heads-up about the status on this one and the reasons for the consistency patch.

        Using ZK to implement concurrency control is not a great option:

        If you expect a high level of conflicts, you're in trouble with a shared-nothing architecture anyway, and talking to ZK for this is going to kill performance.

        If you expect a low level of conflicts, you're better off with optimistic concurrency control, and T&S is the right primitive for that.

        jbellis Jonathan Ellis added a comment -

        ISTM that we're not going to have a correct T&S without the kind of more brute-force locking that ZK gives us, but I'll wait to see what you come up with.

        jbellis Jonathan Ellis added a comment -

        It seems clear that this is unworkable in Cassandra core.


          People

          • Assignee:
            Unassigned
          • Reporter:
            jbellis Jonathan Ellis
          • Votes:
            0
          • Watchers:
            2
