CASSANDRA-2494

Quorum reads are not monotonically consistent

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels:
      None

      Description

      As discussed in this thread,

      http://www.mail-archive.com/user@cassandra.apache.org/msg12421.html

      Quorum reads should be consistent. Assume we have a cluster of 3 nodes (X,Y,Z) and a replication factor of 3. If a write of N is committed to X, but not Y and Z, then a read from X should not return N unless the read is committed to at least two nodes. To ensure this, a read from X should wait for an ack of the read repair write from either Y or Z before returning.

      Are there system tests for Cassandra? If so, there should be a test similar to the original post in the email thread: one thread writes 1, 2, 3... at consistency level ONE, while another thread reads at consistency level QUORUM from a random host and verifies that each read is >= the last read.
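      The check the reading thread would perform can be sketched in Python. This is only the verification half of the proposed test; the cluster client that writes at ONE and reads at QUORUM is assumed, not shown.

```python
def is_monotonic(reads):
    """Return True if each value in the sequence is >= the previous one."""
    last = None
    for value in reads:
        if last is not None and value < last:
            return False
        last = value
    return True

# A run where quorum reads never go backwards passes the check:
assert is_monotonic([1, 1, 2, 3, 3, 4])
# A read that regresses (sees 3, then an older 2) fails it:
assert not is_monotonic([1, 2, 3, 2])
```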

      Attachments

  1. 2494-v2.txt
        22 kB
        Jonathan Ellis
      2. 2494.txt
        6 kB
        Jonathan Ellis

        Activity

        Hudson added a comment -

        Integrated in Cassandra #1017 (See https://builds.apache.org/job/Cassandra/1017/)
        provide monotonic read consistency
        patch by jbellis; reviewed by slebresne for CASSANDRA-2494

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156758
        Files :

        • /cassandra/trunk/CHANGES.txt
        • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RepairCallback.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RowRepairResolver.java
        • /cassandra/trunk/src/java/org/apache/cassandra/utils/FBUtilities.java
        • /cassandra/trunk/src/java/org/apache/cassandra/service/RangeSliceResponseResolver.java
        Jonathan Ellis added a comment -

        committed

        Sylvain Lebresne added a comment -

        Well, sort of – rpctimeout is working exactly as intended, i.e., to prevent waiting indefinitely for a node that died after we sent it a request. Treating it as "max time to respond to client" has never really been correct. (E.g., in the CL > ONE case we can already wait up to rpctimeout twice, one for the original digest read set, and again for the data read after mismatch.) So I don't think we should try to be clever with that here.

        Fair enough. It would probably be useful to bring rpctimeout's meaning closer to "max time to respond to client". I've created CASSANDRA-3018 for that, though.

        +1 on v2.

        Jonathan Ellis added a comment -

        There are a number of cases where scheduleRepairs may not have been called

        defaulted repairResults to emptyList.

        That new wait can extend the rpc timeout to almost twice what it should be

        Well, sort of – rpctimeout is working exactly as intended, i.e., to prevent waiting indefinitely for a node that died after we sent it a request. Treating it as "max time to respond to client" has never really been correct. (E.g., in the CL > ONE case we can already wait up to rpctimeout twice, one for the original digest read set, and again for the data read after mismatch.) So I don't think we should try to be clever with that here.

        Shouldn't we give the same love to range requests, now that we do repairs there too

        done.

        Sylvain Lebresne added a comment -

        Ok, I now see what you mean. Makes perfect sense.

        Comments on the patch:

        • There are a number of cases where scheduleRepairs may not have been called (if the reads for repair time out, and/or we had no or only one response, or removeDeleted removes everything), so repairResults will be null in those cases. In SP, we should check for it.
        • That new wait can extend the rpc timeout to almost twice what it should be. I agree that it is not a huge deal, but by exposing the 'startTime' stored in RepairCallback we can make it so we don't extend it that way.
        • Shouldn't we give the same love to range requests, now that we do repairs there too?
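        The second point can be sketched as follows. This is illustrative Python, not the actual StorageProxy code; RPC_TIMEOUT's value and all names here are assumptions. Keeping the read's startTime bounds the extra wait for repair acks by whatever remains of the original rpc timeout budget, and defaulting repairResults to an empty list makes the null case harmless.

```python
import time

RPC_TIMEOUT = 10.0  # seconds; illustrative value, not Cassandra's default

def remaining_timeout(start_time, rpc_timeout=RPC_TIMEOUT):
    """Time left in the overall rpc timeout budget, never negative."""
    elapsed = time.monotonic() - start_time
    return max(0.0, rpc_timeout - elapsed)

def await_repair_results(repair_results, start_time):
    """Wait for read-repair acks, but only within the remaining budget."""
    repair_results = repair_results or []  # guard the 'null' case
    deadline = start_time + RPC_TIMEOUT
    for ack in repair_results:  # acks modeled as threading.Event-like objects
        ack.wait(timeout=max(0.0, deadline - time.monotonic()))

# The extra wait never stretches the total past RPC_TIMEOUT:
start = time.monotonic() - 20.0   # pretend the read already took 20 s
assert remaining_timeout(start) == 0.0
await_repair_results(None, time.monotonic())  # null case is a no-op
```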
        Jonathan Ellis added a comment -

        I don't see any reason not to guarantee that the replicas we read from provide monotonic read consistency.

        Patch attached to do this.

        Sean Bridges added a comment - edited

        To be clear, this is a new guarantee. The current guarantee is R+W>N gives you consistency. This bug is asking that a successful quorum read of A means that A has been committed to a quorum of nodes.

        "How can we ensure the quorum read property that you want ?"

        If, when reading at quorum, no quorum can be found that agrees on a particular value, then the coordinator ( ? ) will wait for acks of read repair writes (or perhaps just do normal writes) from a sufficient number of nodes to ensure that the value has been committed to a quorum of nodes.

        Without this new guarantee it is hard for readers to function correctly. The reader does not know whether the quorum write failed or is still in progress, so without reading at ALL, the R+W>N guarantee does not help the reader.
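        A toy simulation of the proposed coordinator rule. This is illustrative Python, not Cassandra code; treating the repair-write acks as synchronous is a simplifying assumption of the sketch.

```python
def quorum_read(replicas, quorum=2):
    """replicas: dict of node -> (timestamp, value).

    Return the newest value, but first make sure it sits on at least a
    quorum of replicas by pushing it to stale ones (the 'read repair
    write') and, in this sketch, treating their acks as synchronous.
    """
    newest = max(replicas.values())  # resolve by timestamp
    holders = [n for n, col in replicas.items() if col == newest]
    if len(holders) < quorum:
        for node in replicas:
            replicas[node] = newest  # repair write + ack, in one step here
    return newest[1]

# X saw the write of N (committed at ONE); Y and Z did not.
cluster = {"X": (15, "N"), "Y": (10, "M"), "Z": (10, "M")}
assert quorum_read(cluster) == "N"
# N now sits on a quorum, so a later quorum read cannot go backwards.
assert sum(1 for col in cluster.values() if col == (15, "N")) >= 2
```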

        Peter Schuller added a comment -

        The issue is that of failed QUORUM writes. I.e., you design your system to use QUORUM writes and QUORUM reads, and expect that once a QUORUM read sees a given piece of data, a subsequent QUORUM read will also see it (or later data). A failed QUORUM write that was replicated to fewer than a quorum of nodes would be visible to QUORUM reads that happen to touch one of those replicas, but there is no guarantee that subsequent reads see it.

        I was under the impression this was never an intended guarantee. Apparently I may be wrong about that given the jbellis quote above. In either case, it is certainly not an actual guarantee given by the current implementation.

        The guarantee that a successful QUORUM write is seen by a subsequent QUORUM read is, as far as I can tell, not in question here.

        Sylvain Lebresne added a comment -

        The problem is that you are considering the consistency of reads but not writes. The guarantee is: "quorum reads will not see an old quorum write once a quorum read sees a new quorum write". Period. If you don't consider the consistency of a write, consider the case of a CL.ANY write. In this case, the update may not be on any replica at all. How can we ensure the quorum read property that you want? Do we query all nodes for quorum reads in case there is a hint somewhere?

        If you look at the Consistency part of http://wiki.apache.org/cassandra/ArchitectureOverview, it seems to me pretty clear that both read and write consistency levels are involved in achieving strong consistency. So I would hope 'most people' are aware of that.

        Sean Bridges added a comment -

        I think the guarantee of quorum reads not seeing old writes once a quorum read sees a new write is very useful. I suspect most people already think that this guarantee holds, including, it seems, Jonathan Ellis, whose quote can be found in the email thread linked to in the bug:

        "The important guarantee this gives you is that once one quorum read sees the new value, all others will too. You can't see the newest version, then see an older version on a subsequent write [sic, I assume he meant read], which is the characteristic of non-strong consistency"

        Peter Schuller added a comment -

        Ok, so my last suggestion is in fact broken. A counter example is:

        A: column @ t1
        B: column @ t2
        C: column @ t3

        If A + B is participating, A's column @ t1 has timestamp quorum and would be selected. If B + C is participating, B's column is picked. Thus, a read where B + C participates will see data that will be reverted once A + B happens to be picked.

        Note to self: Think before posting.
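        The regression in the counter-example above can be reproduced mechanically. This is illustrative Python; the key observation is that with two participants, "the greatest timestamp held by a quorum" is simply the older of the two columns.

```python
def broken_quorum_value(participants):
    """The flawed rule: return the column at the greatest timestamp
    that a quorum (here: both participants) has reached."""
    return min(participants)  # with two nodes, that is the older column

A, B, C = (1, "a"), (2, "b"), (3, "c")  # the three replicas above

first = broken_quorum_value([B, C])   # B + C participate: B's column (t2)
second = broken_quorum_value([A, B])  # A + B participate: A's column (t1)
assert first == (2, "b")
assert second == (1, "a")
assert second[0] < first[0]  # the second read went backwards in time
```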

        Peter Schuller added a comment -

        I don't think anyone is claiming otherwise, unless I'm misunderstanding. The problem is that while the "if successfully written to quorum, subsequent quorum reads will see it" guarantee is indeed maintained, it is possible for quorum reads to see data go backwards (on a timeline) in the event of a failed attempted quorum write. This includes the possibility of reads seeing data that then permanently vanishes, even though you only lost say 1 node that you designed your cluster for surviving (RF >= 3, QUORUM). ("lost 1 node" can be substituted with "killed 1 node in periodic commit mode")

        I still don't think this is a violation of what was promised, but I can see how making the further guarantee would make for more useful consistency semantics in some cases.

        With respect to implicit write: An alternative is to adjust reconciliation logic when applied as part of reads (as opposed to AES, hinted hand-off, writes) to take consistency level into account and only consider columns whose timestamp is >= the greatest timestamp that has quorum (off the top of my head I think that should be correct in all cases, but I haven't thought it through terribly carefully).

        Stu Hood added a comment - edited

        W plus R must be greater than N for consistency.

        EDIT: And adding a blocking implicit write step to QUORUM reads by waiting for read repair is not reasonable.
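        The condition is easy to state as a predicate (illustrative Python). With RF = 3, as in this ticket's description, a ONE write plus a QUORUM read fails it, which is why the description's test can observe non-monotonic reads:

```python
def overlapping_quorums(r, w, n):
    """R + W > N guarantees every read quorum intersects every write quorum."""
    return r + w > n

N = 3                    # replication factor from the description
QUORUM = N // 2 + 1      # = 2
assert overlapping_quorums(QUORUM, QUORUM, N)   # QUORUM write + QUORUM read
assert not overlapping_quorums(QUORUM, 1, N)    # ONE write + QUORUM read
```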

        Jeremiah Jordan added a comment -

        I would think that reads at QUORUM should never go backwards. Even if the Write was at ZERO. If there were writes to the cluster of a=1 time=5, a=2 time=10, a=3 time=15, and I do a read at QUORUM which tells me a=3 time=15, I should not be able to do another read at QUORUM and get a=2 time=10.

        Sean Bridges added a comment -

        Peter Schuller wrote,

        "However, it sounds like what is being asked for is not that they don't propagate in the event of a write "failure", but just that reads don't see the writes until they are sufficiently propagated to guarantee that any future QUORUM read will also see the data."

        Yes, that is the issue. The comment in the bug about writing at ONE and reading at QUORUM is just a way of testing this new guarantee in a distributed test, if Cassandra has those.

        Peter Schuller added a comment -

        As far as I can tell, the consistency being asked for was never promised by Cassandra and is in fact not expected.

        The expected behavior of writes is that they propagate; the difference between ONE and QUORUM is just how many replicas are required to receive a write before success is returned to the client. For reads, that means you may get lucky at ONE or you may get lucky at QUORUM; the positive guarantee is in the case of a completed QUORUM write followed by a QUORUM read.

        So just to be clear, although I don't think this is what is being asked for: As far as I know, it has never been the case, nor the intent to promise, that a write which fails is guaranteed not to eventually complete. Simply "fixing" reads is not enough; by design the data will be replicated during read-repair and AES - this is how consistency is achieved in Cassandra.

        However, it sounds like what is being asked for is not that they don't propagate in the event of a write "failure", but just that reads don't see the writes until they are sufficiently propagated to guarantee that any future QUORUM read will also see the data. I can understand that is desirable, in the sense of achieving monotonically forward-moving data as the benchmark/test from the e-mail thread does. Another way to look at it is that maybe you never want to read data successfully prior to achieving a certain level of replication, in order to avoid a client ever seeing data that may suddenly go away due to e.g. a node failure, in spite of said failure not exceeding the number of failures the cluster was designed to survive.

        So the key point would be the bit about guaranteeing that any "future QUORUM read will see the data or data subsequently overwritten", and actively read-repairing and waiting for it to happen would take care of that. The important part is ensuring that a quorum of nodes have seen the data; one should not wait for a quorum to agree on the current version of the data, as that would create potentially unbounded round-trips on hotly contended data.

        Thing to consider: One might think about cases where read-repair is currently not done, like range slices, and how an implementation that requires read repair for consistency affects that.


          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Sean Bridges
            Reviewer:
            Sylvain Lebresne
          • Votes:
            2
            Watchers:
            6

            Dates

            • Created:
              Updated:
              Resolved:
