Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: all

Description

      Writes to a row today can be run on any of the replicas that own the row. An additional set of APIs to perform "mastered" writes that funnel through a primary is important if applications have some operations that require higher consistency. Test-and-set is an example of one such operation that requires a higher consistency guarantee.

      To stay true to Cassandra's performance goals, this should be done in a way that does not compromise performance for apps that can deal with lower consistency and never use these APIs. That said, an app that mixes higher consistency calls with lower consistency calls should be careful that they don't operate on the same data.
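
      One possible shape for such calls, purely as a sketch (these method names are illustrative; they are not part of Cassandra's Thrift interface or the attached patch):

      interface MasteredOperations {
          // Write that funnels through the primary replica for the key,
          // rather than being accepted by whichever replica the client hits.
          void masteredInsert(String table, String key, String columnPath,
                              byte[] value, long timestamp);

          // Test-and-set: apply newValue only if the current value equals expected.
          // Only meaningful when all updates to the key are serialized by the primary.
          boolean testAndSet(String table, String key, String columnPath,
                             byte[] expected, byte[] newValue);
      }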

Attachments

      1. 225.patch (13 kB, Sandeep Tata)

Activity

        Jonathan Ellis added a comment - edited

        Per Jun's and my conversation with Cliff Moon at the hackathon, this is not something we want to pursue in Cassandra.

        Jonathan Ellis added a comment

        1. 5 seconds to elect a new master under ideal conditions isn't the point. The point is that when you partition, the side with a minority of nodes can't accept writes indefinitely. Or reads either, depending on how strict you are being.

        This alone is a deal-breaker for me; "always writable" is a major design goal for both Dynamo and Cassandra. (If you don't require a majority to elect a master, it's not really a master and all your benefits go away.) This is the wrong direction to move in; I think Jonas from Google was right when he said at NoSQL that the world is moving towards eventual consistency.

        2. Absent a full, production-tested implementation of mastered writes (the devil is in the details!) there is no way to say for sure, but my intuition is that eventual consistency + repair is going to be significantly less complex. And trying to support both models as Sandeep originally suggested is just a nightmare.

        Jun Rao added a comment

        I felt that we didn't give this one enough thought. The two criticisms of sending writes to a master node and letting it propagate the writes to the secondaries are the following:
        1. This reduces availability on master node failure.
        2. The implementation can be complicated.

        For 1, we really have to factor in the time for failure detection. If it takes 5 seconds to detect a node failure and re-electing a master takes only 1 second, the availability from the client's perspective is not reduced much.

        For 2, it is true that the implementation of mastered writes is more complicated than sending writes independently to all nodes. On the other hand, this makes it possible to efficiently sync up the replicas by just shipping portions of the commit logs. By relying on this single mechanism to resolve data inconsistency, we can get rid of passive read repair (which consumes resources on every read), hinted handoff, and the active repair that we are working on. Overall, the code complexity could be reduced.

        I see a couple of other benefits of mastered writes.
        a. This makes the handling of deletes easier. By synchronizing the logs among replicas, we can be sure when a tombstone can be safely discarded. This is better than our current approach that relies on a configured window. See the "handling deletes" thread we had earlier (http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200904.mbox/browser).
        b. This makes it possible for a replica to reason about how out of date it is and therefore provides some sort of timeline consistency.
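
        To make a. and b. concrete, here is a minimal sketch of the bookkeeping that shipping commit-log positions would allow; the class and method names are made up, not existing Cassandra code.

        import java.util.Collection;

        class LogPositionSketch {
            // A tombstone can be discarded once every replica has applied the
            // log past the position at which the delete was recorded.
            static boolean canDiscardTombstone(long tombstonePosition,
                                               Collection<Long> appliedPositionPerReplica) {
                for (long applied : appliedPositionPerReplica)
                    if (applied < tombstonePosition)
                        return false;   // some replica has not yet seen the delete
                return true;
            }

            // Timeline consistency: a replica's staleness is bounded by how far
            // its applied log position lags the primary's latest position.
            static long lagBehindPrimary(long primaryPosition, long replicaPosition) {
                return primaryPosition - replicaPosition;
            }
        }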

        Sandeep Tata added a comment

        These are interesting arguments.

        I'm happy to leave mastered writes + simple transactions out of trunk if no one else is dying to have these features.

        >Perhaps not, but if there is an approach worth trading clean design away for, this isn't it. The approach here only gives you the ability to have better atomicity across keys on the same master node, which is dubiously useful. Definitely not worth the price.

        I don't think it is that obvious. At least, not yet.
        See: http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html
        Any guesses on why they have this rather annoying abstraction?

        I'm neither recommending nor opposing this abstraction (or lack of it!). Just pointing out that some people seem to think that you can process simple transactions scalably for much cheaper this way.

        Jonathan Ellis added a comment

        sorry, should s/atomic/strongly consistent/ above. Neither block_for=N nor "mastered writes" give you atomicity. (Consider the x+=1 case Sandeep mentions.)

        Jonathan Ellis added a comment

        > You are subject to reduced availability only for the portions of your application that use the higher consistency APIs.

        (a) that's not the case in your patch, but let's accept that it's possible

        (b) you can already get per-key high consistency (with the same availability price) by specifying block_for=N
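
        The rule behind (b) is the usual quorum intersection: with N replicas, a write blocked on W acks and a read that consults R replicas must overlap whenever R + W > N. A trivial sketch (block_for stands in for W; this is not real Cassandra code):

        class QuorumSketch {
            // True when any read quorum must intersect the last acknowledged
            // write quorum, e.g. n=3, w=2, r=2.
            static boolean readSeesLatestWrite(int n, int w, int r) {
                return r + w > n;
            }
        }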

        > So, we can decide we never want atomicity guarantees for anything other than a single insert or get.

        Perhaps not, but if there is an approach worth trading clean design away for, this isn't it. The approach here only gives you the ability to have better atomicity across keys on the same master node, which is dubiously useful. Definitely not worth the price.

        > BTW using ZK based locks will lead to the very same availability loss

        Yes, you always lose availability when you choose strong consistency instead. I'm aware of the CAP theorem.

        > except at much worse performance.

        http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html#Performance leads me to believe that ZK-based locking will be fine for a lot of people, but my position does not change if I am wrong on that point.

        Trading performance for clean design is often acceptable. This is what Megastore does, layering transactions on top of Bigtable. They get terrible write performance (if AppEngine is indeed built on Megastore, which seems highly probable) but that's okay. At least for some apps.

        In the spirit of Megastore (http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx), doing block_for=N writes onto a CommitLog column family might be an interesting approach. It's not clear to me though that reader transactions could avoid partial reads. (Checking the xlog version before and after read might do the trick, though... just throwing out ideas.)
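
        A rough sketch of that version-check idea; the store interface here is hypothetical, not a real Cassandra or Megastore API.

        interface VersionedStore {
            long currentLogVersion(String key);    // version of the per-entity commit log
            byte[] read(String key, String column);
        }

        class OptimisticReadSketch {
            // Retry until no commit lands between the two version checks, so the
            // caller never observes a partially applied transaction.
            static byte[] consistentRead(VersionedStore store, String key, String column) {
                while (true) {
                    long before = store.currentLogVersion(key);
                    byte[] value = store.read(key, column);
                    if (store.currentLogVersion(key) == before)
                        return value;
                    // a write landed mid-read; try again
                }
            }
        }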

        If neither of those turn out to be acceptable, I'm okay with that. It's far better to have a clear design vision than to try to be all things to all people.

        Sandeep Tata added a comment

        You are subject to reduced availability only for the portions of your application that use the higher consistency APIs.

        There is no way, for instance, to implement a test-and-set primitive on a store that can only provide eventual consistency guarantees. Many applications will have small parts that need to have atomicity+isolation guarantees. Example: "read X, write X+1".

        So, we can decide we never want atomicity guarantees for anything other than a single insert or get. This will severely limit what apps can run on Cassandra. (BTW using ZK based locks will lead to the very same availability loss, except at much worse performance.)

        I'm suggesting we make a trade-off: support a small set of calls with better consistency guarantees so we can get a real test-and-set. If a failure + recovery happens before your app makes two consecutive higher-consistency calls, you never perceive any loss of availability.
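
        For concreteness, a minimal sketch of the primitive I mean. It is only correct because one node serializes every update to the key, which is exactly what independent writes to arbitrary replicas cannot give you; the names are illustrative, not code from the patch.

        import java.util.Map;
        import java.util.Objects;
        import java.util.concurrent.ConcurrentHashMap;

        class TestAndSetSketch {
            private final Map<String, Long> data = new ConcurrentHashMap<>();

            // Atomically sets key to newValue iff its current value equals expected.
            // Two concurrent "read X, write X+1" callers cannot both succeed with the
            // same expected value; the loser re-reads and retries.
            synchronized boolean testAndSet(String key, Long expected, long newValue) {
                if (!Objects.equals(data.get(key), expected))
                    return false;              // lost the race
                data.put(key, newValue);       // then replicate to the secondaries
                return true;
            }
        }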

        Jonathan Ellis added a comment

        I'm a strong -1 on patches that compromise availability in the face of a single node failure or require complex, error-prone code to work around such compromises.

        (Options that drastically change behavior are also a code smell.)

        This is a square peg, round hole problem. It's fundamentally at odds with Cassandra's design.

        HBase is a much better fit for what you want since it already has a single-master design.

        Sandeep Tata added a comment - edited

        Okay, this is an ugly first cut, but I want to put it out there so you guys have a chance to provide comments on the design as I hack up the rest of this feature.

        Basic idea: calls that go to the primary (currently defined as the first endpoint in the list) are applied locally and then asynchronously sent to the other replicas. If the node is not the primary, it forwards the request to the primary and waits for a response before acking.
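
        A simplified sketch of that write path (not the actual patch code; Endpoint and the send/forward calls stand in for whatever the messaging layer provides):

        import java.util.List;

        class MasteredWriteSketch {
            interface Endpoint {
                boolean isLocal();
                void sendAsync(byte[] mutation);          // fire-and-forget replication
                boolean forwardAndWait(byte[] mutation);  // blocking forward to the primary
            }

            // The primary (first endpoint in the replica list) applies the mutation
            // locally, then pushes it to the other replicas asynchronously. Any other
            // node forwards to the primary and waits for its ack before acking the client.
            static boolean write(List<Endpoint> replicas, byte[] mutation) {
                Endpoint primary = replicas.get(0);
                if (primary.isLocal()) {
                    applyLocally(mutation);
                    for (Endpoint replica : replicas.subList(1, replicas.size()))
                        replica.sendAsync(mutation);
                    return true;
                }
                return primary.forwardAndWait(mutation);
            }

            private static void applyLocally(byte[] mutation) { /* commit log + memtable */ }
        }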

        1. I didn't add a whole bunch of calls to the interface – I stole the block=1 values to mean mastered writes for now. This is not unreasonable: even non-blocking writes give you read-your-writes semantics, and block=1 doesn't really mean much right now. Of course, this is not clean, and I expect to change it once we're happy to expose this in the interface. You need to turn on "MasteredUpdatesForBlockOne" in the conf file to use it.

        2. This does not (yet) work in the presence of failures. It is possible that some failure scenarios lead to a state where two nodes both think they're "masters". The easiest way to solve this is a safe leader-election algorithm using ZooKeeper. That'll have to be in round 2 of the patch.

        Of course, if you don't turn on MasteredUpdatesForBlockOne, you never touch this code path.


People

      • Assignee: Unassigned
      • Reporter: Sandeep Tata
      • Votes: 0
      • Watchers: 1
