Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: None
    • Labels:
      None

      Description

      We should take a look at RAMP transactions, and figure out if they can be used to provide more efficient LWT (or LWT-like) operations.

        Issue Links

        1. Batching SELECTs Sub-task Open Unassigned
         

          Activity

          pbailis Peter Bailis added a comment -

          As I mentioned at the Next-Generation Cassandra Conference, I'm happy to get the ball rolling on an implementation of RAMP in Cassandra.

          To reiterate a few points from the NGCC, I think RAMP could provide some useful isolation guarantees for Cassandra's Atomic Batch operations (either none of the updates will be visible, or all are visible) as well as provide the basis for "consistent" global secondary index updates in CASSANDRA-6477. I've posted my slides from the NGCC on SpeakerDeck; the Cassandra-specific implementation details start on transition number 287.
          https://speakerdeck.com/pbailis/scalable-atomic-visibility-with-ramp-transactions

          I have some time to hack on this and am willing to work on a patch and/or hammer out the Cassandra-specific design with you all over JIRA or otherwise!

          benedict Benedict added a comment -

          We were discussing this internally just a few days ago. I'm very keen to see this introduced, as I think it could have tremendous potential. There has been a side discussion about CASSANDRA-6108 and whether it would make an implementation simpler, by virtue of providing a unique commit id that is more robust than a server-generated timestamp; however, I am of the opinion this could be worked in later. Using the timestamp either way certainly seems the easiest solution, it will just benefit from improved timestamps when we get them.

          One important question for me is whether we maintain an expired-read-buffer separate from the write-buffer; optimally we would clear records from the write-buffer as soon as they make it into memtables, only we then need to track values that are overwritten in a separate read-buffer. It might be slightly easier to simply keep them longer in the write-buffer; however, this could lead to significantly larger memory overheads, as we would keep all writes twice (instead of only those that are overwritten).

          Either way, I'm currently of the opinion we should target either a very narrow expired-read-buffer window, or one with a fixed size, so that we can keep a tight bound on the resources dedicated to these transactions. We also need to take care with how we safely inform a reader that their read could not be safely serviced from this window so that they may retry, and to fail if reads consistently fail to reach consensus.
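
A minimal sketch of the fixed-size option, assuming overwritten versions are kept in insertion order and looked up by exact timestamp. The class and method names are hypothetical and this is not Cassandra code; an empty result stands for "this read can no longer be served from the window, so the reader should retry".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Hypothetical fixed-size buffer of overwritten versions; not Cassandra code. */
class ExpiredReadBuffer {
    /** One overwritten version of a key (field names are illustrative). */
    static final class Version {
        final String key;
        final long timestamp;
        final String value;

        Version(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    private final int capacity;
    private final Deque<Version> versions = new ArrayDeque<>();

    ExpiredReadBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Record a version that was just overwritten, evicting the oldest entry when full. */
    void recordOverwritten(String key, long timestamp, String value) {
        if (versions.size() == capacity) versions.removeFirst();
        versions.addLast(new Version(key, timestamp, value));
    }

    /** An empty result means the version fell out of the window and the caller should retry. */
    Optional<String> read(String key, long timestamp) {
        for (Version v : versions) {
            if (v.key.equals(key) && v.timestamp == timestamp) return Optional.of(v.value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ExpiredReadBuffer buffer = new ExpiredReadBuffer(2);
        buffer.recordOverwritten("B", 1, "b1");
        buffer.recordOverwritten("C", 1, "c1");
        buffer.recordOverwritten("B", 2, "b2"); // buffer is full, so B@1 is evicted
        System.out.println(buffer.read("C", 1).isPresent()); // still inside the window
        System.out.println(buffer.read("B", 1).isPresent()); // evicted, caller must retry
    }
}
```

The hard capacity is what keeps the memory bound tight; the cost is that a slow reader can find its version already evicted.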

          There are some related problems as well, namely how we expose this functionality to the user. Currently we have no concept of a "batched read" so this might need protocol support, but that's probably a separate discussion/problem. As far as writes are concerned, I'd be inclined to simply replace current LOGGED batches entirely.

          jbellis Jonathan Ellis added a comment -

          > Using the timestamp either way certainly seems the easiest solution, it will just benefit from improved timestamps when we get them.

          To be clear, using non-unique ts as RAMP id is broken, but I agree that we should proceed here with the assumption that we'll solve the unique ts problem; if that doesn't work out we can figure out a plan B.
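
To illustrate why a bare timestamp is not enough: two batches issued in the same microsecond by different coordinators would share an id, so a reader asking for "the version at timestamp T" could not tell them apart. The sketch below shows one possible way to disambiguate. The field layout (timestamp, coordinator id, per-coordinator sequence) is purely an assumption for illustration and is not the design proposed in CASSANDRA-6108.

```java
import java.util.Objects;

/** Hypothetical transaction id; the field layout is an assumption made for illustration. */
final class TxnId implements Comparable<TxnId> {
    final long timestampMicros;
    final int coordinatorId; // assumed to be unique per node
    final int sequence;      // assumed per-coordinator counter within one microsecond

    TxnId(long timestampMicros, int coordinatorId, int sequence) {
        this.timestampMicros = timestampMicros;
        this.coordinatorId = coordinatorId;
        this.sequence = sequence;
    }

    /** Orders by timestamp first, so last-write-wins ordering is preserved. */
    @Override
    public int compareTo(TxnId other) {
        int c = Long.compare(timestampMicros, other.timestampMicros);
        if (c == 0) c = Integer.compare(coordinatorId, other.coordinatorId);
        if (c == 0) c = Integer.compare(sequence, other.sequence);
        return c;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TxnId)) return false;
        TxnId t = (TxnId) o;
        return timestampMicros == t.timestampMicros
                && coordinatorId == t.coordinatorId
                && sequence == t.sequence;
    }

    @Override
    public int hashCode() {
        return Objects.hash(timestampMicros, coordinatorId, sequence);
    }

    public static void main(String[] args) {
        TxnId fromNode1 = new TxnId(500L, 1, 0);
        TxnId fromNode2 = new TxnId(500L, 2, 0);
        // Same timestamp, yet the two batches remain distinguishable.
        System.out.println(fromNode1.equals(fromNode2));
    }
}
```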

          > Currently we have no concept of a "batched read" so this might need protocol support

          Should we just make it automatic for IN queries? That would leave the option of doing a bunch of asynchronous SELECTs if you wanted to opt out.

          > As far as writes are concerned, I'd be inclined to simply replace current LOGGED batches entirely.

          Agreed.

          benedict Benedict added a comment -

          > Should we just make it automatic for IN queries? That would leave the option of doing a bunch of asynchronous SELECTs if you wanted to opt out.

          This is probably the easiest first step, but batches support hitting multiple tables simultaneously, so ideally we would support reads that do the same.

          iamaleksey Aleksey Yeschenko added a comment -

          Or, alternatively, we just don't invent new unnecessary concepts (batch reads) to justify hypothetical things we could do that nobody asked us for.

          RAMP does sound very useful for proper global indexes and materialized views, however. If RAMP in C* is at all feasible, then maybe we should start with a proper implementation of global indexes (based on RAMP) rather than wasting time on an EC solution that will be thrown out eventually?

          slebresne Sylvain Lebresne added a comment -

          > maybe we should start with a proper implementation of global indexes (based on RAMP) rather than wasting time on a EC solution that will be thrown out eventually?

          If the global indexes implementation relies too much on the details of using RAMP (versus simply using our current batchlog for instance), we're probably doing it wrong. So I'd rather not put an official stamp on "global indexes requires RAMP first" (of course, if it happens that RAMP makes it first, let's by all means use it right away).

          tupshin Tupshin Harper added a comment - - edited

          Cross table consistent reads are of fundamental importance.

          Once you allow that they are useful for consistent index reads, then you have admitted that they are useful for direct consumption by users, since we are constantly advising them to build their own index solutions because 2i are horrendously weak. That pressure will be only slightly reduced with global indexes.

          Even separate from custom (client-side) 2i implementations, having all or nothing read visibility of writes spanning tables captures fundamental business logic that is either painfully worked around today, or else is glossed over as statistically unlikely (depending on the r/w patterns) and the race conditions duly ignored.

          It would be a tragic mistake to ignore the benefits of the gains in correctness that can be achieved.

          tjake T Jake Luciani added a comment -

          Tupshin Harper If we change the current LOGGED batches to include RAMP then it will work with Global Indexes pretty simply.

          jjordan Jeremiah Jordan added a comment -

          We need to be careful about adding RAMP to things automatically. RAMP has a requirement that anything being read/written that way is always written in the same groupings. If you update B,C and then update A,B, you can't read B,C successfully anymore, as the times on B and C will never match.

          pbailis Peter Bailis added a comment - - edited

          > RAMP has a requirement that anything being read/written that way is always written in the same groupings. If you update B,C and then update A,B, you can't read B,C successfully anymore, as the times on B and C will never match.

          This isn't entirely correct. Let's say I do an atomic batch B1 that writes B = 1 and C = 1 with timestamp 1, then you do an atomic batch B2 that writes A = 2 and B = 2 at timestamp 2. Under RAMP, subsequent batch reads from B and C will return B = 2, C = 1. The timestamps on B and C will indeed (as you point out) not match, but simply returning matching timestamps is not the goal: the goal is that if you read any write in a given batch, you will be able to read the rest of the writes in the batch (i.e., if you also attempt to read any other items that were written in the batch, you will see the corresponding writes).
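
The goal stated above can be written down as a small check. This is a sketch under assumptions: each returned version carries the set of other keys written by the same batch (its "siblings"), and a read is "fractured" if it returns a version while also returning an older version of one of that version's siblings. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical checker for the atomic visibility property described above. */
class AtomicVisibilityCheck {
    static final class Version {
        final long timestamp;
        final Set<String> siblings; // other keys written by the same batch

        Version(long timestamp, String... siblings) {
            this.timestamp = timestamp;
            this.siblings = new HashSet<>(Arrays.asList(siblings));
        }
    }

    /** True if some returned version has a sibling that was read at an older timestamp. */
    static boolean isFractured(Map<String, Version> result) {
        for (Version seen : result.values()) {
            for (String sibling : seen.siblings) {
                Version other = result.get(sibling);
                if (other != null && other.timestamp < seen.timestamp) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Version> read = new HashMap<>();
        read.put("B", new Version(2, "A")); // from batch B2 {A, B} at timestamp 2
        read.put("C", new Version(1, "B")); // from batch B1 {B, C} at timestamp 1
        // Timestamps on B and C differ, yet no batch is only partially visible.
        System.out.println(isFractured(read));

        read.put("A", new Version(0)); // an A that predates B2's write to A
        // B2's write to B is visible but its write to A is not.
        System.out.println(isFractured(read));
    }
}
```

The first read (B = 2, C = 1) passes the check because C was not part of batch B2, so nothing from B2 is missing.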

          jjordan Jeremiah Jordan added a comment -

          If you add a B,C write at time 1.5, how do you know you are getting the right C? If B says it was written with A @2 and the C you read says it was written with B @1? You lost the info that the real C you should be getting is the one from B,C@1.5.

          benedict Benedict added a comment -

          You will read timestamp 2 as the latest value, and will request the latest value as of that timestamp, which will be 1.5 for C; since 1.5 has to be visible (if it was written with RAMP transactions and you've seen it, it's visible) you'll get correct behaviour. If it isn't written with a RAMP transaction, it's undefined which you see, and that is also correct.

          pbailis Peter Bailis added a comment -

          Jeremiah Jordan Good question. The short answer is that this behavior (reading A@2 and C@1) is well-defined under RAMP. Just like in Cassandra today, the fact that I read a write at time 500 doesn't mean I'm going to see the effects of all writes that occur before time 500! Rather, the guarantee that RAMP adds is that, once you see the effects of one write in the batch, you'll see all of the writes in the batch.

          So, in your scenario, you have three batches: B1 {B=1, C=1} at time 1, B1.5 {B=1.5, C=1.5} at time 1.5, and B2 {A=2, B=2} at time 2. You could get the behavior you describe above if B1 executes and completes, B2 executes and completes, and we subsequently read sometime before B1.5 completes. So, I guess I disagree that "the real C you should be getting is the one from [the batch at time 1.5]" because you didn't yet see the effect of any writes from B1.5. However, once B1.5 completes, you will be guaranteed to read C at time 1.5.

          It may be easier to think of RAMP as providing the ability to take each of your normal reads and writes under LWW and turn them into multi-column, multi-table writes that are all going to be visible/reflected in the table state (once completed). There are no special ordering guarantees beyond what Cassandra already provides; if you need strong ordering guarantees (e.g., enforcing sequential assignment of timestamps), it's a case for CAS.
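
A minimal single-process sketch of how a reader can uphold "once you see one write in the batch, you'll see all of them", loosely modelled on the two-round read of RAMP-Fast. Everything here is an assumption made for illustration: the class and method names are hypothetical, one in-memory map stands in for all partitions, and the timestamps 10 and 15 stand in for 1 and 1.5. The demo covers the case where batch B1.5 has committed on B's partition but not yet on C's.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of a RAMP-style two-round read; not Cassandra code. */
class RampFastReadSketch {
    static final class Version {
        final String value;
        final long timestamp;
        final Set<String> siblings; // other keys written by the same batch

        Version(String value, long timestamp, Set<String> siblings) {
            this.value = value;
            this.timestamp = timestamp;
            this.siblings = siblings;
        }
    }

    // key -> (timestamp -> version); holds prepared and committed versions alike
    private final Map<String, Map<Long, Version>> versions = new HashMap<>();
    // key -> timestamp of the newest committed version
    private final Map<String, Long> lastCommitted = new HashMap<>();

    /** Write round one: store every version with its sibling metadata, not yet visible. */
    void prepare(long timestamp, Map<String, String> writes) {
        for (Map.Entry<String, String> w : writes.entrySet()) {
            Set<String> siblings = new HashSet<>(writes.keySet());
            siblings.remove(w.getKey());
            versions.computeIfAbsent(w.getKey(), k -> new HashMap<>())
                    .put(timestamp, new Version(w.getValue(), timestamp, siblings));
        }
    }

    /** Write round two, applied per key: expose the version to first-round reads. */
    void commit(String key, long timestamp) {
        lastCommitted.merge(key, timestamp, Long::max);
    }

    Map<String, Version> read(List<String> keys) {
        // Round one: newest committed version of each key.
        Map<String, Version> result = new HashMap<>();
        for (String key : keys) {
            Long ts = lastCommitted.get(key);
            if (ts != null) result.put(key, versions.get(key).get(ts));
        }
        // Per key, the newest timestamp that a returned sibling says must exist.
        Map<String, Long> required = new HashMap<>();
        for (Version v : result.values()) {
            for (String sibling : v.siblings) {
                if (keys.contains(sibling)) required.merge(sibling, v.timestamp, Long::max);
            }
        }
        // Round two: fetch by exact timestamp whatever round one missed.
        for (Map.Entry<String, Long> r : required.entrySet()) {
            Version seen = result.get(r.getKey());
            if (seen == null || seen.timestamp < r.getValue()) {
                result.put(r.getKey(), versions.get(r.getKey()).get(r.getValue()));
            }
        }
        return result;
    }

    static Map<String, String> batch(String k1, String v1, String k2, String v2) {
        Map<String, String> writes = new HashMap<>();
        writes.put(k1, v1);
        writes.put(k2, v2);
        return writes;
    }

    public static void main(String[] args) {
        RampFastReadSketch store = new RampFastReadSketch();
        store.prepare(10, batch("B", "1", "C", "1"));
        store.commit("B", 10);
        store.commit("C", 10);
        store.prepare(15, batch("B", "1.5", "C", "1.5"));
        store.commit("B", 15); // the commit for C has not arrived yet

        Map<String, Version> r = store.read(Arrays.asList("B", "C"));
        // Round one alone would pair B=1.5 with C=1; round two repairs C.
        System.out.println("B=" + r.get("B").value + " C=" + r.get("C").value);
    }
}
```

If the read had instead happened before any commit of the later batch, round one would return both keys at timestamp 10 and no second round would be needed, which matches the "you didn't yet see the effect of any writes" case above.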

          jbellis Jonathan Ellis added a comment -

          > Cross table consistent reads are of fundamental importance.

          Maybe so, but I'm with Aleksey that we don't need to invent extra syntax to support "batch reads" right away. We can certainly add it later when we have a better understanding of the use cases involved.

          tupshin Tupshin Harper added a comment -

          I am absolutely fine with vetting it as part of another feature (indexes) before exposing a new API to provide explicit support for RAMP transactions. I'm simply refuting the "hypothetical things we could do that nobody asked us for" part. Just because nobody thought to ask for this specific form of consistency doesn't mean the practical benefits are at all unclear.

          tupshin Tupshin Harper added a comment -

          I also want to point out that Aleksey Yeschenko's response to global indexes (CASSANDRA-6477) was: "I think we should leave it to people's client code. We don't need more complexity on our read/write paths when this can be done client-side."

          That combined with "alternatively, we just don't invent new unnecessary concepts (batch reads) to justify hypothetical things we could do that nobody asked us for" would leave us with absolutely no approach to achieve consistent cross-partition indexes through either client or server-side code.

          pmcfadin Patrick McFadin added a comment -

          I don't get how cross-partition consistent reads are seen as an edge case. I feel this is the primary use case. I've passed this by several users and got some measurable excitement.

          benedict Benedict added a comment -

          I can say that, from the point of view of a prior target consumer, the addition of cross-cluster consistent reads would have been exciting for me.

          On implementation details, thinking more from the point of view of my prior self, I would love to see this support streamed batches of arbitrary size. By which I mean I would have liked to start a write transaction, stream arbitrary amounts of data, and have it commit with complete isolation or not. To this end, I'm leaning towards writing the data straight into the memtables, but maintaining a separate set of "uncommitted" transaction ids, which can be filtered out at read time. If a record is overwritten either before or after it is committed, it is moved to the read-buffer. I doubt this will be dramatically more complex, but the approach to implementation is fundamentally different. It seems to me supporting transactions of arbitrary size is an equally powerful win to consistent transactions.
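
A rough sketch of the "filter uncommitted ids at read time" idea, under simplifying assumptions: a list per key stands in for the memtable, older cells stay in that list instead of moving to a read-buffer, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of writing straight to a memtable and hiding uncommitted cells. */
class UncommittedFilterSketch {
    static final class Cell {
        final long txnId;
        final String value;

        Cell(long txnId, String value) {
            this.txnId = txnId;
            this.value = value;
        }
    }

    private final Map<String, List<Cell>> memtable = new HashMap<>(); // newest cell last
    private final Set<Long> uncommitted = new HashSet<>();

    void begin(long txnId) {
        uncommitted.add(txnId);
    }

    /** Streamed writes land in the memtable immediately, however many there are. */
    void write(long txnId, String key, String value) {
        memtable.computeIfAbsent(key, k -> new ArrayList<>()).add(new Cell(txnId, value));
    }

    /** Committing only shrinks the id set; no data is copied or rewritten. */
    void commit(long txnId) {
        uncommitted.remove(txnId);
    }

    /** Returns the newest cell whose transaction is no longer in the uncommitted set. */
    Optional<String> read(String key) {
        List<Cell> cells = memtable.getOrDefault(key, Collections.<Cell>emptyList());
        for (int i = cells.size() - 1; i >= 0; i--) {
            Cell cell = cells.get(i);
            if (!uncommitted.contains(cell.txnId)) return Optional.of(cell.value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        UncommittedFilterSketch table = new UncommittedFilterSketch();
        table.begin(7L);
        table.write(7L, "A", "a1");
        table.write(7L, "B", "b1");
        System.out.println(table.read("A").isPresent()); // hidden while txn 7 is open
        table.commit(7L);
        System.out.println(table.read("A").isPresent()); // visible after commit
    }
}
```

The appeal of this shape is that commit cost does not grow with the number of streamed writes; the cost moves to reads, which must consult the id set.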

          benedict Benedict added a comment -

          Another separate point to consider, as a follow up: RAMP transactions may also permit us to provide consistent reads with fewer than QUORUM nodes involved. If we are performing a consistent read with a known transaction id, we only need to ensure the node has seen the totality of that transaction (i.e. any bulk insert has completed its first round, but not necessarily its second (commit) round) to be certain we have all of the data we need to answer the query correctly. So we can potentially answer QUORUM queries at the coordinator only. Note this only works if the coordinator has seen exactly this transaction id, though some similar optimisations are likely possible to expand that.

          I can envisage answering multiple queries with the following scheme:

          1) start transaction, by asking for the latest transaction_id from a given coordinator for the data we are interested in;
          2) query all coordinators directly for the regions they own, providing them with the transaction_id.

          Any of those queries that reach a coordinator updated with the given transaction_id can potentially be answered with only that coordinator's involvement.

          Further, to outline a sketch client-side API, I would suggest something like:

          Txn txn = client.begin();
          Future<ResultSet> rsf1 = txn.execute(stmt1);
          Future<ResultSet> rsf2 = txn.execute(stmt2);
          ...
          txn.execute();
          ResultSet rs1 = rsf1.get();
          ...
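
To make the deferred-completion behaviour of that sketch concrete, here is a hypothetical in-memory stand-in. `TxnSketch` is not a driver API: plain strings replace statements and result sets, and a `Function` replaces the cluster. The only point it illustrates is that the futures stay incomplete until the no-argument `execute()` runs the whole group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical stand-in for the Txn type sketched above; not a driver API. */
class TxnSketch {
    private final Function<String, String> backend; // statement -> result, stands in for the cluster
    private final List<String> statements = new ArrayList<>();
    private final List<CompletableFuture<String>> pending = new ArrayList<>();

    TxnSketch(Function<String, String> backend) {
        this.backend = backend;
    }

    /** Queue a statement; the returned future stays incomplete until execute() runs. */
    CompletableFuture<String> execute(String statement) {
        CompletableFuture<String> future = new CompletableFuture<>();
        statements.add(statement);
        pending.add(future);
        return future;
    }

    /** Run every queued statement as one unit, then complete the futures in order. */
    void execute() {
        for (int i = 0; i < statements.size(); i++) {
            pending.get(i).complete(backend.apply(statements.get(i)));
        }
        statements.clear();
        pending.clear();
    }

    public static void main(String[] args) {
        TxnSketch txn = new TxnSketch(stmt -> "rows for " + stmt);
        CompletableFuture<String> rsf1 = txn.execute("stmt1");
        CompletableFuture<String> rsf2 = txn.execute("stmt2");
        System.out.println(rsf1.isDone()); // nothing has been sent yet
        txn.execute();
        System.out.println(rsf1.join() + ", " + rsf2.join());
    }
}
```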

          iamaleksey Aleksey Yeschenko added a comment -

          > I also want to point out that Aleksey Yeschenko's response to global indexes (CASSANDRA-6477) was: "I think we should leave it to people's client code. We don't need more complexity on our read/write paths when this can be done client-side."

          > That combined with "alternatively, we just don't invent new unnecessary concepts (batch reads) to justify hypothetical things we could do that nobody asked us for" would leave us with absolutely no approach to achieve consistent cross-partition indexes through either client or server-side code.

          I'm okay with global indexes now (and it doesn't matter, really, because they are happening either way), so this is a non-argument.

          pbailis Peter Bailis added a comment -

          > I doubt this will be dramatically more complex, but the approach to implementation is fundamentally different. It seems to me supporting transactions of arbitrary size is an equally powerful win to consistent transactions.

          I agree "streaming" batches could be really useful. In effect, you're turning an operation you'd have to perform client-side (e.g., you can simulate "streaming" by simply buffering your write sets and then calling one big BATCH) into a server-assisted one (where your proposed read-buffer/memtable stores the pending inserts while you're still deciding what goes into the transaction). From the RAMP perspective, this doesn't change things substantially – you just have to make sure to propagate the appropriate txn metadata after you've determined what writes made it into the batch.
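
The client-side simulation mentioned above (buffer the writes, then issue one big BATCH) can be sketched as follows. The class name and the table names in the demo are hypothetical; only the `BEGIN BATCH ... APPLY BATCH;` wrapper is standard CQL.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side buffer that turns streamed writes into one CQL BATCH. */
class BufferedBatchSketch {
    private final List<String> statements = new ArrayList<>();

    /** Buffer one write; nothing is sent until the batch text is built and executed. */
    void add(String cqlStatement) {
        String trimmed = cqlStatement.trim();
        statements.add(trimmed.endsWith(";") ? trimmed : trimmed + ";");
    }

    /** Render all buffered writes as a single logged batch. */
    String toBatch() {
        StringBuilder cql = new StringBuilder("BEGIN BATCH\n");
        for (String statement : statements) {
            cql.append("  ").append(statement).append('\n');
        }
        return cql.append("APPLY BATCH;").toString();
    }

    public static void main(String[] args) {
        BufferedBatchSketch buffer = new BufferedBatchSketch();
        buffer.add("INSERT INTO ks.users (id, name) VALUES (1, 'a')");
        buffer.add("INSERT INTO ks.users_by_name (name, id) VALUES ('a', 1)");
        System.out.println(buffer.toBatch());
    }
}
```

The drawback this makes visible is that the whole write set lives in client memory until the end, which is what a server-assisted streaming batch would avoid.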

          Benedict: towards your point on non-QUORUM but QUORUM reads, I agree there are some cool tricks to play. There's some additional complexity in these optimizations, but the basic observation is a good one: if I already have a transaction ID I want to read from and the metadata associated with it, all I have to do is find the matching versions, which doesn't necessarily require QUORUM reads for "consistency" w.r.t. the ID.

          mbroecheler Matthias Broecheler added a comment -

          Regarding use cases for this feature, it would be highly useful for TitanDB (http://titan.thinkaurelius.com/). Titan denormalizes the data and maintains a number of 2i in order to expose a graph data model that supports efficient querying. We are seeing a number of use cases in health and finance where having atomic visibility is a requirement to avoid phenomena like "phantom vertices" and "half-edges".

          Titan already supports the notion of a transaction, so I experimented with some naive/limited approaches for building this on top of C*. While RAMP is much more sophisticated and better thought through, here is what I learned in case it helps (ignoring deletes). Appending a lot of metadata to columns had a pretty dramatic performance impact because Titan creates a lot of cells (wide rows). If you implement this in C* natively, the metadata wouldn't need to be returned to the client, but it would still bloat all data structures. More importantly, however, that overhead is always there and cannot be configured on a per-transaction basis. In our cases there is a mixture of transactions, few of which require the atomicity and most of which don't. My guess would be that for RAMP-Fast, whose storage overhead is linear in transaction size, similar issues would arise for databases with lots of small cells and large tx.

          Appending a unique transaction id (Titan assigns those) and maintaining a transaction log (we needed that anyway for a different reason) has little impact on the normal transactions whereas atomic read transactions paid extra read penalties. In spirit, that seems similar to RAMP-Small. To me, this approach is more desirable because the (big) performance penalty only applies to those transactions that need it.

          Again, these experiences are based on a different/naive implementation and with a particular work load consisting of many small cells.

          iamaleksey Aleksey Yeschenko added a comment -

          Linking to https://issues.apache.org/jira/browse/CASSANDRA-7489?focusedCommentId=14053825&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053825 to not repeat myself.

          TL;DR - logged batches are playing a different role than RAMP transactions are supposed to, so one can't replace another.

          This doesn't mean that we can't add RAMP transactions as a separate feature, of course. What I'm arguing for is that RAMP-powered global indexes and RAMP-powered materialized views do cover the vast majority of cases where you'd 'manually' use RAMP otherwise, and we shouldn't expose them as a standalone thing just because we can.

          benedict Benedict added a comment -

          logged batches are playing a different role than RAMP transactions are supposed to, so one can't replace another.

          Isn't the point of logged batches here to ensure that the trigger write succeeds if the write generating it does? In which case, can we not upgrade a trigger-involving write so that it is all part of the same RAMP transaction, which either commits or not as one?

          jjordan Jeremiah Jordan added a comment -

          What I'm arguing for is that RAMP-powered global indexes and RAMP-powered materialized views do cover the vast majority of cases where you'd 'manually' use RAMP otherwise, and we shouldn't expose them as a standalone thing just because we can.

          +1 I think we should start with adding RAMP to be used internally. Then once we have a better idea of what is going on with it, we can explore exposing it to the end user in a sane way.

          benedict Benedict added a comment -

          once we have a better idea of what is going on with it, we can explore exposing it to the end user in a sane way.

          I don't think anybody is suggesting we introduce a RAMP read transaction as part of this ticket (just that we bear it in mind for future). However replacing logged batches with RAMP transactions is absolutely the most sensible place to start in my book.

          jjordan Jeremiah Jordan added a comment -

          AFAICT doing:

          However replacing logged batches with RAMP transactions is absolutely the most sensible place to start in my book.

          Is going to require doing exactly this:

          I don't think anybody is suggesting we introduce a RAMP read transaction as part of this ticket (just that we bear it in mind for future).

          benedict Benedict added a comment -

          I don't see why?

          jbellis Jonathan Ellis added a comment -

          Because without read support RAMP doesn't actually give you anything more than logged batches. If I update tables X and Y in the same RAMP batch, then I do concurrent queries against X and Y, I have no isolation guarantees because the coordinator doesn't know to check both places.

          But, I think we can add read support pretty easily. "Just" extend BATCH syntax to allow SELECT. JDBC has had the concept of multiple resultsets for ages.

          jbellis Jonathan Ellis added a comment - - edited

          TL;DR - logged batches are playing a different role than RAMP transactions are supposed to, so one can't replace another.

          But RAMP transactions (with logged commit so we don't need to rely purely on "read repair" for coordinator failure mid-commit) give you a superset of logged batches. So I don't see a need to introduce new syntax for the former.

          iamaleksey Aleksey Yeschenko added a comment -

          (with logged commit so we don't need to rely purely on "read repair" for coordinator failure mid-commit)

          Right, with logged commit. I was assuming (hoping, really), that this would replace what we currently have, not merely extend it. But, yes, with logged commit we can "replace" the batchlog for the purposes of triggers.

          I'd also vote for making UNLOGGED the default (implicit) BATCH behavior, now that the LOGGED batches would cost even more than they do now.

          jbellis Jonathan Ellis added a comment -

          I'd also vote for making UNLOGGED the default (implicit) BATCH behavior, now that the LOGGED batches would cost even more than they do now.

          UNLOGGED is still a misfeature, so I don't see how the cost of RAMP affects our choice of default. (And for the record I think RAMP should definitely be the default; it matches users' assumptions so much better.)

          I guess we could add UN_ISOLATED to request logged-without-ramp though.

          jjordan Jeremiah Jordan added a comment -

          UNLOGGED is still a misfeature

          UNLOGGED is not always a misfeature. If I was doing batch writes to a single partition, I would make them unlogged. No point in having the overhead of a logged batch for that. But I would not make UNLOGGED the default.

          jbellis Jonathan Ellis added a comment -

          FTR, we transform single-partition batches to UNLOGGED automagically, since you are right; there is no point in the logging overhead there.

          jbellis Jonathan Ellis added a comment -

          There really isn't much of a use case for unlogged batches now that we have async drivers. So I'd rather keep logged/ramp the default.

          iamaleksey Aleksey Yeschenko added a comment -

          There really isn't much of a use case for unlogged batches now that we have async drivers. So I'd rather keep logged/ramp the default.

          Good point. Yes, you are right.

          tjake T Jake Luciani added a comment -

          I've been thinking about how to implement this and a couple ideas come to mind:

          • We would use the existing batchlog and use this as the prepare pass of the transaction (RAMP-Fast)
          • Since we will use TimeUUID as the timestamp we can also use this for the batchlog id
          • We add a way to find and read from the batchlog for a given batchlog id.
          • If the coordinator gets the results from two partitions and the timeuuids don't match it would read the later timeuuid from the batchlog and fix the data.
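          The four ideas above can be modelled with a small in-memory Python sketch. This is an illustration under loud assumptions, not Cassandra code: `Store`, `Version`, `prepare`, `commit` and `read_atomic` are hypothetical names, integers stand in for TimeUUIDs, a dict stands in for the batchlog, and the sibling-key set is the per-write metadata that RAMP-Fast attaches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    ts: int              # stand-in for the batch's TimeUUID / batchlog id
    value: str
    siblings: frozenset  # keys written by the same batch (RAMP-Fast metadata)


class Store:
    def __init__(self):
        self.batchlog = {}   # ts -> {key: Version}; plays the "prepare" pass
        self.committed = {}  # key -> latest visible Version

    def prepare(self, ts, writes):
        siblings = frozenset(writes)
        self.batchlog[ts] = {k: Version(ts, v, siblings) for k, v in writes.items()}

    def commit(self, ts, key):
        # Make one prepared write visible; newer timestamps win.
        version = self.batchlog[ts][key]
        current = self.committed.get(key)
        if current is None or current.ts < version.ts:
            self.committed[key] = version

    def read_atomic(self, keys):
        # Round 1: latest committed version of every requested key.
        first_round = {k: self.committed.get(k) for k in keys}
        result = dict(first_round)
        for k in keys:
            # Newest batch that, according to round 1, also wrote key k.
            needed = max(
                (v.ts for v in first_round.values() if v is not None and k in v.siblings),
                default=None,
            )
            have = first_round[k].ts if first_round[k] is not None else None
            if needed is not None and (have is None or have < needed):
                # Round 2: look the missing sibling write up by batchlog id.
                result[k] = self.batchlog[needed][k]
        return {k: (v.value if v is not None else None) for k, v in result.items()}


store = Store()
store.prepare(1, {"x": "x1", "y": "y1"})
store.commit(1, "x")
store.commit(1, "y")
store.prepare(2, {"x": "x2", "y": "y2"})
store.commit(2, "x")  # coordinator stalls before committing y

plain = {k: store.committed[k].value for k in ("x", "y")}
atomic = store.read_atomic(["x", "y"])
```

          The plain read sees `x2` next to `y1` (a fractured read), while `read_atomic` pulls `y2` out of the batchlog. The sketch only returns the repaired result; writing the fix back to the partition, as the last bullet proposes, is left out.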

          Some concerns:

          • Let's assume we query from partition A and B, and we see the results don't match timestamps. We would pull the latest batchlog assuming they are from the same batch, but let's say they in fact are not. In this case we wasted a lot of time, so my question is: should we only do this if the user supplies a new CL type? I think Peter was suggesting this in his preso (READ_ATOMIC).
          • In the case of a global index we plan on reading the data after reading the index. The data query might reveal the indexed value is stale. We would need to apply the batchlog and fix the index, would we then restart the entire query? or maybe overquery assuming some index values will be stale? Either way this query looks different than the above scenario.
          pbailis Peter Bailis added a comment -

          Let's assume we query from partition A and B, and we see the results don't match timestamps, we would pull the latest batchlog assuming they are from the same batch but let's say they in fact are not. In this case we wasted a lot of time so my question is should we only do this if the user supplies a new CL type?

          If you set the same, unique (e.g., UUID) write timestamp for all writes in a batch, then you know that any results with different timestamps are part of different batches. So, given mismatched timestamps, should you check the batchlog for pending writes? One solution is to always check (as in RAMP-Small). This doesn't require any extra metadata but, as you point out, also requires 2 RTTs. To cut down on these RTTs, you could also attach a Bloom filter of the items in each batch and only check any possibly missing writes (as in RAMP-Hybrid). (I can go into more detail if you want.) However, I agree that you might not want to pay these costs all of the time for reads. Would a BATCH_READ or other modifier to CQL SELECT statements make sense?
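          To make the Bloom-filter variant concrete, here is a minimal Python sketch of the decision "which keys need a second round?". It is a toy under stated assumptions: `BloomFilter`, `batch_filter` and `keys_to_recheck` are hypothetical names, the filter is a 64-bit toy built on `hashlib.sha256`, and integers stand in for the unique per-batch timestamps.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))


def batch_filter(keys):
    """Filter summarising the keys written by one batch."""
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key)
    return bloom


def keys_to_recheck(first_round):
    """first_round maps key -> (ts, bloom_of_its_batch).

    A key needs a second round only if some newer batch seen in the
    first round might also have written it.
    """
    recheck = set()
    for key, (ts, _) in first_round.items():
        for other_ts, other_bloom in first_round.values():
            if other_ts > ts and other_bloom.might_contain(key):
                recheck.add(key)
    return recheck


newer = batch_filter(["x", "y"])
older = batch_filter(["x", "y"])
mismatched = keys_to_recheck({"x": (2, newer), "y": (1, older)})
matched = keys_to_recheck({"x": (2, newer), "y": (2, newer)})
```

          With matching timestamps no second round is needed; with mismatched ones only `y` is rechecked. A false positive in the filter costs an unnecessary extra lookup but never a missed write.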

          In the case of a global index we plan on reading the data after reading the index. The data query might reveal the indexed value is stale. We would need to apply the batchlog and fix the index, would we then restart the entire query? or maybe overquery assuming some index values will be stale? Either way this query looks different than the above scenario.

          I think there are a few options. The easiest is to simply filter out the out of date rows, and then you are guaranteed to see a subset of the index entries. Alternatively, you could provide a "snapshot index read" where you read the older, overwritten values from the data node. If you want a "read latest and read snapshot" mode, there are some options I can describe, but they generally entail either more metadata or, otherwise, using locks/blocking coordination, which I don't think you want.
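          The first option (filter out the out-of-date rows) could look roughly like the following Python sketch. The data shapes are assumptions made for illustration: index entries are `(indexed_value, row_key)` pairs, base rows are plain dicts, and `filter_stale` is a hypothetical helper name.

```python
def filter_stale(index_entries, base_rows, column):
    """Keep only index hits whose base row still carries the indexed value.

    index_entries: iterable of (indexed_value, row_key) read from the index.
    base_rows: row_key -> row dict, as read from the data partitions afterwards.
    """
    live = []
    for indexed_value, row_key in index_entries:
        row = base_rows.get(row_key)
        if row is not None and row.get(column) == indexed_value:
            live.append(row_key)
    return live


index_hits = [("red", "k1"), ("red", "k2"), ("red", "k3")]
rows = {
    "k1": {"color": "red"},
    "k2": {"color": "blue"},  # index entry is stale: row was updated since
    # k3 is missing: the row was deleted or not yet visible
}
live_keys = filter_stale(index_hits, rows, "color")
```

          The caller gets a subset of the index entries, which matches the guarantee described above; rows that moved to the queried value but whose index entry is not yet visible are simply not returned.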

          cscetbon Cyril Scetbon added a comment -

          What's the current status of this ticket? Won't implement?

          iamaleksey Aleksey Yeschenko added a comment -

          What's the current status of this ticket? Won't implement?

          There is a dependency on having truly unique timestamps, a separate ticket - CASSANDRA-7919 - that is currently blocking this ticket.

          cscetbon Cyril Scetbon added a comment -

          Aleksey Yeschenko Thanks

          michaelsembwever mck added a comment -

          Bumping to fix version 4.x, as 3.11.0 is a bug-fix only release.
            ref https://s.apache.org/EHBy


            People

            • Assignee:
              Unassigned
              Reporter:
              tupshin Tupshin Harper
            • Votes:
              15
              Watchers:
              60
