Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      Supporting entity groups similar to App Engine's (that is, allowing rows to be part of a parent "entity group," whose key is used for routing instead of the row's own key) enables several improvements:

      • batches within an EG can be atomic across multiple rows
      • order-by-value queries within an EG only have to touch a single replica even with RandomPartitioner
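      To make the routing idea concrete, here is a hedged sketch (the class name, the "group:row" key layout, and the separator convention are all illustrative assumptions, not Cassandra APIs): derive the token from only the entity-group portion of the key, so every row in a group hashes to the same replica set.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: route by the entity-group prefix of a "group:row" key,
// so all rows in a group land on the same replica set even under a
// RandomPartitioner-style hashed token.
class GroupTokenSketch {
    static BigInteger tokenFor(String key) {
        // Use only the portion before ':' (the entity-group id) for hashing.
        int sep = key.indexOf(':');
        String group = sep >= 0 ? key.substring(0, sep) : key;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(group.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // non-negative token, MD5-based like RandomPartitioner
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is guaranteed to be present
        }
    }

    public static void main(String[] args) {
        // Two rows in the same group share a token; a different group does not.
        BigInteger a = tokenFor("user42:orders");
        BigInteger b = tokenFor("user42:profile");
        BigInteger c = tokenFor("user99:orders");
        System.out.println("same-group tokens equal: " + a.equals(b)); // prints "same-group tokens equal: true"
        System.out.println("cross-group tokens equal: " + a.equals(c));
    }
}
```

      With the token shared across the group, a batch addressed to the group key can be shipped to one replica set and applied there, which is what makes the atomic-batch and single-replica-query bullets above possible.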

        Issue Links

          Activity

          Jonathan Ellis added a comment -

          Should we add a special "row group" api?

          I really like how the composite PK model is shaking out over in CASSANDRA-2474. Feels like that's the right model for this too, conceptually. Implementation-wise, wide rows still come up short as noted above.

          I'm starting to think that the right way to implement this is to start to erase the distinction between row and column lookups. Which is basically where Stu was going in CASSANDRA-674.

          Jonathan Ellis added a comment -

          if there were more optimizations done on rows (allowed them to be even larger, etc.), would that be a better approach?

          I think it would be. That's definitely a long-term play, though. I only have ideas on how to fix some of the problems Sylvain raised. And then there are others like CASSANDRA-3362.

          But we kind of need to fix large rows independent of the entity group idea.

          Two use cases where same row does not work for us:

          Both of these sound like basically workarounds for weaknesses elsewhere. Which again feels like the right answer is to fix those weaknesses rather than adding another layer of hack on top.

          I guess there's really two questions here:

          • Should we add a special "row group" api?
          • What should the implementation look like?

          In other words, we could add a row group api and implement it in terms of large rows. Or implement it another way. But, we want wide rows that work "well" independent of row groups, so it feels like that's the right place to spend our efforts now.

          Daniel Doubleday added a comment -

          Two use cases where same row does not work for us:

          • Read/Write intense CFs where we need row caching but cannot cache all values due to their size (CASSANDRA-1956 in its current form will not help there)
          • Heavy update CFs where we use changing (versioned) row keys to avoid multiple-sstable-reads
          T Jake Luciani added a comment -

          Do we really need row groups now that we can have arbitrary nesting within a row via composite columns?

          What about secondary indexes? Unless we add composite secondary indexes.

          Ed Anuff added a comment -

          I agree with Sylvain's points. This does raise the question, though: if there were more optimizations done on rows (allowed them to be even larger, etc.), would that be a better approach? I'm personally all for that.

          Sylvain Lebresne added a comment -

          It is a good question, and I suppose it depends on what the motivation for row groups was in the first place (after all, we've always kind of been able to nest arbitrarily; we just have a (slightly) simpler way now).

          For instance, if the goal is to make sure rows are collocated, having to do it with composites may not be very convenient, in particular if you want to collocate rows across multiple CFs. Of course it is always possible to redesign the model so that you use the same row key and use composites, but that could be really weird. To "solve" that last part, we could provide the row group API but encode it server side with composites.

          However, I think we should be aware that pushing such an encoding has limitations today:

          • there is the same problem as with encoding super columns with composites, i.e. we'd need range tombstones.
          • rows have a number of subtle limitations that are fine, but may be a bit less fine if you start to push for collocating lots and lots of data under one row:
            • There is the 2B columns limit
            • If a row is > 2GB, it won't be mmapped
            • compaction is slower on big rows
            • performance can globally be less good on huge rows
            • leveled compaction has at least one row per sstable, which goes a bit against fixed-size sstables.

          Don't get me wrong, for most cases this is probably fine and we likely want to improve on all of this, but those are still obstacles to co-locating large amounts of data under the same row.

          Now maybe pushing the co-location of data is not a good idea for a distributed store (it obviously raises the question of load balancing in particular), but there are cases where careful co-location is paramount to getting the best performance, so giving a good tool for that could have value.

          Doing row groups 'natively' would avoid the gotchas above, but note that it has at least one drawback: if/once we do CASSANDRA-2893, isolation for row groups encoded with composites would be a given; with 'native' row groups we would have to work a bit.

          So overall, I think row groups could be interesting API-wise, making for more natural modeling in a number of cases. And if we think this is indeed useful, I kind of think doing it natively could be less of a headache than an encoding with composites overall.
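          To make the composite-encoding idea above concrete, a minimal sketch (the class and method names are hypothetical, and this mimics rather than uses Cassandra's CompositeType layout): each logical (rowKey, columnName) in a group becomes one composite column name under a single physical row keyed by the group id, which is exactly the kind of encoding that runs into the range-tombstone and big-row limitations listed above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "encode row groups with composites": logical row
// "order1", column "total" in group "user42" becomes one composite column
// name under the physical row key "user42". The wire layout mimics
// CompositeType: per component, a 2-byte length, the bytes, then an
// end-of-component byte.
class CompositeEncodingSketch {
    static ByteBuffer compose(String... components) {
        int size = 0;
        for (String c : components) {
            size += 2 + c.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        ByteBuffer bb = ByteBuffer.allocate(size);
        for (String c : components) {
            byte[] bytes = c.getBytes(StandardCharsets.UTF_8);
            bb.putShort((short) bytes.length); // component length
            bb.put(bytes);                     // component value
            bb.put((byte) 0);                  // end-of-component marker
        }
        bb.flip();
        return bb;
    }

    public static void main(String[] args) {
        // (rowKey="order1", columnName="total") under physical row "user42"
        ByteBuffer name = compose("order1", "total");
        System.out.println("encoded composite length: " + name.remaining()); // prints 17
    }
}
```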

          Jonathan Ellis added a comment -

          Do we really need row groups now that we can have arbitrary nesting within a row via composite columns? Looked at that way the row key itself becomes the "entity group id."

          Patricio Echague added a comment -

          +1 for "Row Groups"

          Dave Revell added a comment -

          It sounds like everyone agrees.

          +1 on Sylvain's idea to call them something other than "entity groups."

          Sylvain Lebresne added a comment -

          I'd add that it's pretty clear in my mind that we should end up calling them 'row groups' or something like that to avoid the confusion.

          Jonathan Ellis added a comment -

          By "like App Engine [megastore]" we only mean "atomic within a group," not "consistent and isolated within a group." The former is useful even without the latter.

          Dave Revell added a comment -

          As jbellis says in the description, atomic batches and entity group locality are cool and useful. But we should be clear that Cassandra's entity groups would be a different beast than Megastore's entity groups, and wouldn't have the same consistency properties unless some un-Cassandra-like changes were made.

          In Megastore, transactions can maintain arbitrary consistency constraints among items in an entity group, since there is a Paxos-agreed total order of transactions. Cassandra has so far avoided fancy distributed agreement like this. For example, imagine running (in Cassandra) two different transactions on two different replicas and imagine what mishmash of the two outcomes you'd get once timestamp-based conflict resolution happened. In Megastore one of the transactions would abort. Are we willing to add Paxos?
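          To illustrate that mishmash concretely, a toy sketch (names are hypothetical; this is just per-column timestamp-based last-write-wins reconciliation, the general approach Cassandra uses, not actual Cassandra code): merging two concurrent "transactions" produces a row that belongs to neither of them.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of last-write-wins reconciliation: each column keeps the cell
// with the highest timestamp, independently of the others. Two concurrent
// multi-column "transactions" can therefore interleave, something a
// Megastore-style Paxos-ordered log would forbid (one would abort instead).
class LwwMergeSketch {
    record Cell(String value, long timestamp) {}

    static Map<String, Cell> merge(Map<String, Cell> a, Map<String, Cell> b) {
        Map<String, Cell> out = new HashMap<>(a);
        // For each column, keep the higher-timestamped cell (ties keep the existing one).
        b.forEach((col, cell) ->
            out.merge(col, cell, (x, y) -> x.timestamp() >= y.timestamp() ? x : y));
        return out;
    }

    public static void main(String[] args) {
        // Txn1 wrote balance=100 (ts 1) and status=ok (ts 3);
        // Txn2 wrote balance=0 (ts 2) and status=closed (ts 2).
        Map<String, Cell> txn1 = Map.of("balance", new Cell("100", 1), "status", new Cell("ok", 3));
        Map<String, Cell> txn2 = Map.of("balance", new Cell("0", 2), "status", new Cell("closed", 2));
        Map<String, Cell> merged = merge(txn1, txn2);
        // The merged row mixes the two: balance from txn2, status from txn1.
        System.out.println(merged.get("balance").value() + " / " + merged.get("status").value()); // prints "0 / ok"
    }
}
```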

          G-Store's ownership transfer protocol also seems very anti-Cassandra-philosophy with its concept of single-replica item ownership.

          I'd be happy to be corrected on any of this. I think Megastore-like entity groups are an exciting idea but perhaps make more sense on top of HBase.

          Sylvain Lebresne added a comment -

          Do tokens have to be one-to-one unique with keys, or could you have multiple keys share the same token? (apparently that's currently possible, although an extreme edge case, with the RandomPartitioner)

          Right now, they do have to be one-to-one. That's the 'raison d'être' of CASSANDRA-1034 (and I won't hide that my interest in the latter is motivated by this ticket, even though we should fix it because of RandomPartitioner anyway).

          As for this ticket, I think using parts of the key for the token is only the first step (but an important one). The main thing we want here is to apply mutation on an entity group consistently, that is in one commit log transaction. That in turn is not very complicated in theory, but will be much more work in practice I believe.

          As a side note, I think it would also be nice to find "a trick" to make this work with the existing partitioners. Otherwise, since we can't change partitioners, this would make this useful for only new clusters, which would be sad.

          Ed Anuff added a comment -

          This is something I've been thinking about while consolidating the number of column families within an application, so that I ended up with row keys that were constructed by concatenating an entity id with various other strings (e.g. 9081bd70-3fe4-11e0-9207-0800200c9a66:something). Is it feasible to have a partitioner that hashed on just the first x bytes in a key? Do tokens have to be one-to-one unique with keys, or could you have multiple keys share the same token? (apparently that's currently possible, although an extreme edge case, with the RandomPartitioner)
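          One way to read that question as code (a hedged sketch; prefixToken is hypothetical and not any real Cassandra partitioner API): hash only the first x bytes of the key, which deliberately makes distinct keys share a token.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch: token from only the first x bytes of the key, so
// "uuid:orders" and "uuid:profile" share a token. This deliberately relaxes
// the one-to-one key/token property discussed here (cf. CASSANDRA-1034).
class PrefixPartitionerSketch {
    static BigInteger prefixToken(String key, int x) {
        byte[] raw = key.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = Arrays.copyOf(raw, Math.min(x, raw.length));
        try {
            // MD5 over the prefix only, in the spirit of RandomPartitioner
            byte[] digest = MessageDigest.getInstance("MD5").digest(prefix);
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is guaranteed to be present
        }
    }

    public static void main(String[] args) {
        String uuid = "9081bd70-3fe4-11e0-9207-0800200c9a66";
        // A textual UUID is 36 bytes, so x=36 hashes only the entity id.
        boolean same = prefixToken(uuid + ":orders", 36)
                .equals(prefixToken(uuid + ":profile", 36));
        System.out.println("same token: " + same); // prints "same token: true"
    }
}
```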

          Edward Ribeiro added a comment -

          CIDR 2011 Megastore paper: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf

          Any development already started on this issue?

          Gary Dusbabek added a comment -

          Would they be static like App Engine, or would we permit dynamically adding/subtracting existing rows to an entity group, in effect, moving them?

          The G-Store paper explains one approach to this: www.cs.ucsb.edu/~sudipto/papers/socc10-das.pdf


            People

            • Assignee: Unassigned
            • Reporter: Jonathan Ellis
            • Votes: 11
            • Watchers: 20


                Time Tracking

                • Original Estimate: 80h
                • Remaining Estimate: 80h
                • Time Spent: Not Specified
