Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: API
    • Labels:
      None

      Description

      CQL 3.0 currently has support for defining wide rows by declaring a composite primary key. For example:

      CREATE TABLE timeline (
          user_id varchar,
          tweet_id uuid,
          author varchar,
          body varchar,
          PRIMARY KEY (user_id, tweet_id)
      );
      

      It would also be useful to manage sharding a wide row through the CQL schema. This would require being able to split up the actual row key in the schema definition. In the above example you might want to make the row key a combination of user_id and day_of_tweet, in order to shard timelines by day. This might look something like:

      CREATE TABLE timeline (
          user_id varchar,
          day_of_tweet date,
          tweet_id uuid,
          author varchar,
          body varchar,
          PRIMARY KEY (user_id REQUIRED, day_of_tweet REQUIRED, tweet_id)
      );
      

      That's probably a terrible attempt at how to structure this in CQL, but I think I've gotten the point across. I tagged this for CQL 3.0, but I'm honestly not sure how much work it would be; as far as I know, built-in support for composite keys is limited.

        Activity

        Christoph Tavan added a comment -

        We're facing the row-key sharding problem as well, and our workaround is concatenating and splitting text-type row keys in our application logic (who doesn't work around it like that?). However, that feels somewhat hacky given that we can use a true CompositeType for the column names, which is a huge step forward in CQL 3.

        Since C* allows row keys of CompositeType, I think it would be a nice feature to have them supported through CQL as well, since it would remove this concatenation logic from the application and put it into the data model where it belongs.

        So +1 for some solution to this from my side.

        On the syntax, maybe the discussion in CASSANDRA-4004 is also relevant. Personally, I think that adding an attribute, as you suggest, to the fields that should be part of the row key in the PRIMARY KEY() statement would be fine.
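        The application-side workaround described above, composing and splitting a text row key from user_id and a day bucket, might look roughly like the following sketch (the separator character, bucket format, and function names are illustrative assumptions, not from this ticket):

```python
from datetime import datetime, timezone

SEPARATOR = ":"  # assumed delimiter; any character never used in user ids works


def make_row_key(user_id: str, tweet_time: datetime) -> str:
    """Compose the sharded row key the application writes to."""
    day_bucket = tweet_time.strftime("%Y%m%d")
    return f"{user_id}{SEPARATOR}{day_bucket}"


def split_row_key(row_key: str) -> tuple[str, str]:
    """Recover (user_id, day_bucket) from a stored row key."""
    user_id, day_bucket = row_key.rsplit(SEPARATOR, 1)
    return user_id, day_bucket


# Example: one user's timeline, sharded by day
key = make_row_key("jbellis", datetime(2012, 4, 30, tzinfo=timezone.utc))
# key == "jbellis:20120430"
```

        This is exactly the concatenation logic that a composite row key in CQL would move out of the application and into the data model.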

        Sylvain Lebresne added a comment -

        I've been meaning to open a similar ticket for some time now but forgot. I've actually created it as CASSANDRA-4179, which also basically suggests adding support for composites in the row key. However, I decided to open a separate ticket because:

        1. I didn't mean CASSANDRA-4179 to be specific to sharding, and in particular I want to discuss there the question of composites in column values.
        2. I think that a nice syntax for composites in the row key is indeed useful for sharding very wide rows, but maybe it could be worth going even further. What I mean here is that sharding a time series is very common, so we could imagine making that sharding more automatic. For instance (using a syntax I haven't given much thought to, but reusing one of my syntax suggestions from CASSANDRA-4179), we could have:
          CREATE TABLE timeline (
              user_id varchar,
              day_of_tweet date AUTO(day(tweet_id)),
              tweet_id uuid,
              author varchar,
              body varchar,
              GROUP (user_id, day_of_tweet) as key,
              PRIMARY KEY (key, tweet_id)
          );
          

          for which the semantics would be that day_of_tweet is automatically calculated from tweet_id.

        I'll admit it's a bit specific in a way, and clearly we could say we leave that to the client, but time series is a very, very common use case for Cassandra, and sharding rows at some granularity is very often needed, so...

        Anyway, my suggestion would be to keep the 'composites in row key' discussion in CASSANDRA-4179 and maybe discuss deeper support for row sharding here.

        Christoph Tavan added a comment -

        I think the suggestion to discuss sharding and composite row keys/values in a different issue is good.

        If we want even more specific/dedicated support for sharding, a different syntax idea (borrowed from MySQL) might be:

        CREATE TABLE timeline (
            user_id varchar,
            tweet_id uuid,
            author varchar,
            body varchar,
            PRIMARY KEY (user_id, tweet_id)
        ) PARTITION BY HASH(user_id, DAY(tweet_id)) PARTITIONS 10;
        

        It would read very straightforwardly IMO, but maybe that's already too high-level?
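        Absent such syntax, the PARTITION BY HASH(user_id, DAY(tweet_id)) PARTITIONS 10 idea can be emulated client-side with a stable hash over the partitioning expressions (a sketch; the MD5-based hash, separator, and bucket count are illustrative assumptions):

```python
import hashlib


def partition_bucket(user_id: str, day: str, partitions: int = 10) -> int:
    """Map (user_id, DAY(tweet_id)) onto one of `partitions` buckets,
    analogous to PARTITION BY HASH(...) PARTITIONS 10."""
    digest = hashlib.md5(f"{user_id}|{day}".encode("utf-8")).digest()
    # Use the first 8 bytes as a stable integer, then take it modulo
    # the bucket count.
    return int.from_bytes(digest[:8], "big") % partitions


# The bucket number then becomes part of the row key the application
# reads and writes:
bucket = partition_bucket("jbellis", "2012-04-30")
row_key = f"jbellis|2012-04-30|{bucket}"
```

        A cryptographic hash is overkill here; it is used only because it is stable across processes, unlike Python's built-in hash() with string hash randomization.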

        Ahmet AKYOL added a comment - edited

        +1 on 'separation of concerns'. It has nothing to do with composite keys.

        On the other hand, as a C* user (a freeloader, not a talented committer like you guys), I do not like the sound of any user-side (like CQL 3) solution for sharding wide rows, because with this approach users have to think about "sharding" themselves for many of their CFs.

        The problem here is row size again, as in CASSANDRA-3929, and IMHO the same solution (a compaction strategy, maybe with some extras like chaining) can be used here. I'm sure it may make things more complicated on the Cassandra (internals) side, but it's better for users.

        In fact, the real problem is the "very very wide rows" as mentioned. Partitioning by hash (or any other automatic approach) may cause efficiency problems for the common case ("not very wide rows") due to unnecessary sharding.

        So, please attack the "very very wide rows" problem and, if possible, find a configurable solution (like a "wide row sharding size hint: 10 MB") without CQL 3.

        P.S.: I'll also admit that "automated sharding for time series" can be good enough for some use cases, but not all of them. So this issue still makes sense, not as "sharding very wide rows" but as "automated sharding for time series".

        Jonathan Ellis added a comment -

        CASSANDRA-4285 is a use case for this. I suggest implementing that "the hard way" first and then seeing what re-usable patterns we can extract.

        Jonathan Ellis added a comment -

        I think CASSANDRA-4179 solved the main problem here, the "I need a composite partition key" part.

        Let's close this and let actual use cases drive further enhancements. (I note that 4285 ended up not needing anything of the sort, contradicting my last comment above.)


          People

          • Assignee: Unassigned
          • Reporter: Nick Bailey
          • Votes: 3
          • Watchers: 6
