Cassandra
  1. Cassandra
  2. CASSANDRA-3761 CQL 3.0
  3. CASSANDRA-4179

Add more general support for composites (to row key, column value)

    Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 1
    • Component/s: API
    • Labels:

      Description

      Currently CQL3 have a nice syntax for using composites in the column name (it's more than that in fact, it creates a whole new abstraction but let's say I'm talking implementation here). There is however 2 other place where composites could be used (again implementation wise): the row key and the column value. This ticket proposes to explore which of those make sense for CQL3 and how.

      For the row key, I really think that CQL support makes sense. It's very common (and useful) to want to stuff composite information in a row key. Sharding a time serie (CASSANDRA-4176) is probably the best example but there is other.

      For the column value it is less clear. CQL3 makes it very transparent and convenient to store multiple related values into multiple columns so maybe composites in a column value is much less needed. I do still see two cases for which it could be handy:

      1. to save some disk/memory space, if you do know it makes no sense to insert/read two value separatly.
      2. if you want to enforce that two values should not be inserted separatly. I.e. to enforce a form of "constraint" to avoid programatic error.

      Those are not widely useful things, but my reasoning is that if whatever syntax we come up for "grouping" row key in a composite trivially extends to column values, why not support it.

      As for syntax I have 3 suggestions (that are just that, suggestions):

      1. If we only care about allowing grouping for row keys:
        CREATE TABLE timeline (
            name text,
            month int,
            ts timestamp,
            value text,
            PRIMARY KEY ((name, month), ts)
        )
        
      2. A syntax that could work for both grouping in row key and colum value:
        CREATE TABLE timeline (
            name text,
            month int,
            ts timestamp,
            value1 text,
            value2 text,
            GROUP (name, month) as key,
            GROUP (value1, value2),
            PRIMARY KEY (key, ts)
        )
        
      3. An alternative to the preceding one:
        CREATE TABLE timeline (
            name text,
            month int,
            ts timestamp,
            value1 text,
            value2 text,
            GROUP (name, month) as key,
            GROUP (value1, value2),
            PRIMARY KEY (key, ts)
        ) WITH GROUP (name, month) AS key
           AND GROUP (value1, value2)
        
      1. 4179.txt
        57 kB
        Sylvain Lebresne
      2. 4179-2.txt
        64 kB
        Sylvain Lebresne

        Issue Links

          Activity

          Hide
          Jonathan Ellis added a comment -

          It's actually exactly like a "real" PK; you can have exactly one row (collection of cells) per combination of those columns.

          Show
          Jonathan Ellis added a comment - It's actually exactly like a "real" PK; you can have exactly one row (collection of cells) per combination of those columns.
          Hide
          Edward Capriolo added a comment -

          I see that I am probably too late to this discussion, but did want to point out we may be mis-using the term primary-key here. In my mind the CQL3 compound primary key is more like the M$ SQL Cluster index, which controls sorting.

          This syntax might do more to say "These fields are the primary key" and "these fields are the ones we sort by".

          CREATE TABLE timeline (
              name text,
              month int,
              ts timestamp,
              value text,
              PRIMARY KEY (name, month),
              clustered index (name,month,tx)
          )
          
          Show
          Edward Capriolo added a comment - I see that I am probably too late to this discussion, but did want to point out we may be mis-using the term primary-key here. In my mind the CQL3 compound primary key is more like the M$ SQL Cluster index, which controls sorting. This syntax might do more to say "These fields are the primary key" and "these fields are the ones we sort by". CREATE TABLE timeline ( name text, month int, ts timestamp, value text, PRIMARY KEY (name, month), clustered index (name,month,tx) )
          Hide
          Sylvain Lebresne added a comment -

          Yes, sorry, I though I had push the fix yesterday but apparently I forgot the final push. Done in 4bc873b79842c1a4debaa0b3da4a.

          Show
          Sylvain Lebresne added a comment - Yes, sorry, I though I had push the fix yesterday but apparently I forgot the final push. Done in 4bc873b79842c1a4debaa0b3da4a.
          Hide
          Jonathan Ellis added a comment -

          Added 844b9c46c9ceb08a52da7738c8312e8c0b599737 to fix the build, but after that CliTest shows a bunch of errors that it does not prior to this.

          Show
          Jonathan Ellis added a comment - Added 844b9c46c9ceb08a52da7738c8312e8c0b599737 to fix the build, but after that CliTest shows a bunch of errors that it does not prior to this.
          Hide
          Sylvain Lebresne added a comment -

          Committed (with nits fixed), thanks

          Show
          Sylvain Lebresne added a comment - Committed (with nits fixed), thanks
          Hide
          Pavel Yaskevich added a comment -

          +1,

          Nits:
          is should be changed to if in comment in CFMetaData // We only return the alias is only one is set ...
          CFMetaData.validate() should have if statement aligned in to for loop when checking for "Cannot have key alias equals to column name ..."

          Show
          Pavel Yaskevich added a comment - +1, Nits: is should be changed to if in comment in CFMetaData // We only return the alias is only one is set ... CFMetaData.validate() should have if statement aligned in to for loop when checking for "Cannot have key alias equals to column name ..."
          Hide
          Sylvain Lebresne added a comment -

          Attaching v2 that handle the token syntax. As said previous, I think that the OR support is way out of scope for this issue so I'd say this is good for review.

          Show
          Sylvain Lebresne added a comment - Attaching v2 that handle the token syntax. As said previous, I think that the OR support is way out of scope for this issue so I'd say this is good for review.
          Hide
          Sylvain Lebresne added a comment -

          But if we can figure out how to parse and recognize the OR syntax that would be my preference

          Mine too, especially given it's reasonably general and might have some other uses than this one. I will require a fair amount of work to support however (I'm not too afraid about parsing but SelectStatement will probably need a good make over; though I was meaning to refactor it at some point anyway because it starts to be ugly, so that can be a good occasion). I'll definitely make that a separate issue

          SELECT * FROM timeline WHERE token(name, month) > token('bar', 4)

          LGTM.

          Ok, I'll work that out.

          Show
          Sylvain Lebresne added a comment - But if we can figure out how to parse and recognize the OR syntax that would be my preference Mine too, especially given it's reasonably general and might have some other uses than this one. I will require a fair amount of work to support however (I'm not too afraid about parsing but SelectStatement will probably need a good make over; though I was meaning to refactor it at some point anyway because it starts to be ugly, so that can be a good occasion). I'll definitely make that a separate issue SELECT * FROM timeline WHERE token(name, month) > token('bar', 4) LGTM. Ok, I'll work that out.
          Hide
          Jonathan Ellis added a comment -

          since we decided to not support it for composite comparators, it's consistent to just say we don't support it here

          +1

          if we have more than one key alias, then we don't return any to thrift

          sgtm.

          there is no way to do a 'IN' on the full row key

          Hmm, that's a tricky one. The most natural way to express this in SQL would be

          SELECT * FROM timeline WHERE (name = 'foo' AND month = 1) OR (name = 'bar' AND month = 4)
          

          or possibly

          SELECT * FROM timeline WHERE name = 'foo' AND month = 1 
          UNION ALL
          SELECT * FROM timeline WHERE name = 'bar' AND month = 4
          

          I think the original argument against supporting UNION still holds – i.e., that it's too general, and allows queries against completely different indexes or even different tables, to be smashed together, which is not our goal. But if we can figure out how to parse and recognize the OR syntax that would be my preference.

          SELECT * FROM timeline WHERE token(name, month) > token('bar', 4)

          LGTM.

          Show
          Jonathan Ellis added a comment - since we decided to not support it for composite comparators, it's consistent to just say we don't support it here +1 if we have more than one key alias, then we don't return any to thrift sgtm. there is no way to do a 'IN' on the full row key Hmm, that's a tricky one. The most natural way to express this in SQL would be SELECT * FROM timeline WHERE (name = 'foo' AND month = 1) OR (name = 'bar' AND month = 4) or possibly SELECT * FROM timeline WHERE name = 'foo' AND month = 1 UNION ALL SELECT * FROM timeline WHERE name = 'bar' AND month = 4 I think the original argument against supporting UNION still holds – i.e., that it's too general, and allows queries against completely different indexes or even different tables, to be smashed together, which is not our goal. But if we can figure out how to parse and recognize the OR syntax that would be my preference. SELECT * FROM timeline WHERE token(name, month) > token('bar', 4) LGTM.
          Hide
          Sylvain Lebresne added a comment -

          Attaching patch for row keys only using the tuple syntax (i.e. the first in the description above). Relating to my 2 points above:

          • the number of components of the row key cannot be extended.
          • thrift is not modified and if the key is composite, i.e. if we have more than one key alias, then we don't return any to thrift.

          Other than that, the patch has a few limitations/missing parts:

          • Taking the example in the description, you can do
            SELECT * FROM timeline WHERE name = 'foo' AND month IN (1, 4, 8)
            

            but there is no way to do a 'IN' on the full row key, mostly because we're lacking the syntax to express it. I.e. we would need something like:

            SELECT * FROM timeline WHERE (name, month) IN (('foo', 1), ('bar', 4))
            

            Now that is probably not a blocker per se for the patch to be committed, but I think we should at least figure out a syntax because that kind of 'IN' query can be convenient when you do manual indexing so imo it is worth supporting it (I know there is the old argument that IN query are not really useful and instead client drivers should handle the parallelism themselves, but I don't really agree, especially not for CQL3 where one of the goal is to remove burden from the client driver).

          • For a very similar reason, the token() function from CASSANDRA-3771 is not supported on composite keys. One similar syntax for that could be
            SELECT * FROM timeline WHERE token(name, month) > token('bar', 4)
            SELECT * FROM timeline WHERE token(name, month) > "19389135324729031"
            

            Opinion?

          Show
          Sylvain Lebresne added a comment - Attaching patch for row keys only using the tuple syntax (i.e. the first in the description above). Relating to my 2 points above: the number of components of the row key cannot be extended. thrift is not modified and if the key is composite, i.e. if we have more than one key alias, then we don't return any to thrift. Other than that, the patch has a few limitations/missing parts: Taking the example in the description, you can do SELECT * FROM timeline WHERE name = 'foo' AND month IN (1, 4, 8) but there is no way to do a 'IN' on the full row key, mostly because we're lacking the syntax to express it. I.e. we would need something like: SELECT * FROM timeline WHERE (name, month) IN (('foo', 1), ('bar', 4)) Now that is probably not a blocker per se for the patch to be committed, but I think we should at least figure out a syntax because that kind of 'IN' query can be convenient when you do manual indexing so imo it is worth supporting it (I know there is the old argument that IN query are not really useful and instead client drivers should handle the parallelism themselves, but I don't really agree, especially not for CQL3 where one of the goal is to remove burden from the client driver). For a very similar reason, the token() function from CASSANDRA-3771 is not supported on composite keys. One similar syntax for that could be SELECT * FROM timeline WHERE token(name, month) > token('bar', 4) SELECT * FROM timeline WHERE token(name, month) > "19389135324729031" Opinion?
          Hide
          Sylvain Lebresne added a comment -

          Looking at that more closely, there is a few details on which to decide. At least:

          • Do we care about allowing to add new columns to a composite row key? I guess since we decided to not support it for composite comparators, it's consistent to just say we don't support it here. But I'll just note that if we want to support it that's not trivial because row key with one element won't be composite underneath so we won't be able to "extend" them.
          • What about thrift? This require changing key_alias to be key_aliases. Should we just say that from thrift you can only see/set the first key alias even if there are more than one? If we do that, working with those table from thrift will be more difficult though. That is really the same problem that discussed in CASSANDRA-4377, do we care than thrift client might not have enough information to work with CQL3 table correctly?
          Show
          Sylvain Lebresne added a comment - Looking at that more closely, there is a few details on which to decide. At least: Do we care about allowing to add new columns to a composite row key? I guess since we decided to not support it for composite comparators, it's consistent to just say we don't support it here. But I'll just note that if we want to support it that's not trivial because row key with one element won't be composite underneath so we won't be able to "extend" them. What about thrift? This require changing key_alias to be key_aliases. Should we just say that from thrift you can only see/set the first key alias even if there are more than one? If we do that, working with those table from thrift will be more difficult though. That is really the same problem that discussed in CASSANDRA-4377 , do we care than thrift client might not have enough information to work with CQL3 table correctly?
          Hide
          Jonathan Ellis added a comment -

          I'd like to use this for CASSANDRA-4285, and I'm fine with the "tuple" syntax for partition key compositing. Is this reasonable to do in the next week or two?

          Show
          Jonathan Ellis added a comment - I'd like to use this for CASSANDRA-4285 , and I'm fine with the "tuple" syntax for partition key compositing. Is this reasonable to do in the next week or two?
          Hide
          Jonathan Ellis added a comment -

          Composite in the key seems a lot more useful to me moving forward... I'd lean towards saying "declare your value a blob and destructure it client-side" for values – which is what upgraders will be doing already, so it's not a big deal in that respect.

          Show
          Jonathan Ellis added a comment - Composite in the key seems a lot more useful to me moving forward... I'd lean towards saying "declare your value a blob and destructure it client-side" for values – which is what upgraders will be doing already, so it's not a big deal in that respect.
          Hide
          Sylvain Lebresne added a comment -

          On the 'does support for composite in the column value make sense' question, I'll add that supporting them would potentially make for a nicer upgrade for people coming from thrift/CQL2 that already use composites as the defaultValidator. Not sure if that's really common though, and we should add support just because of that, but that's still a additional slight pro for adding it.

          Show
          Sylvain Lebresne added a comment - On the 'does support for composite in the column value make sense' question, I'll add that supporting them would potentially make for a nicer upgrade for people coming from thrift/CQL2 that already use composites as the defaultValidator. Not sure if that's really common though, and we should add support just because of that , but that's still a additional slight pro for adding it.
          Hide
          Sylvain Lebresne added a comment -

          PS: I know that this is related to CASSANDRA-4176 but see my (first) comment there.

          Show
          Sylvain Lebresne added a comment - PS: I know that this is related to CASSANDRA-4176 but see my (first) comment there.

            People

            • Assignee:
              Sylvain Lebresne
              Reporter:
              Sylvain Lebresne
              Reviewer:
              Pavel Yaskevich
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development