Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0
    • Component/s: API
    • Labels:

      Description

      This ticket is a reformulation/generalization of CASSANDRA-2474. The core change of CQL 3.0 is to introduce the new syntaxes that were discussed in CASSANDRA-2474 that allow to:

      1. Provide a better/more native support for wide rows, using the idea of transposed vie.
      2. The generalization to composite columns.

      The attached text file create_cf_syntaxes.txt recall the new syntaxes introduced.

      The changes proposed above allow (and strongly suggest in some cases) a number of other changes to the language that this ticket proposes to explore/implement (more details coming in the comments).

      1. 0001-CQL-3.0.patch
        223 kB
        Sylvain Lebresne
      2. 0001-CQL-3.0-v2.patch
        225 kB
        Sylvain Lebresne
      3. 0002-Add-support-for-switching-the-CQL-version.patch
        23 kB
        Sylvain Lebresne
      4. 0002-Add-support-for-switching-the-CQL-version-v2.patch
        23 kB
        Sylvain Lebresne
      5. 0003-Makes-batches-atomic.patch
        6 kB
        Sylvain Lebresne
      6. 0003-Makes-batches-atomic-v2.patch
        6 kB
        Sylvain Lebresne
      7. 0004-Thrift-gen-files.patch
        80 kB
        Sylvain Lebresne
      8. 0004-Thrift-gen-files-v2.patch
        80 kB
        Sylvain Lebresne
      9. 0005-Rebase-after-CASSANDRA-1391.patch
        21 kB
        Sylvain Lebresne
      10. cql_tests.py
        16 kB
        Sylvain Lebresne
      11. create_cf_syntaxes.txt
        2 kB
        Sylvain Lebresne
      1.
      KEY IN (...) queries do not work Sub-task Resolved Sylvain Lebresne
       
      2.
      need forked language document Sub-task Resolved Sylvain Lebresne
       
      3.
      cassandra defaults to CQLv3 Sub-task Resolved Unassigned
       
      4.
      Secondary indexes support for wide rows in CQL 3.0 Sub-task Resolved Sylvain Lebresne
       
      5.
      Add Support for Composite Secondary Indexes Sub-task Resolved Sylvain Lebresne
       
      6.
      Add 'null' support to CQL 3.0 Sub-task Resolved Michał Michalski
       
      7.
      Add KeyspaceChange CqlResultType Sub-task Resolved satish babu krishnamoorthy
       
      8.
      Support query by names for compact CF Sub-task Resolved Sylvain Lebresne
       
      9.
      Allow paging through non-ordered partitioner results in CQL3 Sub-task Resolved Sylvain Lebresne
       
      10.
      Support slice with exclusive start and stop Sub-task Resolved Sylvain Lebresne
       
      11.
      CQL count() needs paging support Sub-task Resolved Unassigned
       
      12.
      Dropping a column should do more than just remove the definition Sub-task Resolved Aleksey Yeschenko
       
      13.
      ORDER BY syntax Sub-task Resolved Sylvain Lebresne
       
      14.
      Consider providing error code with exceptions (and documenting them) Sub-task Resolved Sylvain Lebresne
       
      15.
      Explore not returning range ghosts Sub-task Resolved Sylvain Lebresne
       
      16.
      cqlsh support for CQL 3 Sub-task Resolved paul cannon
       
      17.
      Add support for ReversedType Sub-task Resolved Sylvain Lebresne
       
      18.
      Allow updating column_alias types Sub-task Resolved Pavel Yaskevich
       
      19.
      Support for sharding wide rows in CQL 3.0 Sub-task Resolved Unassigned
       
      20.
      Add more general support for composites (to row key, column value) Sub-task Resolved Sylvain Lebresne
       
      21.
      Easy access to column timestamps (and maybe ttl) during queries Sub-task Resolved Sylvain Lebresne
       

        Activity

        Hide
        Sylvain Lebresne added a comment -

        Attaching patches. I'll try to provide the rational for the changes introduces (on top of basic support for the syntaxes from create_cf_syntaxes.txt). All feedback/comments on those will be greatly appreciated, I'm not trying to impose anything, but I do think after much reflexion that those choices are the right way to move forward.

        First, on the backward compatibility issue: the changes proposed are breaking changes (as indicated by the major version bump). The patch creates a new java package cql3 that is completely separate from the cql package. So for a time both cql 2.0 and 3.0 will be supported. To make that work, the patch adds a new thrift method set_cql_version(String) to allow setting the version during the client session (in the second patch). The default version in the patch is 3.0 but it's only because it makes testing easier until drivers adds support for this new method. Based on the discussion we had on CASSANDRA-2474, what I propose is that for C* 1.1 we add only 3.0 as a beta/demonstration version with 2.0 still being the default. If everything goes well, in C* 1.2, cql 3.0 will become the default and 2.0 will be supported by deprecated.

        One change the patches does is to make static CF (i.e, the ones that don't use COMPACT STORAGE) really static, i.e. adding non defined columns is not supported. The reasons are numerous (I'll probably even forget some):

        • If we were to allow non defined columns in static CF, we would pretty much allow to use a static CF as a wide row. So we would have to support doing a slice on those static CF. But that is kind of contradictory with the introduction of the transposed idea. And if we don't allow slices on static CF, what is the point of allowing random columns?
        • It makes the code simpler. In particular it avoids complication with sparse composites.
        • It's helpful to users. If you use a static CF, we tell you when you do a typo inserting a column (I do think it's a very useful thing).
        • It's not a limitation. If you need a new column, you can do an ALTER ADD, it's cheap. And if your columns names are really "random", then what you really want is a transposed/compact CF anyway.
        • It means it makes sense to limit prepared markers to the right side of a relation. That in turns allows to do a bit more work during preparation and make stuff like CASSANDRA-3753 possible/easy.
        • It make the language much closer to SQL. Don't get me wrong, I don't like SQL all that much, but CQL does reuses SQL syntax and core concepts. Making it easier on all the people that know SQL is a good thing provided there is no downside to do it. And in that case I don't think there is one, outside of breaking compatibility with CQL 2.0, which will be broken anyway.

        Another thing the patch does is to add consistency to our handling of case-sensitivity for column names. What I mean here is that currently, when you declare:

        CREATE TABLE (
            MyKey text PRIMARY KEY,
            Column1 int,
            Column2 int,
        )
        

        then MyKey is case-insensitive and Column1 and Column2 are case-sensitive. We should fix that inconsistency. The patch makes the choice to reuse the way SQL (at least PostgreSQL) deal with this: all definition names (MyKey, Column1 and Column2 in my example) are case insensitive by default (they are lowercased basically) but you can specify a case sensitive one using double-quotes. The rational is that with the static is static idea above, we are in a case very similar to SQL and so I didn't see a very good reason to do things differently (again, except for the issue of backward compatibility). Note that in the wide row (transposed) case, the C* column name is not a 'definition name', so that rule won't apply. For consistency sake, keyspace and column family names also follow the same rule.

        We could alternatively make everything case-sensitive, but at least the rule should be the same for all definitions, whether it is a PRIMARY KEY or not.

        On the code itself: the changes described above are extensive in that they involve an almost complete rewrite of the select, createCF, update, delete and alterTable statements, as well as a non-trivial amount of changes to the grammar. While doing that, I saw a number of things that could be improved/generalized to make the code more readable (typically sometimes the code of a statement was entirely in QueryProcessor, sometimes it was in the statement class, sometimes it was split between both, etc..) and to fix a number of (minor) issues. And because we have decided that cql 3 is a 'fork' of cql 2, I decided it would be a good occasion to improve the code. One thing leading to another, the patch refactors a good chunk of the cql code. I know that it's not the way we usually do things but I think the circumstance are a bit difference this time in that the patch is (almost) only new code, and we've agreed cql3 will be beta to start with. I'm also willing to devote a fair chunk of my time to the testing of that new cql version. I hope you guys won't be pissed off by this. In any case, the patches fix the following issues with cql (in the new cql3 that is, it does not backport the fixes to cql 2.0):

        • Handles correctly the case where an update inside a batch override the current keyspace
        • Support overriding the current keyspace in all statements that apply to a CF. I.e, add the optional override to CreateCF, CreateIndex, atlterTable and dropCF.
        • make counters work with prepared statements (the grammar don't allow markers like 'X = X - ?', and even if it was, the code had to be updated to handle it correctly).
        • Delete's consistency level isn't validated
        • Correctly batch batches. I.e, create only one RowMutation for a given (keyspace, key) for the whole batch statement. That fix is in a separated patch.

        Last note: the patch does not make any attempt to support transparently super columns with the new notations. I think trying to do so would be messy, so I think it will be a better use of our time to internally transform super columns into composites (aka CASSANDRA-3237).

        Show
        Sylvain Lebresne added a comment - Attaching patches. I'll try to provide the rational for the changes introduces (on top of basic support for the syntaxes from create_cf_syntaxes.txt). All feedback/comments on those will be greatly appreciated, I'm not trying to impose anything, but I do think after much reflexion that those choices are the right way to move forward. First, on the backward compatibility issue: the changes proposed are breaking changes (as indicated by the major version bump). The patch creates a new java package cql3 that is completely separate from the cql package. So for a time both cql 2.0 and 3.0 will be supported. To make that work, the patch adds a new thrift method set_cql_version(String) to allow setting the version during the client session (in the second patch). The default version in the patch is 3.0 but it's only because it makes testing easier until drivers adds support for this new method. Based on the discussion we had on CASSANDRA-2474 , what I propose is that for C* 1.1 we add only 3.0 as a beta/demonstration version with 2.0 still being the default. If everything goes well, in C* 1.2, cql 3.0 will become the default and 2.0 will be supported by deprecated. One change the patches does is to make static CF (i.e, the ones that don't use COMPACT STORAGE) really static, i.e. adding non defined columns is not supported. The reasons are numerous (I'll probably even forget some): If we were to allow non defined columns in static CF, we would pretty much allow to use a static CF as a wide row. So we would have to support doing a slice on those static CF. But that is kind of contradictory with the introduction of the transposed idea. And if we don't allow slices on static CF, what is the point of allowing random columns? It makes the code simpler. In particular it avoids complication with sparse composites. It's helpful to users. If you use a static CF, we tell you when you do a typo inserting a column (I do think it's a very useful thing). It's not a limitation. If you need a new column, you can do an ALTER ADD, it's cheap. And if your columns names are really "random", then what you really want is a transposed/compact CF anyway. It means it makes sense to limit prepared markers to the right side of a relation. That in turns allows to do a bit more work during preparation and make stuff like CASSANDRA-3753 possible/easy. It make the language much closer to SQL. Don't get me wrong, I don't like SQL all that much, but CQL does reuses SQL syntax and core concepts. Making it easier on all the people that know SQL is a good thing provided there is no downside to do it. And in that case I don't think there is one, outside of breaking compatibility with CQL 2.0, which will be broken anyway. Another thing the patch does is to add consistency to our handling of case-sensitivity for column names. What I mean here is that currently, when you declare: CREATE TABLE ( MyKey text PRIMARY KEY, Column1 int, Column2 int, ) then MyKey is case-insensitive and Column1 and Column2 are case-sensitive. We should fix that inconsistency. The patch makes the choice to reuse the way SQL (at least PostgreSQL) deal with this: all definition names (MyKey, Column1 and Column2 in my example) are case insensitive by default (they are lowercased basically) but you can specify a case sensitive one using double-quotes. The rational is that with the static is static idea above, we are in a case very similar to SQL and so I didn't see a very good reason to do things differently (again, except for the issue of backward compatibility). Note that in the wide row (transposed) case, the C* column name is not a 'definition name', so that rule won't apply. For consistency sake, keyspace and column family names also follow the same rule. We could alternatively make everything case-sensitive, but at least the rule should be the same for all definitions, whether it is a PRIMARY KEY or not. On the code itself: the changes described above are extensive in that they involve an almost complete rewrite of the select, createCF, update, delete and alterTable statements, as well as a non-trivial amount of changes to the grammar. While doing that, I saw a number of things that could be improved/generalized to make the code more readable (typically sometimes the code of a statement was entirely in QueryProcessor, sometimes it was in the statement class, sometimes it was split between both, etc..) and to fix a number of (minor) issues. And because we have decided that cql 3 is a 'fork' of cql 2, I decided it would be a good occasion to improve the code. One thing leading to another, the patch refactors a good chunk of the cql code. I know that it's not the way we usually do things but I think the circumstance are a bit difference this time in that the patch is (almost) only new code, and we've agreed cql3 will be beta to start with. I'm also willing to devote a fair chunk of my time to the testing of that new cql version. I hope you guys won't be pissed off by this. In any case, the patches fix the following issues with cql (in the new cql3 that is, it does not backport the fixes to cql 2.0): Handles correctly the case where an update inside a batch override the current keyspace Support overriding the current keyspace in all statements that apply to a CF. I.e, add the optional override to CreateCF, CreateIndex, atlterTable and dropCF. make counters work with prepared statements (the grammar don't allow markers like 'X = X - ?', and even if it was, the code had to be updated to handle it correctly). Delete's consistency level isn't validated Correctly batch batches. I.e, create only one RowMutation for a given (keyspace, key) for the whole batch statement. That fix is in a separated patch. Last note: the patch does not make any attempt to support transparently super columns with the new notations. I think trying to do so would be messy, so I think it will be a better use of our time to internally transform super columns into composites (aka CASSANDRA-3237 ).
        Hide
        Sylvain Lebresne added a comment -

        Also attaching the tests that I wrote so far to test that issue. They are for https://github.com/riptano/cassandra-dtest because I did not bother yet to move them some place else. It's far for covering everything but it's a start.

        Show
        Sylvain Lebresne added a comment - Also attaching the tests that I wrote so far to test that issue. They are for https://github.com/riptano/cassandra-dtest because I did not bother yet to move them some place else. It's far for covering everything but it's a start.
        Hide
        Jawahar Prasad JP added a comment - - edited

        This makes sense.. But can you please point out the difference between these two? (maybe in terms of performance, etc.,)

        1)

        CREATE TABLE timeline (
        userid uuid,
        posted_at date,
        body text,
        posted_by text,
        PRIMARY KEY (user_id, posted_at),
        );

        SELECT body, posted_by FROM timeline WHERE userid = '...' and posted_at = '2 janvier 2010'
        SELECT posted_ad, body, posted_by FROM timeline WHERE userid = '...' and posted_at > '2 janvier 2010'

        2)

        CREATE TABLE timeline (
        userid uuid PRIMARY KEY,
        posted_at date,
        body text,
        posted_by text
        );
        CREATE INDEX ON timeline(posted_at);

        SELECT body, posted_by FROM timeline WHERE userid = '...' and posted_at = '2 janvier 2010'
        SELECT posted_ad, body, posted_by FROM timeline WHERE userid = '...' and posted_at > '2 janvier 2010'

        And also(a dumb question),
        If I set PRIMARY KEY (user_id, posted_at) , I MUST ALWAYS filter by userid and posted_at ?(WHERE userid = '...' and posted_at > '2 janvier 2010')

        Show
        Jawahar Prasad JP added a comment - - edited This makes sense.. But can you please point out the difference between these two? (maybe in terms of performance, etc.,) 1) CREATE TABLE timeline ( userid uuid, posted_at date, body text, posted_by text, PRIMARY KEY (user_id, posted_at), ); SELECT body, posted_by FROM timeline WHERE userid = '...' and posted_at = '2 janvier 2010' SELECT posted_ad, body, posted_by FROM timeline WHERE userid = '...' and posted_at > '2 janvier 2010' 2) CREATE TABLE timeline ( userid uuid PRIMARY KEY, posted_at date, body text, posted_by text ); CREATE INDEX ON timeline(posted_at); SELECT body, posted_by FROM timeline WHERE userid = '...' and posted_at = '2 janvier 2010' SELECT posted_ad, body, posted_by FROM timeline WHERE userid = '...' and posted_at > '2 janvier 2010' And also(a dumb question), If I set PRIMARY KEY (user_id, posted_at) , I MUST ALWAYS filter by userid and posted_at ?(WHERE userid = '...' and posted_at > '2 janvier 2010')
        Hide
        Sylvain Lebresne added a comment -

        @jawahar In you 2), userid is the only key, and hence each user can only have one posted_at (which is unlikely what you want in that example). Creating an index on posted_at doesn't change that, it just make it possible to search for the users based on their unique posted_at value (in particular, the 2nd query or the 2nd example doesn't make a whole lot of sense). And in the first example, if you only specifiy 'WHERE userid = '..'' for instance, it'll return all the records for that userid.

        Show
        Sylvain Lebresne added a comment - @jawahar In you 2), userid is the only key, and hence each user can only have one posted_at (which is unlikely what you want in that example). Creating an index on posted_at doesn't change that, it just make it possible to search for the users based on their unique posted_at value (in particular, the 2nd query or the 2nd example doesn't make a whole lot of sense). And in the first example, if you only specifiy 'WHERE userid = '..'' for instance, it'll return all the records for that userid.
        Hide
        Jawahar Prasad JP added a comment -

        Ok. I got it. Another question:

        1. For example: I have columnfamily to store the broken links :

        broken_link_id uuid PRIMARY KEY,
        site_id uuid,
        site_url text,
        page_url text,
        broken_link text,
        anchor_text text,
        tstamp timestamp

        I want something like this:
        site_id, site_url, page_url,

        {broken_link_id,broken_link,anchor_text,tstamp}

        I came up with this WRONG version:
        CREATE COLUMNFAMILY broken_links (
        broken_link_id uuid,
        site_id uuid,
        site_url text,
        page_url text,
        broken_link text,
        anchor_text text,
        tstamp timestamp,
        PRIMARY KEY(site_id, site_url, page_url)
        );
        Can you let me know how do I write this correctly please ? Or is there any doc to take a look at this(for CQL) ?

        2. How do you know what columns to be considered as composite based on PRIMARY KEY?

        Show
        Jawahar Prasad JP added a comment - Ok. I got it. Another question: 1. For example: I have columnfamily to store the broken links : broken_link_id uuid PRIMARY KEY, site_id uuid, site_url text, page_url text, broken_link text, anchor_text text, tstamp timestamp I want something like this: site_id, site_url, page_url, {broken_link_id,broken_link,anchor_text,tstamp} I came up with this WRONG version: CREATE COLUMNFAMILY broken_links ( broken_link_id uuid, site_id uuid, site_url text, page_url text, broken_link text, anchor_text text, tstamp timestamp, PRIMARY KEY(site_id, site_url, page_url) ); Can you let me know how do I write this correctly please ? Or is there any doc to take a look at this(for CQL) ? 2. How do you know what columns to be considered as composite based on PRIMARY KEY?
        Hide
        Jonathan Ellis added a comment -

        The PK(X, Y) syntax tells Cassandra to use composite columns internally for that row. This has two effects. First, as Sylvain mentioned above, if you just have the userid as PK then (just as in SQL) you may only have one row with that key, which is not what you want here. Second, it gives you some sorting guarantees for range queries like your posted_at > Z example. See http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1 for more background on composites.

        That said, this really isn't the right place to discuss the basics of querying composites. Let's keep that on the user mailing list, please.

        Show
        Jonathan Ellis added a comment - The PK(X, Y) syntax tells Cassandra to use composite columns internally for that row. This has two effects. First, as Sylvain mentioned above, if you just have the userid as PK then (just as in SQL) you may only have one row with that key, which is not what you want here. Second, it gives you some sorting guarantees for range queries like your posted_at > Z example. See http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1 for more background on composites. That said, this really isn't the right place to discuss the basics of querying composites. Let's keep that on the user mailing list, please.
        Hide
        Pavel Yaskevich added a comment -

        Cql.g looks good.

        Show
        Pavel Yaskevich added a comment - Cql.g looks good.
        Hide
        Jonathan Ellis added a comment -

        Looking at SS.multiRangeSlice: I think we should address the key/token problem as well: CQL < 3 silently turns a "key >= X" into "token(key) >= token(X)". This is not what users will expect, since many of the rows returned will not in fact satisfy the requested key inequality. I've opened CASSANDRA-3771 for this – I don't think we'll get consensus in time to incorporate that here, so I suggest instead just disallowing range queries on row keys for non-OPP for now. We can add one in later without breaking anything, and in the meantime, Hadoop can continue to page via Thrift. (See discussion on CASSANDRA-2878.)

        I don't understand the comment here:

                   sliceRange.count = 1; // We use this for range slices, where the count is ignored in favor of the global column count
        

        My inclination would be if we're going to ignore it, let's set it to something obviously bogus, like -1.

        I have to admit I'm not a huge fan of the Restriction class. Some additional encapsulation might help there since "r == null || !r.isEquality()" checks seem scattered around right now.

        What is CFDefinition.Name trying to encapsulate? "Name value" declaration confuses me.

        In my mind a classic, dynamic CF is the simplest possible dense, not a separate type. So I'm not sure the distinction of Kind.dynamic is useful. (I note that you already have these as the same case in SS.process.)

        Nit: it's more clear to avoid double-negatives, as in the if/else in if (!isKeyRange()) ... else .... Similar nit in ... lots of places.

        Show
        Jonathan Ellis added a comment - Looking at SS.multiRangeSlice: I think we should address the key/token problem as well: CQL < 3 silently turns a "key >= X" into "token(key) >= token(X)". This is not what users will expect, since many of the rows returned will not in fact satisfy the requested key inequality. I've opened CASSANDRA-3771 for this – I don't think we'll get consensus in time to incorporate that here, so I suggest instead just disallowing range queries on row keys for non-OPP for now. We can add one in later without breaking anything, and in the meantime, Hadoop can continue to page via Thrift. (See discussion on CASSANDRA-2878 .) I don't understand the comment here: sliceRange.count = 1; // We use this for range slices, where the count is ignored in favor of the global column count My inclination would be if we're going to ignore it, let's set it to something obviously bogus, like -1. I have to admit I'm not a huge fan of the Restriction class. Some additional encapsulation might help there since "r == null || !r.isEquality()" checks seem scattered around right now. What is CFDefinition.Name trying to encapsulate? "Name value" declaration confuses me. In my mind a classic, dynamic CF is the simplest possible dense, not a separate type. So I'm not sure the distinction of Kind.dynamic is useful. (I note that you already have these as the same case in SS.process.) Nit: it's more clear to avoid double-negatives, as in the if/else in if (!isKeyRange()) ... else ... . Similar nit in ... lots of places.
        Hide
        Sylvain Lebresne added a comment -

        so I suggest instead just disallowing range queries on row keys for non-OPP for now.

        I don't disagree, but I'll note that the problem is the same for indexed queries. We can only allow full-ring range slices however (maybe that's what you meant); i.e. disallow having a non-equal restriction on the row key, but allow to have no restrictions at all. As a side I'll note that imho that's yet one more reason to spit PK in two (the argument being that when you start having too many different restrictions on the first component of the PK and the others, maybe it's worth acknowledging it in the syntax).

        My inclination would be if we're going to ignore it, let's set it to something obviously bogus, like -1

        Sure. For some reason I cannot really explain I wanted to avoid 0 and went with 1, but -1 is better.

        I have to admit I'm not a huge fan of the Restriction class

        As the comments in the code says, I'm not an absolute fan either, but it was still the best I could come up at the time. The goal being to make validation and querying as fast as possible. I'm open to suggestions for a better encapsulation though.

        What is CFDefinition.Name trying to encapsulate?

        It's representing what I could call for lack of a better name a 'CQL column name'. I.e, the name of one of the definition in the CREATE TABLE definition. And in the case of Compact Storage, one of those 'CQL column name' happens to be the name for the C* value the record represent. So yes, there is some clash of terminology: a CQL column can be, in C*, one of: the row key, the columm name, a component of the colum name or the column value. Again, I'm open to suggestions to better names that avoid the confusion.

        "Name value" declaration confuses me

        So to complete my comment above, you can see CFDefinition as the correspondence between the CQL names defined in the CREATE TABLE and what they represent internally. 'Name value' is the name used to represent the C* value in the COMPACT STORAGE case. Maybe calling it value_alias would avoid the confusion? (I avoided calling them 'alias' at first, because from the CQL standpoing, they are not mere aliases but actual (CQL) column names).

        In my mind a classic, dynamic CF is the simplest possible dense, not a separate type. So I'm not sure the distinction of Kind.dynamic is useful.

        Yes but internally a dynamic CF don't use a composite, so it is bound to have some differences internally. That being said, now that I think about it, I should be able to make those differences even smaller by pushing them inside the CompositeType.Builder. That'll probably also merge a few differences between sparse and static.

        Now that remind me of a detail that is worth mentioning. In the current patch, if you declare the following:

        CREATE TABLE foo (key text, col text, value text, PRIMARY KEY (key, col)) WITH COMPACT STORAGE;
        

        and you do a SELECT * FROM foo WHERE key = 'k', then it'll return all the record for key 'k'. But if now you declare:

        CREATE TABLE foo (key text, col1 text, col2 text, value text, PRIMARY KEY (key, col1, col2)) WITH COMPACT STORAGE;
        

        and do SELECT * FROM foo WHERE key = 'k' AND col1 = 'c', then you will only get one record back, the one where col1 is 'c' and col2 is not defined, if it exists, not all the record that have as prefix col1 == 'c'. The reason being that for dense, we allow to have only a prefix of the key to be defined, and so you can do:

        INSERT INTO foo (key, col1) VALUES ('k', 'c')
        

        One option would be to disallow doing this, but it would add an actual limitation to what is possible.

        Show
        Sylvain Lebresne added a comment - so I suggest instead just disallowing range queries on row keys for non-OPP for now. I don't disagree, but I'll note that the problem is the same for indexed queries. We can only allow full-ring range slices however (maybe that's what you meant); i.e. disallow having a non-equal restriction on the row key, but allow to have no restrictions at all. As a side I'll note that imho that's yet one more reason to spit PK in two (the argument being that when you start having too many different restrictions on the first component of the PK and the others, maybe it's worth acknowledging it in the syntax). My inclination would be if we're going to ignore it, let's set it to something obviously bogus, like -1 Sure. For some reason I cannot really explain I wanted to avoid 0 and went with 1, but -1 is better. I have to admit I'm not a huge fan of the Restriction class As the comments in the code says, I'm not an absolute fan either, but it was still the best I could come up at the time. The goal being to make validation and querying as fast as possible. I'm open to suggestions for a better encapsulation though. What is CFDefinition.Name trying to encapsulate? It's representing what I could call for lack of a better name a 'CQL column name'. I.e, the name of one of the definition in the CREATE TABLE definition. And in the case of Compact Storage, one of those 'CQL column name' happens to be the name for the C* value the record represent. So yes, there is some clash of terminology: a CQL column can be, in C*, one of: the row key, the columm name, a component of the colum name or the column value. Again, I'm open to suggestions to better names that avoid the confusion. "Name value" declaration confuses me So to complete my comment above, you can see CFDefinition as the correspondence between the CQL names defined in the CREATE TABLE and what they represent internally. 'Name value' is the name used to represent the C* value in the COMPACT STORAGE case. Maybe calling it value_alias would avoid the confusion? (I avoided calling them 'alias' at first, because from the CQL standpoing, they are not mere aliases but actual (CQL) column names). In my mind a classic, dynamic CF is the simplest possible dense, not a separate type. So I'm not sure the distinction of Kind.dynamic is useful. Yes but internally a dynamic CF don't use a composite, so it is bound to have some differences internally. That being said, now that I think about it, I should be able to make those differences even smaller by pushing them inside the CompositeType.Builder. That'll probably also merge a few differences between sparse and static. Now that remind me of a detail that is worth mentioning. In the current patch, if you declare the following: CREATE TABLE foo (key text, col text, value text, PRIMARY KEY (key, col)) WITH COMPACT STORAGE; and you do a SELECT * FROM foo WHERE key = 'k' , then it'll return all the record for key 'k'. But if now you declare: CREATE TABLE foo (key text, col1 text, col2 text, value text, PRIMARY KEY (key, col1, col2)) WITH COMPACT STORAGE; and do SELECT * FROM foo WHERE key = 'k' AND col1 = 'c' , then you will only get one record back, the one where col1 is 'c' and col2 is not defined, if it exists, not all the record that have as prefix col1 == 'c'. The reason being that for dense, we allow to have only a prefix of the key to be defined, and so you can do: INSERT INTO foo (key, col1) VALUES ('k', 'c') One option would be to disallow doing this, but it would add an actual limitation to what is possible.
        Hide
        Jonathan Ellis added a comment -

        I'll note that the problem is the same for indexed queries

        Right, "pagination" client-side is broken there as well. Again, I think we should disallow it for now and add it in when we have a solution we're happy with. (Maybe that even means the existing syntax, but I'm not convinced of that.)

        internally a dynamic CF don't use a composite, so it is bound to have some differences internally

        Maybe. In my mind I'd rather make that distinction at the "Okay we have Kind.DENSE, now do I have to treat this code differently based on cf.comparator instanceof compositetype?" But I'm happy to see what it looks like with your Builder approach.

        (That reminds me – we should probably go with COMPACT instead of DENSE, to match the CQL syntax.)

        you will only get one record back, the one where col1 is 'c' and col2 is not defined, if it exists, not all the record that have as prefix col1 == 'c'

        That sounds like a bug to me. If I wanted to restrict it to col2 not defined, I should have to write SELECT * FROM foo WHERE key = 'k' AND col1 = 'c' AND col2 = null. Which I am fine with leaving out for now, since we've ignored null so far in CQL, but the behavior of SELECT * FROM foo WHERE key = 'k' AND col1 = 'c' should definitely be "all columns with prefix c." (So maybe we need to disallow that too temporarily, until we've done some of the work for CASSANDRA-3237.)

        Show
        Jonathan Ellis added a comment - I'll note that the problem is the same for indexed queries Right, "pagination" client-side is broken there as well. Again, I think we should disallow it for now and add it in when we have a solution we're happy with. (Maybe that even means the existing syntax, but I'm not convinced of that.) internally a dynamic CF don't use a composite, so it is bound to have some differences internally Maybe. In my mind I'd rather make that distinction at the "Okay we have Kind.DENSE, now do I have to treat this code differently based on cf.comparator instanceof compositetype?" But I'm happy to see what it looks like with your Builder approach. (That reminds me – we should probably go with COMPACT instead of DENSE, to match the CQL syntax.) you will only get one record back, the one where col1 is 'c' and col2 is not defined, if it exists, not all the record that have as prefix col1 == 'c' That sounds like a bug to me. If I wanted to restrict it to col2 not defined, I should have to write SELECT * FROM foo WHERE key = 'k' AND col1 = 'c' AND col2 = null . Which I am fine with leaving out for now, since we've ignored null so far in CQL, but the behavior of SELECT * FROM foo WHERE key = 'k' AND col1 = 'c' should definitely be "all columns with prefix c." (So maybe we need to disallow that too temporarily, until we've done some of the work for CASSANDRA-3237 .)
        Hide
        Eric Evans added a comment -

        I'm still looking over this, but I thought I'd mention that I'm unable to get the tests to pass (most of them produce errors). For example:

        $ nosetests -sxv cql_tests.py 
        cql_tests.TestCQL.counters_test ... ERROR
        
        ======================================================================
        ERROR: cql_tests.TestCQL.counters_test
        ----------------------------------------------------------------------
        Traceback (most recent call last):
          File "/usr/lib/pymodules/python2.7/nose/case.py", line 187, in runTest
            self.test(*self.arg)
          File "/home/eevans/dev/src/git/cassandra/cql_tests.py", line 339, in counters_test
            cursor.execute("UPDATE clicks SET total = total + 1 WHERE userid = 1 AND url = 'http://foo.com'")
          File "/usr/local/lib/python2.7/dist-packages/cql/cursor.py", line 89, in execute
            raise cql.ProgrammingError("Bad Request: %s" % ire.why)
        ProgrammingError: Bad Request: line 1:79 mismatched character '<EOF>' expecting '''
        
        ----------------------------------------------------------------------
        Ran 1 test in 1.871s
        
        FAILED (errors=1)
        

        In this case, the error actually seems to make sense. Since // is recognized as a comment. Did the wrong version of the tests get attached maybe?

        Another example is:

        NameError: global name 'assert_invalid' is not defined
        

        Where is assert_invalid meant to come from?

        Also, I think the patches needs to be rebased in the wake of #3689 (although I think the conflicts are minor).

        Thanks!

        Show
        Eric Evans added a comment - I'm still looking over this, but I thought I'd mention that I'm unable to get the tests to pass (most of them produce errors). For example: $ nosetests -sxv cql_tests.py cql_tests.TestCQL.counters_test ... ERROR ====================================================================== ERROR: cql_tests.TestCQL.counters_test ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/pymodules/python2.7/nose/case.py", line 187, in runTest self.test(*self.arg) File "/home/eevans/dev/src/git/cassandra/cql_tests.py", line 339, in counters_test cursor.execute("UPDATE clicks SET total = total + 1 WHERE userid = 1 AND url = 'http://foo.com'") File "/usr/local/lib/python2.7/dist-packages/cql/cursor.py", line 89, in execute raise cql.ProgrammingError("Bad Request: %s" % ire.why) ProgrammingError: Bad Request: line 1:79 mismatched character '<EOF>' expecting ''' ---------------------------------------------------------------------- Ran 1 test in 1.871s FAILED (errors=1) In this case, the error actually seems to make sense. Since // is recognized as a comment. Did the wrong version of the tests get attached maybe? Another example is: NameError: global name 'assert_invalid' is not defined Where is assert_invalid meant to come from? Also, I think the patches needs to be rebased in the wake of #3689 (although I think the conflicts are minor). Thanks!
        Hide
        Thorsten von Eicken added a comment -

        Question: in your Standard "dynamic" CF example from create_cf_syntaxes.txt:

        CREATE TABLE clicks (
            userid uuid,
            url text,
            timestamp date
            PRIMARY KEY (userid, url)
        ) WITH COMPACT STORAGE
        

        I assume I can do a column slice using something like:

        SELECT timestamp FROM clicks WHERE userid = '...' and url >= '...' and url <= '...';
        

        Correct?

        Show
        Thorsten von Eicken added a comment - Question: in your Standard "dynamic" CF example from create_cf_syntaxes.txt: CREATE TABLE clicks ( userid uuid, url text, timestamp date PRIMARY KEY (userid, url) ) WITH COMPACT STORAGE I assume I can do a column slice using something like: SELECT timestamp FROM clicks WHERE userid = '...' and url >= '...' and url <= '...'; Correct?
        Hide
        Thorsten von Eicken added a comment -

        In #CASSANDRA-2747 you said

        My current patch don't really allow creating a secondary index on a non static CF, because it's not clear how that would work from the syntax and I'm sure I see good use for that. Does that bother someone ?

        The answer is "yes". Why? I'm finding it very useful to abuse the wide-row column names a little bit to be able to group rows and be able to operate on them in bulk. It's typically not in the frequent query path but required for management operations.

        One example: a system that stores snippets from web server logs. Each snippet is a row, each line a column, the column names simple line numbers 1, 2, 3, ... The row key is a snippet id. Data is entered in real time and I need to purge old data after a certain retention period measured in days. To do that, I abuse column 0 to store the date and put an index on that. To remove data for a day I traverse this secondary index.

        Second example: a multi-tenant system where the data has some wide rows. We need to extract a user's data for a variety of purposes, for example to migrate the data to a different cluster. We again abuse the column names a bit and add a user_id column with a secondary index.

        Unless I completely misunderstood your statement above and these secondary indexes are possible in cql3, I suspect your answer will be to use a separate CF, but that does put a lot more burden on the app, the avoidance of which is one the big benefits of secondary indexes.

        Show
        Thorsten von Eicken added a comment - In # CASSANDRA-2747 you said My current patch don't really allow creating a secondary index on a non static CF, because it's not clear how that would work from the syntax and I'm sure I see good use for that. Does that bother someone ? The answer is "yes". Why? I'm finding it very useful to abuse the wide-row column names a little bit to be able to group rows and be able to operate on them in bulk. It's typically not in the frequent query path but required for management operations. One example: a system that stores snippets from web server logs. Each snippet is a row, each line a column, the column names simple line numbers 1, 2, 3, ... The row key is a snippet id. Data is entered in real time and I need to purge old data after a certain retention period measured in days. To do that, I abuse column 0 to store the date and put an index on that. To remove data for a day I traverse this secondary index. Second example: a multi-tenant system where the data has some wide rows. We need to extract a user's data for a variety of purposes, for example to migrate the data to a different cluster. We again abuse the column names a bit and add a user_id column with a secondary index. Unless I completely misunderstood your statement above and these secondary indexes are possible in cql3, I suspect your answer will be to use a separate CF, but that does put a lot more burden on the app, the avoidance of which is one the big benefits of secondary indexes.
        Hide
        Sylvain Lebresne added a comment -

        Attached rebased and update patch set. I've mainly:

        • Refactored a bit using the 'builder approach' I talked earlier on. It does simplify a bunch of case. I've removed the CFDefinition.Kind enum and instead keep just two isComposite and isCompact flags. It's equivalent to the previous enum but maybe more clear and in any case we now have the code for dense and dynamic that are always shared, and the one for static and sparse almost always too.
        • Removed support for adding non-EQ clause on the row keys with RP (we'll deal with it in CASSANDRA-3771).
        • Change the behavior of selecting on composite without giving all the components. It'll now select all the record having the given prefix (for which we don't really need CASSANDRA_3237). Ok for adding null support later

        To avoid rebase problem I've also pushed the branch on https://github.com/pcmanus/cassandra/tree/cql3 (6fdf3447b29f36050).
        For the tests, I have them all passing here, but for the 'assert_invalid', I added it to dtests but forgot to push it publicly. I've created https://github.com/riptano/cassandra-dtest/tree/cql3_tests with an updated version of the tests (I only tested cql_test.py on this branch so far).

        Show
        Sylvain Lebresne added a comment - Attached rebased and update patch set. I've mainly: Refactored a bit using the 'builder approach' I talked earlier on. It does simplify a bunch of case. I've removed the CFDefinition.Kind enum and instead keep just two isComposite and isCompact flags. It's equivalent to the previous enum but maybe more clear and in any case we now have the code for dense and dynamic that are always shared, and the one for static and sparse almost always too. Removed support for adding non-EQ clause on the row keys with RP (we'll deal with it in CASSANDRA-3771 ). Change the behavior of selecting on composite without giving all the components. It'll now select all the record having the given prefix (for which we don't really need CASSANDRA_3237). Ok for adding null support later To avoid rebase problem I've also pushed the branch on https://github.com/pcmanus/cassandra/tree/cql3 (6fdf3447b29f36050). For the tests, I have them all passing here, but for the 'assert_invalid', I added it to dtests but forgot to push it publicly. I've created https://github.com/riptano/cassandra-dtest/tree/cql3_tests with an updated version of the tests (I only tested cql_test.py on this branch so far).
        Hide
        Sylvain Lebresne added a comment -

        @Thorsten:

        I assume I can do a column slice using something like

        You assume well.

        I'm finding it very useful to abuse the wide-row column names a little bit to be able to group rows

        I note your objection and use case. But to be clear, I only meant that the current patch don't support secondary indexes on wide rows, not that CQL 3 wouldn't ultimately support them. I just meant that I'd prefer pushing support to a following ticket.

        Show
        Sylvain Lebresne added a comment - @Thorsten: I assume I can do a column slice using something like You assume well. I'm finding it very useful to abuse the wide-row column names a little bit to be able to group rows I note your objection and use case. But to be clear, I only meant that the current patch don't support secondary indexes on wide rows, not that CQL 3 wouldn't ultimately support them. I just meant that I'd prefer pushing support to a following ticket.
        Hide
        Jonathan Ellis added a comment -

        I've removed the CFDefinition.Kind enum and instead keep just two isComposite and isCompact flags

        What is the new equivalent of Kind.static then? !composite && !compact?

        Show
        Jonathan Ellis added a comment - I've removed the CFDefinition.Kind enum and instead keep just two isComposite and isCompact flags What is the new equivalent of Kind.static then? !composite && !compact?
        Hide
        Sylvain Lebresne added a comment - - edited

        What is the new equivalent of Kind.static then? !composite && !compact?

        Yes. But static is just the simplest possible sparse, so we almost never test for that, we just have 2 cases: compact or !compact (the only place were static does differ of sparse is when transforming a query result into a resultSet, where we have to "group" columns for the sparse case).

        Show
        Sylvain Lebresne added a comment - - edited What is the new equivalent of Kind.static then? !composite && !compact? Yes. But static is just the simplest possible sparse, so we almost never test for that, we just have 2 cases: compact or !compact (the only place were static does differ of sparse is when transforming a query result into a resultSet, where we have to "group" columns for the sparse case).
        Hide
        Jonathan Ellis added a comment -

        +1 from me, assuming we get the python driver kinks worked out, and pending Eric's review

        Show
        Jonathan Ellis added a comment - +1 from me, assuming we get the python driver kinks worked out, and pending Eric's review
        Hide
        Eric Evans added a comment -

        +1 from me, assuming we get the python driver kinks worked out, and pending Eric's review

        I'm still poking at it, but I was just about to suggest that we go ahead and commit it now.

        Also, IMO, since this is an explicitly experimental feature, and because it could benefit from rapid iteration, I don't think we should subject it to the usual rules for a freeze. In fact, we should also relax the ticket/review requirement, particular for simple issues that have no bearing on any of the discussed semantics.

        Show
        Eric Evans added a comment - +1 from me, assuming we get the python driver kinks worked out, and pending Eric's review I'm still poking at it, but I was just about to suggest that we go ahead and commit it now. Also, IMO, since this is an explicitly experimental feature, and because it could benefit from rapid iteration, I don't think we should subject it to the usual rules for a freeze. In fact, we should also relax the ticket/review requirement, particular for simple issues that have no bearing on any of the discussed semantics.
        Hide
        Sylvain Lebresne added a comment -

        I'm still poking at it, but I was just about to suggest that we go ahead and commit it now.

        Ok, I'll commit it.

        because it could benefit from rapid iteration, I don't think we should subject it to the usual rules for a freeze

        Agreed on both count. Let's still be careful not to break stuffs too much though: it wouldn't look too nice if this was broken because of a stupid typo or something in the final 1.1, even if that's just a 'developer preview'.

        we should also relax the ticket/review requirement, particular for simple issues that have no bearing on any of the discussed semantics.

        I'm good with committing directly obvious bug fix, but I think I'd prefer we stick with tickets and review for anything bigger. If only because it facilitates synchronization between us. But also because I think reviews help keeping the code cleaner and more bug-free imho, even if that slows slightly the process (and it allows better tracing of the changes).

        What would be nice would be to make every ticket targeting cql 3 sub-task of this ticket. Or at least that we tag them with cql3.

        Show
        Sylvain Lebresne added a comment - I'm still poking at it, but I was just about to suggest that we go ahead and commit it now. Ok, I'll commit it. because it could benefit from rapid iteration, I don't think we should subject it to the usual rules for a freeze Agreed on both count. Let's still be careful not to break stuffs too much though: it wouldn't look too nice if this was broken because of a stupid typo or something in the final 1.1, even if that's just a 'developer preview'. we should also relax the ticket/review requirement, particular for simple issues that have no bearing on any of the discussed semantics. I'm good with committing directly obvious bug fix, but I think I'd prefer we stick with tickets and review for anything bigger. If only because it facilitates synchronization between us. But also because I think reviews help keeping the code cleaner and more bug-free imho, even if that slows slightly the process (and it allows better tracing of the changes). What would be nice would be to make every ticket targeting cql 3 sub-task of this ticket. Or at least that we tag them with cql3.
        Hide
        Eric Evans added a comment -

        Agreed on both count. Let's still be careful not to break stuffs too much though: it wouldn't look too nice if this was broken because of a stupid typo or something in the final 1.1, even if that's just a 'developer preview'.

        Of course.

        I'm good with committing directly obvious bug fix, but I think I'd prefer we stick with tickets and review for anything bigger. If only because it facilitates synchronization between us. But also because I think reviews help keeping the code cleaner and more bug-free imho, even if that slows slightly the process (and it allows better tracing of the changes).

        Right, my thinking was that as of right now, it could benefit more from rapid iteration than from exhaustive review. Obviously, that picture changes as things become more solid and there is less low-hanging fruit.

        Anyway, I think what I'm advocating for is common-sense, and it doesn't sound like we disagree.

        What would be nice would be to make every ticket targeting cql 3 sub-task of this ticket. Or at least that we tag them with cql3.

        Agreed (to both).

        Show
        Eric Evans added a comment - Agreed on both count. Let's still be careful not to break stuffs too much though: it wouldn't look too nice if this was broken because of a stupid typo or something in the final 1.1, even if that's just a 'developer preview'. Of course. I'm good with committing directly obvious bug fix, but I think I'd prefer we stick with tickets and review for anything bigger. If only because it facilitates synchronization between us. But also because I think reviews help keeping the code cleaner and more bug-free imho, even if that slows slightly the process (and it allows better tracing of the changes). Right, my thinking was that as of right now, it could benefit more from rapid iteration than from exhaustive review. Obviously, that picture changes as things become more solid and there is less low-hanging fruit. Anyway, I think what I'm advocating for is common-sense, and it doesn't sound like we disagree. What would be nice would be to make every ticket targeting cql 3 sub-task of this ticket. Or at least that we tag them with cql3. Agreed (to both).
        Hide
        Sylvain Lebresne added a comment -

        Anyway, I think what I'm advocating for is common-sense, and it doesn't sound like we disagree.

        We don't.

        I've committed the patches above. I had to rebase following CASSANDRA-1391 which involved a few changes. I'm attaching those changes as patch 0005 here for reference.

        I'me leaving this ticket open on purpose, so that it stays the main ticket for all the sub-tasks to get this to a 3.0 final.

        Show
        Sylvain Lebresne added a comment - Anyway, I think what I'm advocating for is common-sense, and it doesn't sound like we disagree. We don't. I've committed the patches above. I had to rebase following CASSANDRA-1391 which involved a few changes. I'm attaching those changes as patch 0005 here for reference. I'me leaving this ticket open on purpose, so that it stays the main ticket for all the sub-tasks to get this to a 3.0 final.
        Hide
        Sylvain Lebresne added a comment -

        We're about to ship this in 1.2 so let's close this now.

        Show
        Sylvain Lebresne added a comment - We're about to ship this in 1.2 so let's close this now.

          People

          • Assignee:
            Sylvain Lebresne
            Reporter:
            Sylvain Lebresne
            Reviewer:
            Eric Evans
          • Votes:
            1 Vote for this issue
            Watchers:
            15 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development