Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-12373

3.0 breaks CQL compatibility with super columns families

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 3.0.x, 3.11.x
    • Component/s: CQL
    • Labels:
      None

      Description

      This is a follow-up to CASSANDRA-12335 to fix the CQL side of super column compatibility.

      The details and a proposed solution can be found in the comments of CASSANDRA-12335 but the crux of the issue is that super column famillies show up differently in CQL in 3.0.x/3.x compared to 2.x, hence breaking backward compatibilty.

        Issue Links

          Activity

          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          I've started collecting information on what needs to be done. I just want to clarify the behaviour first:

          We would like to change the way schema and the resultset are currently represented (instead of the "" map<key_type, value_type> to two actual columns: column<number> (depending on the current clustering key size) and value, just as it was presented in example in CASSANDRA-12335, although preserve their internal representation (internally, map type will still be used for storage).

          In CQL terms

          CREATE TABLE tbl (
          	key ascii,
          	column1 ascii,
          	"" map<int, ascii>,
          	PRIMARY KEY (key, column1))
          	AND COMPACT STORAGE
          

          would return results in form of

           key          | column1 | column2 | value  |
          --------------+---------+---------+--------+
           key1         | val1    | 1       | value1 |
           key1         | val1    | 2       | value2 |
           key1         | val1    | 3       | value3 |
           key1         | val2    | 1       | value1 |
           key1         | val2    | 2       | value2 |
           key1         | val2    | 3       | value3 |
          

          (note that column2 is not clustering as Sylvain Lebresne described in comment).

          And this kind of special-casing will be valid for both read and write paths.

          Show
          ifesdjeen Alex Petrov added a comment - - edited I've started collecting information on what needs to be done. I just want to clarify the behaviour first: We would like to change the way schema and the resultset are currently represented (instead of the "" map<key_type, value_type> to two actual columns: column<number> (depending on the current clustering key size) and value , just as it was presented in example in CASSANDRA-12335 , although preserve their internal representation (internally, map type will still be used for storage). In CQL terms CREATE TABLE tbl ( key ascii, column1 ascii, "" map< int , ascii>, PRIMARY KEY (key, column1)) AND COMPACT STORAGE would return results in form of key | column1 | column2 | value | --------------+---------+---------+--------+ key1 | val1 | 1 | value1 | key1 | val1 | 2 | value2 | key1 | val1 | 3 | value3 | key1 | val2 | 1 | value1 | key1 | val2 | 2 | value2 | key1 | val2 | 3 | value3 | (note that column2 is not clustering as Sylvain Lebresne described in comment). And this kind of special-casing will be valid for both read and write paths.
          Hide
          slebresne Sylvain Lebresne added a comment -

          We would like to change the way schema and the resultset are currently represented

          Actually, we don't want to touch the schema. That is, to be precise, this ticket shouldn't change how anything is stored internally, and shouldn't thus change the schema tables. This does mean that fixing the output of DESCRIBE is actually not a direct part of this ticket, as I believe it's implemented by the python nowadays. We would however encourage drivers to special case super column familes too so that they expose tbl table of your example as:

          CREATE TABLE tbl (
                  key ascii,
                  column1 ascii,
                  column2 int,
                  value ascii,
                  PRIMARY KEY (key, column1, column2)
          ) WITH COMPACT STORAGE;
          

          and that's indeed how we want the table to behave.
          would return results in form of

          would return results in form of

          Yes, that's what we want. But this goes beyond just result-sets, we want the table to behave exactly as if it was the definition from above, namely that we'll allow queries like

          INSERT INTO tbl (key, column1, column2, value) VALUES (...);
          SELECT value FROM tbl WHERE key = 'key1' AND column1 = 'val1' AND column2 = 2;
          

          but we will not allow

          INSERT INTO tbl (key, column1, "") VALUES (....);
          SELECT "" FROM tbl WHERE key = 'key1' AND column1 = 'val1';
          

          In general though, the best description of what we want this ticket to do is that any CQL query on a super column table should behave in 3.0/3.x exactly as it behaved in 2.x. Which highlight the fact that we have no CQL tests for super columns, and a first step could be to write a decent coverage and test it on 2.x. And then we get it to work on 3.0/3.x.

          I'll note that as I said in CASSANDRA-12335, this means we'll probably need to intercept INSERT/UPDATE and SELECT (raw) statements on super column table early and basically rewrite them to match the internal representation, plus post-processing result sets. It would be really nice if we could keep all that code reasonably encapsulated too.

          Show
          slebresne Sylvain Lebresne added a comment - We would like to change the way schema and the resultset are currently represented Actually, we don't want to touch the schema. That is, to be precise, this ticket shouldn't change how anything is stored internally, and shouldn't thus change the schema tables. This does mean that fixing the output of DESCRIBE is actually not a direct part of this ticket, as I believe it's implemented by the python nowadays. We would however encourage drivers to special case super column familes too so that they expose tbl table of your example as: CREATE TABLE tbl ( key ascii, column1 ascii, column2 int, value ascii, PRIMARY KEY (key, column1, column2) ) WITH COMPACT STORAGE; and that's indeed how we want the table to behave. would return results in form of would return results in form of Yes, that's what we want. But this goes beyond just result-sets, we want the table to behave exactly as if it was the definition from above, namely that we'll allow queries like INSERT INTO tbl (key, column1, column2, value) VALUES (...); SELECT value FROM tbl WHERE key = 'key1' AND column1 = 'val1' AND column2 = 2; but we will not allow INSERT INTO tbl (key, column1, "") VALUES (....); SELECT "" FROM tbl WHERE key = 'key1' AND column1 = 'val1'; In general though, the best description of what we want this ticket to do is that any CQL query on a super column table should behave in 3.0/3.x exactly as it behaved in 2.x. Which highlight the fact that we have no CQL tests for super columns, and a first step could be to write a decent coverage and test it on 2.x. And then we get it to work on 3.0/3.x. I'll note that as I said in CASSANDRA-12335 , this means we'll probably need to intercept INSERT/UPDATE and SELECT (raw) statements on super column table early and basically rewrite them to match the internal representation, plus post-processing result sets. It would be really nice if we could keep all that code reasonably encapsulated too.
          Hide
          ifesdjeen Alex Petrov added a comment -

          Actually, we don't want to touch the schema.

          Right. I've tried to fix my wording (you might have seen the edits), but it was still imprecise.

          Thank you for confirming the results format. I'm mostly done with SELECT special-casing, just need to run a bit more tests to make sure that all the cases are covered. Will move to adding 2.x tests and then to INSERT/UPDATE.

          It would be really nice if we could keep all that code reasonably encapsulated too.

          Gladly, most of time we just need a ResultSet, Partition and CFMetaData, so keeping this code aside should not be a big problem. We could do it similar to CompactTables class.

          Show
          ifesdjeen Alex Petrov added a comment - Actually, we don't want to touch the schema. Right. I've tried to fix my wording (you might have seen the edits), but it was still imprecise. Thank you for confirming the results format. I'm mostly done with SELECT special-casing, just need to run a bit more tests to make sure that all the cases are covered. Will move to adding 2.x tests and then to INSERT/UPDATE . It would be really nice if we could keep all that code reasonably encapsulated too. Gladly, most of time we just need a ResultSet , Partition and CFMetaData , so keeping this code aside should not be a big problem. We could do it similar to CompactTables class.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          What we want/need to do re:schema is change python- and java- drivers, but that's about it.

          Show
          iamaleksey Aleksey Yeschenko added a comment - What we want/need to do re:schema is change python- and java- drivers, but that's about it.
          Hide
          jjordan Jeremiah Jordan added a comment -

          Don't forget the "snapshot" schema changing code.

          Show
          jjordan Jeremiah Jordan added a comment - Don't forget the "snapshot" schema changing code.
          Hide
          slebresne Sylvain Lebresne added a comment -

          What we want/need to do re:schema is change python- and java- drivers

          Right, didn't meant nothing has to be done, just that it wasn't part of this patch (nor C* really). I'll mark "client-impacting" in fact, so clients are aware they have something to do here too.

          Don't forget the "snapshot" schema changing code.

          Well, that's kind of an interesting point because there is in fact currently no way to create a super column table from CQL. That's kind of on purpose, we do not want people to create them nowadays, we only want to maintain backward compatibility, but it does mean if you need to restore a snapshot of a super column table, you have to currently use thrift. So I don't known, maybe in the short term saying that we don't save schema for snapshot at all for super column table is the "most honest" solution.

          But it kind of suggest to me that we should get CASSANDRA-10857 in ASAP, and then force users to use it on super column tables before upgrading to 4.0. It's not ideal, but I don't dragging super columns support forever is a better solution either.

          Show
          slebresne Sylvain Lebresne added a comment - What we want/need to do re:schema is change python- and java- drivers Right, didn't meant nothing has to be done, just that it wasn't part of this patch (nor C* really). I'll mark "client-impacting" in fact, so clients are aware they have something to do here too. Don't forget the "snapshot" schema changing code. Well, that's kind of an interesting point because there is in fact currently no way to create a super column table from CQL. That's kind of on purpose, we do not want people to create them nowadays, we only want to maintain backward compatibility, but it does mean if you need to restore a snapshot of a super column table, you have to currently use thrift. So I don't known, maybe in the short term saying that we don't save schema for snapshot at all for super column table is the "most honest" solution. But it kind of suggest to me that we should get CASSANDRA-10857 in ASAP, and then force users to use it on super column tables before upgrading to 4.0. It's not ideal, but I don't dragging super columns support forever is a better solution either.
          Hide
          ifesdjeen Alex Petrov added a comment -

          While discussing CASSANDRA-7190, we decided to opt-out from writing custom schema loading format (and tool) and just CQL, so now we can only load things that are "CQL-expressible". In schema file we do "best effort", so we say "so here's what it kind of looks like internally, but there's actually no way to re-create that in CQL". We may just leave the schema file as is for now.

          Show
          ifesdjeen Alex Petrov added a comment - While discussing CASSANDRA-7190 , we decided to opt-out from writing custom schema loading format (and tool) and just CQL, so now we can only load things that are "CQL-expressible". In schema file we do "best effort", so we say "so here's what it kind of looks like internally, but there's actually no way to re-create that in CQL". We may just leave the schema file as is for now.
          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          Sylvain Lebresne Aleksey Yeschenko do we want to support things like materialized views?

          (sorry about modification storm).

          Show
          ifesdjeen Alex Petrov added a comment - - edited Sylvain Lebresne Aleksey Yeschenko do we want to support things like materialized views? (sorry about modification storm).
          Hide
          ifesdjeen Alex Petrov added a comment -

          I've implemented a preliminary version of the patch (supercolum counters are still in progress):

          trunk dtest utest
          2.2 dtest utest
          dtest patch

          Should we include support for filtering, LWTs and materialized views? This is going to be some additional work because of column mappings / making fake columns available everywhere. Since we're bringing them back mostly to let people migrate to 3.x storage, I'd suggest leaving these three pieces out. What do you think?

          Show
          ifesdjeen Alex Petrov added a comment - I've implemented a preliminary version of the patch (supercolum counters are still in progress): trunk dtest utest 2.2 dtest utest dtest patch Should we include support for filtering, LWTs and materialized views? This is going to be some additional work because of column mappings / making fake columns available everywhere. Since we're bringing them back mostly to let people migrate to 3.x storage, I'd suggest leaving these three pieces out. What do you think?
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Should we include support for filtering, LWTs and materialized views?

          The minimal requirement is to have the functionality present in 2.1.x/2.2.x restored - for people who migrated their supercolumn-using apps from Thrift to CQL and might already be using certain CQL features with those tables in their code. MVs weren't part of it, LWT and filtering - not sure.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Should we include support for filtering, LWTs and materialized views? The minimal requirement is to have the functionality present in 2.1.x/2.2.x restored - for people who migrated their supercolumn-using apps from Thrift to CQL and might already be using certain CQL features with those tables in their code. MVs weren't part of it, LWT and filtering - not sure.
          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          Thank you. LWTs didn't work, filtering didn't work either, although due to different reasons. I'll just add SuperColumn-specific error messages then.

          Show
          ifesdjeen Alex Petrov added a comment - - edited Thank you. LWTs didn't work , filtering didn't work either , although due to different reasons. I'll just add SuperColumn-specific error messages then.
          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          I hope I've covered all the corner cases. A small note on the implementation: currently I'm using hardcoded column names column2 and value for the fake columns. If you think that can lead to some sort of collision or problem, we can try relying on column name generation from CompactTables, although it's going to be more or less same. These columns are not persisted together with the rest of schema, they're completely virtual and created during CFMetadata construction.

          As Sylvain Lebresne mentioned, SelectStatement, UpdateStatement and DeleteStatement were modified to special-case supercolumns, converting from map to two columns and reverse.

          trunk dtest utest
          3.X dtest utest
          2.2 dtest utest
          dtest patch
          Show
          ifesdjeen Alex Petrov added a comment - - edited I hope I've covered all the corner cases. A small note on the implementation: currently I'm using hardcoded column names column2 and value for the fake columns. If you think that can lead to some sort of collision or problem, we can try relying on column name generation from CompactTables , although it's going to be more or less same. These columns are not persisted together with the rest of schema, they're completely virtual and created during CFMetadata construction. As Sylvain Lebresne mentioned, SelectStatement , UpdateStatement and DeleteStatement were modified to special-case supercolumns, converting from map to two columns and reverse. trunk dtest utest 3.X dtest utest 2.2 dtest utest dtest patch
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          These columns are not persisted together with the rest of schema, they're completely virtual and created during CFMetadata construction.

          I do believe that in 2.1 we currently allow renaming these columns in CQL metadata. For users that have switched from Thrift to CQL for their supercolumns and changed from the default names, losing that metadata on upgrade might count as a breaking change.

          Show
          iamaleksey Aleksey Yeschenko added a comment - These columns are not persisted together with the rest of schema, they're completely virtual and created during CFMetadata construction. I do believe that in 2.1 we currently allow renaming these columns in CQL metadata. For users that have switched from Thrift to CQL for their supercolumns and changed from the default names, losing that metadata on upgrade might count as a breaking change.
          Hide
          ifesdjeen Alex Petrov added a comment -

          You're right, I've just checked and it is in fact possible to rename it since it's not a part of primary key.
          I hope that won't change the implementation a whole lot, we just need to pick up the columns during initialisation correctly. I'll run upgrade tests and check what could be the best way.

          Thanks for catching that.

          Show
          ifesdjeen Alex Petrov added a comment - You're right, I've just checked and it is in fact possible to rename it since it's not a part of primary key. I hope that won't change the implementation a whole lot, we just need to pick up the columns during initialisation correctly. I'll run upgrade tests and check what could be the best way. Thanks for catching that.
          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          I've added a possible fix for that. In order to allow column renames, we have to change LegacySchemaMigrator to pass the columns from older versions in case of supercolumn family. In 3.0, however, they're removed from clustering and regular and converted back to fake "virtual" columns (with corresponding KIND and name, in order to avoid them popping up in queries. Already upgraded tables will still work, even though their schema doesn't have the second clustering and compact value columns, as they're initialised as fake columns if missing.

          Show
          ifesdjeen Alex Petrov added a comment - - edited I've added a possible fix for that. In order to allow column renames, we have to change LegacySchemaMigrator to pass the columns from older versions in case of supercolumn family. In 3.0, however, they're removed from clustering and regular and converted back to fake "virtual" columns (with corresponding KIND and name, in order to avoid them popping up in queries. Already upgraded tables will still work, even though their schema doesn't have the second clustering and compact value columns, as they're initialised as fake columns if missing.
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Reviewing this next. Can you please rebase for most recent 3.X, just in case? Cheers.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Reviewing this next. Can you please rebase for most recent 3.X, just in case? Cheers.
          Hide
          ifesdjeen Alex Petrov added a comment -

          Rebased all branches, links are updated above.

          Show
          ifesdjeen Alex Petrov added a comment - Rebased all branches, links are updated above.
          Hide
          ifesdjeen Alex Petrov added a comment -

          Previous patch had problems with renames of dense supercf (was a result of my assumption that renames would work
          similar to how they'd normally work in 3.x, although it's possible to rename all 4 columns in dense supercf and pk parts
          in sparse). Thanks Aleksey Yeschenko for clarifying sparse/dense semantics in the context of super-cf.

          Without renames, it was quite simple to sort out the supercolumn families, since the columns were always purely virtual.
          With renames, it got a bit more difficult and several problems appeared:

          • difference between sparse and dense causes trouble during upgrade (since in once case we have multiple
            regular rows, so we can't add the supercf key to regulars, since we won't be able to differentiate it later)
          • denseness calculation for supercf (because of the empty name map column) was causing errors during upgrade

          There were several other smaller things, all covered in the patch.

          3.X dtest utest
          2.2 dtest utest
          dtest patch

          I didn't update trunk because of the 4881d9c308ccd6b5ca70925bf6ebedb70e7705fc

          Show
          ifesdjeen Alex Petrov added a comment - Previous patch had problems with renames of dense supercf (was a result of my assumption that renames would work similar to how they'd normally work in 3.x, although it's possible to rename all 4 columns in dense supercf and pk parts in sparse). Thanks Aleksey Yeschenko for clarifying sparse/dense semantics in the context of super-cf. Without renames, it was quite simple to sort out the supercolumn families, since the columns were always purely virtual. With renames, it got a bit more difficult and several problems appeared: difference between sparse and dense causes trouble during upgrade (since in once case we have multiple regular rows, so we can't add the supercf key to regulars, since we won't be able to differentiate it later) denseness calculation for supercf (because of the empty name map column) was causing errors during upgrade There were several other smaller things, all covered in the patch. 3.X dtest utest 2.2 dtest utest dtest patch I didn't update trunk because of the 4881d9c308ccd6b5ca70925bf6ebedb70e7705fc
          Hide
          iamaleksey Aleksey Yeschenko added a comment -

          Tests LGTM, feel free to commit them separately - I think it'd be cleaner this way, anyway.

          Sill thinking through some potential edge cases for the main path on 3.X (to be applied to 3.0.X as well).

          Thanks.

          Show
          iamaleksey Aleksey Yeschenko added a comment - Tests LGTM, feel free to commit them separately - I think it'd be cleaner this way, anyway. Sill thinking through some potential edge cases for the main path on 3.X (to be applied to 3.0.X as well). Thanks.
          Hide
          ifesdjeen Alex Petrov added a comment - - edited

          I'll rebase on top of the latest branches as it seems like the patch has gotten a bit out of date.

          Show
          ifesdjeen Alex Petrov added a comment - - edited I'll rebase on top of the latest branches as it seems like the patch has gotten a bit out of date.
          Hide
          ifesdjeen Alex Petrov added a comment -

          Rebased on top of 3.0 and 3.11. Since we're doing this patch in the preparation for 4.0 where there'll be no thrift, supercolumnfamiles or compact tables, we do not need a trunk patch (only removing last bits of supercolumnfamilies and compact tables from code if there are any).

          3.0 3.11 dtest

          3.0 and 3.11 patches are quite similar but not exactly the same. In 3.0 there are fewer tests (due to the missing features) and there was a difference in SelectStatement since processPartitions is called from two places there. Not sure if we needed to abstract/hide it.

          CI results, including upgrade tests look good.

          Show
          ifesdjeen Alex Petrov added a comment - Rebased on top of 3.0 and 3.11 . Since we're doing this patch in the preparation for 4.0 where there'll be no thrift, supercolumnfamiles or compact tables, we do not need a trunk patch (only removing last bits of supercolumnfamilies and compact tables from code if there are any). 3.0 3.11 dtest 3.0 and 3.11 patches are quite similar but not exactly the same. In 3.0 there are fewer tests (due to the missing features) and there was a difference in SelectStatement since processPartitions is called from two places there. Not sure if we needed to abstract/hide it. CI results, including upgrade tests look good.

            People

            • Assignee:
              ifesdjeen Alex Petrov
              Reporter:
              slebresne Sylvain Lebresne
              Reviewer:
              Sylvain Lebresne
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:

                Development