Details

      Description

      Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows are returned.
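
      A sketch of the global-index approach this ticket pursued, in the CREATE MATERIALIZED VIEW syntax that eventually shipped (table and column names here are illustrative, not from the ticket):

```sql
-- Hypothetical base table
CREATE TABLE users (
    username text PRIMARY KEY,
    state text,
    birth_year int
);

-- A materialized view acts as a server-maintained global index:
-- the data is repartitioned by the new primary key, so a query by
-- state only touches the replicas owning that partition, instead of
-- fanning out to most nodes the way a local 2i does.
CREATE MATERIALIZED VIEW users_by_state AS
    SELECT * FROM users
    WHERE state IS NOT NULL AND username IS NOT NULL
    PRIMARY KEY (state, username);
```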

      1. test-view-data.sh
        3 kB
        Alan Boudreault
      2. users.yaml
        0.8 kB
        Alan Boudreault

        Issue Links

          Activity

          benedict Benedict added a comment -

          I'm just wondering if any thought has been given to weird race conditions on schema changes here. It looks like we open the MV by name, so could there not be some weird sequence of drop/create table/MV that results in the cluster thinking the MV is two different CFs until nodes reboot?

          I haven't thought about it much, so it might be benign, it just looks like a race that won't be fixed by the schema improvements Aleksey Yeschenko is working on

          aweisberg Ariel Weisberg added a comment -

          > Recommitted with fix. Looks like those errors are fixed in my test run above. But we have so many flappy tests it's hard to tell at the moment. Went from 38 to 22 vs 19 on trunk. Care to double check Ariel Weisberg?

          I agree the flappy tests make it really annoying and time consuming. Right now, for every failing test in the build on my branch, I look up the test history on trunk; if it fails on trunk too, I ignore it. That's how I got to looking into cassci page load times...

          This is why I am really bummed about all the regressions in utests and dtests. It renders the entire CI process either very labor intensive or useless. But once we work it off and have it clean it's MUCH easier to keep it clean.

          I compared them. It did take 10 minutes. There are some things that are passing on trunk due to fixes that came in after the branch was rebased, but nothing that looks like a regression.

          jkni Joel Knighton added a comment -

          Jepsen tests committed. These test multiple writes to keys in the base under a variety of crashes and partitions, during steady state cluster composition, bootstrapping, and decommissioning. For now, they're passing.

          Available at: https://github.com/riptano/jepsen/commit/b389ddcea040e607a85ccc276d397084599c2071

          These may be extended further.

          aboudreault Alan Boudreault added a comment -

          dtests committed: https://github.com/riptano/cassandra-dtest/pull/360

          Thanks for all your work on this, devs.

          tjake T Jake Luciani added a comment -

          Recommitted with fix. Looks like those errors are fixed in my test run above. But we have so many flappy tests it's hard to tell at the moment. Went from 38 to 22 vs 19 on trunk. Care to double check Ariel Weisberg?

          tjake T Jake Luciani added a comment -

          I've pushed a fix which is running:

          http://cassci.datastax.com/view/Dev/view/tjake/job/tjake-ticket-6477-rebase-dtest/
          http://cassci.datastax.com/view/Dev/view/tjake/job/tjake-ticket-6477-rebase-testall/

          Turns out we didn't need these mutations, since dropping the keyspace calls invalidate on each of the CFs, which in turn calls SystemKeyspace.setMaterializedViewRemoved, removing the same rows from those tables.

          jbellis Jonathan Ellis added a comment -

          Reverted 3bdcaa336a6e6a9727c333b433bb9f5d3afc0fb1 and b93f05d7d1490c6146576a35f5a572d9d0e72399 pending a fix.

          iamaleksey Aleksey Yeschenko added a comment -

          SchemaKeyspace.makeDropKeyspaceMutation() is the problem, in particular these two calls:

          mutation.add(PartitionUpdate.fullPartitionDelete(SystemKeyspace.BuiltMaterializedViews, mutation.key(), timestamp, nowInSec));
          mutation.add(PartitionUpdate.fullPartitionDelete(SystemKeyspace.MaterializedViewsBuildsInProgress, mutation.key(), timestamp, nowInSec));
          

          The problem is that those tables are in a different keyspace (system) and cannot be added to the mutation for system_schema. The fix is to do what I did in CASSANDRA-6717 for built indexes.

          Once it's done, nits:

          • Still see non-underscore-separated MATERIALIZEDVIEWS and "materializeviews" in MV-related table names and constants, now in SystemKeyspace
          • built and in-progress tables can be made private
          • BuiltMaterializedViews formatting is all messed up, and for some reason (copy-paste from BuiltIndexes most likely) double-quotes the table name needlessly
          aweisberg Ariel Weisberg added a comment -

          Dropping a keyspace now generates an error in the log and that causes the dtest snapshot_test to fail http://cassci.datastax.com/view/Dev/view/carlyeks/job/carlyeks-ticket-6477-rebase-dtest/3/testReport/junit/snapshot_test/TestSnapshot/test_basic_snapshot_and_restore/

          There appear to be other failures in that suite unrelated to materialized views that might not reproduce locally so you can ignore those.

          jkrupan Jack Krupansky added a comment -

          The CQL.textile for MV still shows parentheses around the selection list, which is not the case in SELECT.

          tjake T Jake Luciani added a comment -

          Committed, great work Carl!

          tjake T Jake Luciani added a comment -

          Opened CASSANDRA-9927 to track

          iamaleksey Aleksey Yeschenko added a comment -

          > Oh, but it definitely should be, since there will be at least one column from the base table here, and authz works on a white-list basis.

          Actually, hold off on that for now. Need time to think. /cc Sam Tunnicliffe

          iamaleksey Aleksey Yeschenko added a comment -

          > I think that SelectStatement shouldn't be checking the SELECT on the base table; it's possible for a less restrictive set of values to be present in the MV, so the set of permissions should be accordingly more granular.

          Oh, but it definitely should be, since there will be at least one column from the base table here, and authz works on a white-list basis.

          carlyeks Carl Yeksigian added a comment -

          Aleksey Yeschenko: Just pushed a commit to address your nits.

          I think that SelectStatement shouldn't be checking the SELECT on the base table; it's possible for a less restrictive set of values to be present in the MV, so the set of permissions should be accordingly more granular.

          Joshua McKenzie: I removed the check on counter & ttl – was copy-paste accident.

          T Jake Luciani: I added some metrics about the successful vs attempted replicas

          I'm finishing up the changes to the CQL.textile doc; if people are happy with these latest changes, I'll rebase+squash this down to a single commit.

          iamaleksey Aleksey Yeschenko added a comment -

          Also, that's some impressive turn-around time on the fixes. Kudos.

          iamaleksey Aleksey Yeschenko added a comment -

          Nits:

          Per convention, MATERIALIZEDVIEWS should be MATERIALIZED_VIEWS; likewise "materializeviews" -> "materialized_views" (schema keyspace table name).

          MaterializedViews table def in SchemaKeyspace is all messed up, alignment-wise (and columnfamily_name must be table_name, even though we'll be reworking that whole table for the beta).

          The materializedViews map in CFMetaData is a regular HashMap and it's being updated in place. It will occasionally cause ConcurrentModificationException. The whole thing should be made immutable like Triggers, as CFMetaData will be so entirely, soon.
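
          A minimal sketch of the copy-on-write immutability being suggested (class and field names are illustrative, not Cassandra's actual Triggers/CFMetaData code): instead of mutating a shared HashMap in place, each update builds a fresh map and returns a new immutable instance, so concurrent readers can never hit a ConcurrentModificationException.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable holder for a table's materialized views.
final class MaterializedViews
{
    private final Map<String, String> byName; // view name -> definition (simplified)

    MaterializedViews(Map<String, String> views)
    {
        // Defensive copy + unmodifiable wrapper: no caller can mutate us.
        this.byName = Collections.unmodifiableMap(new HashMap<>(views));
    }

    // "Updates" return a new instance; the original is never modified.
    MaterializedViews with(String name, String definition)
    {
        Map<String, String> copy = new HashMap<>(byName);
        copy.put(name, definition);
        return new MaterializedViews(copy);
    }

    MaterializedViews without(String name)
    {
        Map<String, String> copy = new HashMap<>(byName);
        copy.remove(name);
        return new MaterializedViews(copy);
    }

    int size()
    {
        return byName.size();
    }
}
```

          Iteration over an instance is then always safe, at the cost of copying the map on each schema change, which is rare enough not to matter.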

          iamaleksey Aleksey Yeschenko added a comment -

          Didn't mean to actually add it as a new Flag enum field - only a boolean in CFMetaData (and we want one for isIndex too, instead of the hack with the name, but it's unrelated to the ticket). Its being an actual extra thing in the enum is meaningless, since tables in system_schema.tables will never have it, and views, in the future system_schema.materialized_views, always will (no need to store it).

          But it's good enough for now (will have to go before CASSANDRA-9921, though).

          Thrift validation is still lacking. In particular system_update_column_family and system_drop_column_family. Some tests by someone would be nice (not alpha-blocking).

          SelectStatement is still not asking for SELECT on the base table for MVs.

          tjake T Jake Luciani added a comment -

          Attaching a link to the not yet complete design doc for this work. I should have it mostly finished by EOD

          boneill Brian ONeill added a comment -

          I'm happy to provide some alpha testing. This will be merged to trunk?

          JoshuaMcKenzie Joshua McKenzie added a comment - - edited

          Only feedback I have on the last few commits is very minor: Not sure we need the clause in AlterMaterializedViewStatement to check for meta.isCounter() && TTL since we don't allow counters with MV, do we?

          I don't have any other outstanding concerns w/the patch for an alpha commit at this time. Given the ease and elegance w/which this last round of additions was made I think we're in a pretty good place to respond to feedback from alpha testing on this.

          Edit: T Jake Luciani: Since you asked nicely: "+1"

          tjake T Jake Luciani added a comment -

          Here is a chart showing there is no impact to non-MV write performance with the current branch as compared to trunk (in case anyone was curious):

          http://cstar.datastax.com/graph?stats=7ae77c9c-3608-11e5-bf9d-42010af0688f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=491.04&ymin=0&ymax=140969.4

          tjake T Jake Luciani added a comment -

          Here is the latest dtest run FYI (all passing):

          https://cassci.datastax.com/job/scratch-mv-dtest/9/testReport/materialized_views_test/

          carlyeks Carl Yeksigian added a comment -

          I've pushed new commits which address these comments, and I've created CASSANDRA-9920, CASSANDRA-9921, CASSANDRA-9922 for follow-up discussion.

          Aleksey Yeschenko Joshua McKenzie: this branch is ready for another review.

          carlyeks Carl Yeksigian added a comment -
          • I think it makes sense to add an ALTER MATERIALIZED VIEW statement which allows for changing the options, without allowing the other operations - on par with what is provided with ALTER TABLE for MVs right now. I'll do that and disable using ALTER TABLE with MV's.
          • Adding MATERIALIZEDVIEW to the flags seems like a good idea, and following the isCounter mold here. Moving it to CFMetaData also makes Thrift validation easier.
          • I can add validation for Thrift to disallow modifying the views directly.
          • Definitely need to disallow triggers on MV; they aren't going to work properly. I'm not sure if there is a strong reason to disallow indexes on MVs, but if there is, it shouldn't be too hard to do.
          • It makes sense to move the SELECT permission into the update, like cas; probably also need to ensure MODIFY permission for the MV when updating base
          • This has been following the 2i mainly because it was developed as global indexes, not MV. Might make sense to move to be on par with tables, but they are still intricately related to their underlying base table, so I'm inclined to leave as is for now (but, we should move that discussion to a new ticket before 3.0 GA)
          • Let's move the combining two tables into a follow up ticket dependent on CASSANDRA-9712
          • Let's move this into a follow up ticket which will come between this one and CASSANDRA-9778; might make sense to do with the previous refactor as well
          • cas + MV need to be tested; since the materialized view updates happen inside the Keyspace.apply method, cas should update fine, but we aren't going to be hitting the coordinator batchlog so the failure scenarios may be subtly different.
          • I'll disallow creating view on top of view
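
          The separate ALTER MATERIALIZED VIEW statement proposed above ended up allowing only option changes, roughly as follows (a sketch with illustrative names; columns and keys cannot be altered):

```sql
-- Only WITH options may be changed on a view, mirroring what
-- ALTER TABLE allows for table options.
ALTER MATERIALIZED VIEW users_by_state
    WITH gc_grace_seconds = 3600
     AND comment = 'users indexed by state';
```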
          iamaleksey Aleksey Yeschenko added a comment -

          And it seems like you can create a view on another view. The resulting view itself doesn't get updated, though. Validation should reject creating a view on another view.

          iamaleksey Aleksey Yeschenko added a comment -
          if (!metadata.getMaterializedViews().isEmpty())
              throw new InvalidRequestException("cas operations are disallowed on Global Indexed column families");
          

          This looks to me like a pretty severe restriction. Is it temporary?

          iamaleksey Aleksey Yeschenko added a comment -

          I know we already have CASSANDRA-9736, but in the meantime ALTER TABLE should not be allowed to alter MVs (don't want it to be accidentally forgotten).

          Instead of having a ConcurrentHashSet in Schema (named materializedViewList, for some reason), MVness of a table should be a boolean flag, or, better, an enum, in CFMetaData.

          There seems to be no validation on Thrift side against modifying a view directly (think CassandraServer).

          It's possible to create an index, or add a trigger, to an MV. I think the latter would just break, and not sure if we want to ever support the former.

          Security is not handled completely properly. CREATE MATERIALIZED VIEW is requiring SELECT on the base table. It shouldn't. We should be checking for SELECT permission against the base table on every query against MVs (a check on CREATE is a one-time check, and then the permission can go away at any time). May also look into new permissions altogether for view DDL /cc Sam Tunnicliffe

          We need to think where MVs are on the hierarchy. Are they on the same level as secondary indexes, or on the same level as other tables, UDTs, UDFs, and UDAs? In the latter case, we need new events for the native protocol.

          It would be my strong preference that MVs do not reuse system_schema.tables to store their metadata, even if it means some duplication (for schema code, after the upcoming CASSANDRA-9712 commit that will factor out TableParams, the duplication will be less of an issue). I'm not a fan of the way it's currently done, with a 'link' table; one table with those two merged should be enough.

          We are committed to supporting a proper WHERE clause in the future (and CASSANDRA-9778). I expect the necessary schema to support it to be in 3.0 GA, for forward compatibility. Altering schema tables is not exactly trivial, and may force us to have to wait until 4.0 for those features.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          > The idea with MV repair is we don't actually use the streamed files. We want to just iterate them and pass them through the regular mutation path (which maintains view state as well as applies secondary index updates).

          It took a little more of me beating my skull against the concept offline but I'm good now.

          tjake T Jake Luciani added a comment -

          The idea with MV repair is we don't actually use the streamed files. We want to just iterate them and pass them through the regular mutation path (which maintains view state as well as applies secondary index updates.)

          So once we are done with them we 'release' them so they are deleted.

          I spoke with Benedict and he suggested, rather than explicitly dropping the SSTableReader, that I add them to the supplied transaction and then abort it once the mutations are applied. Testing it now.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          946e221 -> 61aad0e:

          • The removal of the RowUpdateBuilder usage makes the code both more verbose and less readable. Net negative - I'd prefer we make RowUpdateBuilder adhere to some performance guarantees so we can continue using that abstraction to clean up these implementation details. That being said, not worth blocking this ticket for that effort, so maybe a different ticket would be better.
          • Clarify flow of ref acquisition and release in StreamReceiveTask, even if it's just to comment when it's grabbed vs. when it's released. Right now it looks very odd that we're calling reader.selfRef().release() in the hasMaterializedViews path and we're not doing that in the else path. That implies to me that we're taking a ref (or not removing a ref) that we aren't in the else clause but the code doesn't read as such to me. The ISSTableScanner doesn't seem to grab a ref in the BigTableScanner impl (and should release it in .close() if it does since it's AutoClosable), so it looks like we just have an extra selfRef().release() in the hasMaterializedViews path that's unnecessary. I could be missing something, but I expected to see something like the following for write-path repair on a mutation that impacts a CF w/MV's:
            //We have a special path for Materialized view.
            //Since the MV requires cleaning up any pre-existing state, we must put
            //All partitions through the same write path as normal mutations.
            if (hasMaterializedViews)
            {
               for (SSTableReader reader : readers)
               {
                  try (ISSTableScanner scanner = reader.getScanner())
                  {
                     while (scanner.hasNext())
                     {
                        try (UnfilteredRowIterator rowIterator = scanner.next())
                        {
                           new Mutation(PartitionUpdate.fromIterator(rowIterator)).apply();
                        }
                     }
                  }
               }
            }
            
            task.txn.finish();
            task.sstables.clear();
            
            try (Refs<SSTableReader> refs = Refs.ref(readers))
            {
                  // add sstables and build secondary indexes
                  cfs.addSSTables(readers);
                  cfs.indexManager.maybeBuildSecondaryIndexes(readers, cfs.indexManager.allIndexesNames());
            }
            
          • To follow on the above, in StreamReceiveTask.run, we don't add sstables or call cfs.indexManager.maybeBuildSecondaryIndexes on the hasMaterializedViews path. Is that an oversight, or am I just missing something fundamental here?
          • Nit: MaterializedViewTest, line 631 - dead code left in
          • Nit: Some unused imports

          Do we have / could we run a code-coverage report for the new MV code that's added? I'm specifically interested in MaterializedView* and the contents of TemporalRow, especially given the amount of changes you two had to put in post CASSANDRA-9705.

          Benedict added a comment -

          On second thoughts, I don't think that approach will suffice. Since the second write order group is started arbitrarily long after the first, issuing barriers against both offers no guarantees. It only shrinks the window of exposure to the problem.

          I think the simplest and safest thing is to just make the write order global.

          Benedict added a comment -

          Well, I guess we know upfront if the table has an MV, so we only need to do it then. We need to comment the faeces out of this, though, so nobody screws this up later. It's a very specific relationship we'll probably forget.

          T Jake Luciani added a comment - edited

          Another option is to always mark the batchlog "blocking" whenever you mark another keyspace blocking.

          That sounds simple enough. Do you mean all keyspaces? Or just mutations that require an MV update?

          Benedict added a comment -

          The writeOrder is Keyspace-wide, so 2is pass the same writeOrder into their updates as was used for the outer update, yeah.

          Won't comment on the best way to do this with batchlog, given it's a different keyspace. One possible solution is to make the writeOrder completely global. There isn't actually anything terrifyingly bad about doing that, although it would be preferable not to. Another option is to always mark the batchlog "blocking" whenever you mark another keyspace blocking. There is perhaps another option that simply separates the two so they aren't so directly dependent.

          T Jake Luciani added a comment - edited

          We could also do with some comments around things like

          Yeah, I've split it out into two methods locally.

          OK, so we have deadlock in that case.

          How do secondary index updates work, then? We take a write OpOrder for another table inside the base table update.

          Benedict added a comment - edited

          OK, so we have deadlock in that case. The apply of the batchlog happens on a separate writeOrder group. The earlier group could be marked as "blocking", in which case it is permitted to take any necessary action to terminate, but if a later (or even simply different-keyspace) group is not marked blocking and we depend on it, then the system will lock up.

          We could also do with some comments around things like:

                  if (stage != Stage.MATERIALIZED_VIEW_MUTATION)
                  {
                      for (WriteResponseHandlerWrapper wrapper : wrappers)
                          wrapper.handler.get();
                  }
          

          Without any context, it's kind of hard to spot let alone understand the important semantic difference and why it is there.

          In general, I'd prefer more comments around this feature.

          T Jake Luciani added a comment -

          Had a quick browse, and I found that writeOrder is now wrapping multiple synchronous network operations.

          No synchronous network operations, only async. The write to the batchlog is local only.

          I would split it out anyway but we want to ensure we post to the commit log and batchlog before we apply to the base memtable.

          Joshua McKenzie added a comment -

          From e622f9d to bad01d2 look good w/the caveat that f906240 is going to be reviewed on CASSANDRA-9859

          Joshua McKenzie added a comment -
          0d77f03: lgtm
          2707065:
          • Should we formalize the "MV threads should be == the lesser of readers or writers" rule in the configuration code and warn if it's set incorrectly?
          • Do we have any data around the # of batchlog-writing threads as to why 32 is a good #?
          8eaa95e: lgtm
          b4d3d99:
          • In MaterializedViewTest, updateMV can return nothing, since nobody uses the return value, which makes the driver ResultSet inclusion unnecessary.
          Benedict added a comment - edited

          So, I've been mulling on this, and I think we can safely guarantee eventual consistency even without a write-path repair. The question is only timeliness; however, this can be problematic with or without write-path repair, since DELETEs and UPDATEs for the same operation are not replicated in a coordinated fashion.

          Timeliness / Consistency Caveats

          If, for instance, three different updates are sent at once, and each base-replica receives a different update, each may be propagated onto two different nodes via repair, giving 9 nodes with live data. Eventually one more base-replica receives each of the original updates, and issues deletes for its data. So we have 6 nodes with live data, 3 with tombstones. The batchlog is now happy, however the system is in an inconsistent state until either the MV replicas are repaired again or - with write-path repair - the base-replicas are.
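          The arithmetic of that scenario can be tallied in a few lines (a sketch of the counts described above only; it does not model Cassandra's actual repair mechanics):

```python
# Tally for the scenario above: 3 competing updates, RF=3.
updates = 3

# Each update lands on one base replica, and repair propagates its
# MV row onto two more nodes: three live copies per update.
live = updates * (1 + 2)          # 9 nodes with live data

# Eventually a second base replica sees each original update and
# issues a delete, tombstoning one copy of each update's MV row.
tombstones = updates * 1          # 3 nodes with tombstones
live_after = live - tombstones    # 6 nodes still holding live data

print(live, live_after, tombstones)
```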

          Without write-path repair we may be more prone to this problem, but I don't think dramatically more, although I haven't the time to think it through exhaustively. AFAICT, repair must necessarily be able to introduce inconsistency that can only be resolved by another repair (which itself can, of course, introduce more inconsistency).

          I'm pretty sure there are a multiplicity of similar scenarios, and certainly there are less extreme scenarios. Two competing updates and one repair are enough, so long as it's the "wrong" update repaired, and to the "wrong" MV replica it's repaired to.

          Correctness Caveats

          There are also other problems unique to MV: if we lose any two base-replicas (which with vnodes means any two nodes), we can be left with ghost records that are never purged. So any concurrent loss of two nodes means we really need to truncate the MV and rebuild, or we need a special truncation record to truncate only the portion that was owned by those two nodes. This happens for any updates that were received only by those two nodes, but were then proxied on (or written to their batchlogs). This can of course affect any normal QUORUM updates to the cluster, the difference being that the user has no control over these, and simply resending your update to the cluster does not resolve the problem as it would in the current world. Users performing updates to single partitions that would have never been affected by this now also have this to worry about.

          Confidence

          Certainly, I think we need a bit of formal analysis or simulation of what the possible cluster states are. Ideally a simple model of how each piece of infrastructure works could be constructed in a single process to run the equivalent of years of operations a normal cluster would execute, to explore the levels of "badness" we can expect. That's just my opinion, but I think it would be invaluable, because after spending some spare time thinking about these problems, I think it is a very hard thing to do, and I would rather not trust our feelings about correctness.

          Multiple Columns

          As far as multiple columns are concerned: I think we may need to go back to the drawing board there. It's actually really easy to demonstrate the cluster getting into broken states. Say you have three columns, A B C, and you send three competing updates a b c to their respective columns; previously all held the value _. If they arrive in different orders on each base-replica we can end up with 6 different MV states around the cluster. If any base replica dies, you don't know which of those 6 intermediate states were taken (and probably replicated) by its MV replicas. This problem grows exponentially as you add "competing" updates (which, given split brain, can compete over arbitrarily long intervals).
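          The six-state claim is easy to verify by brute force (a small sketch that just enumerates the partially-applied states a replica could be caught in; it does not model replication itself):

```python
from itertools import permutations

# Three columns starting at "_", with one competing update each.
updates = {"A": "a", "B": "b", "C": "c"}

intermediate = set()
for order in permutations(updates):
    state = {"A": "_", "B": "_", "C": "_"}
    # Record every state short of full application: these are the
    # divergent snapshots a dying base replica could have exposed
    # (and replicated) to its MV replicas.
    for col in order[:-1]:
        state[col] = updates[col]
        intermediate.add(tuple(sorted(state.items())))

print(len(intermediate))  # distinct partially-applied states
```

Three states are reachable after one update and three more after two, for the six intermediate MV states mentioned above.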

          This is where my concern about a "single (base) node" dependency comes in, but after consideration it's clear that with a single column this problem is avoided because it's never ambiguous what the old state was. If you encounter a mutation that is shadowed by your current data, you can always issue a delete for the correct prior state. With multiple columns that is no longer possible.

          I'm pretty sure the presence of multiple columns introduces other issues with each of the other moving parts.

          Important Implementation Detail

          Had a quick browse, and I found that writeOrder is now wrapping multiple synchronous network operations. OpOrders are only intended to wrap local operations, so this should be rejigged to avoid locking up the cluster.

          TL;DR

          Anyway, hopefully that wall of text isn't too unappetizing, and is somewhat helpful. To summarize my thoughts, I think the following things are worthy of due consideration and potentially highlighting to the user:

          • Availability:
            • RF-1 node failures cause parts of the MV to receive NO updates, and remain incapable of responding correctly to any queries on their portion (this will apply unevenly for a given base update, meaning multiple generations of value could persist concurrently in the cluster)
            • Any node loss results in a significantly larger hit to the consistency of the MV (in my example, 20% loss of QUORUM to cluster resulted in 57% loss to MV)
            • Both of these are potentially avoidable by ensuring we try another node if "ours" is down, but due consideration needs to be given to if this potentially results in more cluster inconsistencies
          • Repair seems to carry the possibility of introducing inconsistent cluster states that can only be repaired by another repair (which introduces more such states at the same time), resulting in potentially lengthy inconsistencies, or a repair frequency greater than can operationally be managed right now
          • Loss of any two nodes in a vnode cluster can result in permanent inconsistency
          • Have we spotted all of the caveats?
          • Rejig writeOrder usage
          • Multiple columns need a lot more thought
          Benedict added a comment -

          If you lose any 2 nodes with vnodes you can't achieve quorum anyway.

          You're right, of course. Unfortunately, this only makes the situation worse. I guess since, as you say, we are physically incapable of reaching QUORUM for a portion, it perhaps doesn't matter if we significantly increase that portion for MVs, since there is always a portion for which that property holds. However it may be significantly worse, and in fact we will not deliver some MV updates to any node with only two nodes failing.

          Let's say we have a cluster:

          • With N nodes
          • With R replication factor
          • With 2 failing nodes
          • With infinite vnodes (to simplify calculations)
          • With F(1) ratio of token ranges overlapping between failed nodes, i.e. that cannot reach quorum
          • With F(2) ratio of token ranges involving exactly one failed node in the base table

          Now, when serving a write there are multiple scenarios:

          • Of the F(1) writes, only one base node receives an update; 2/N of any update generated by this node would be routed to one of the now gone nodes. So NO nodes receive ~ 2F(1)/N MV updates.
          • Of the F(2) writes, 2/N MV updates will target a dead node
          • Of the remaining 1 - F(1) - F(2) writes, F(1) will be incapable of reaching QUORUM for the same reason the base table could not

          Now, to derive F(1) and F(2) approximately, let's fix some basic cluster numbers: N=6, R=3. In this scenario F(1) is somewhere between 1/5 and 1/4. F(2) is approximately 4/9.

          So, using the lower bound of F(1), we have:

          • 1/15 of writes reach no MV node whatsoever
          • 4/27 do not reach QUORUM within the MV because they target a dead node (there are two such writes, so ~27% of all writes)
          • (16/45)*(1/5)=16/225 cannot reach QUORUM at the MV (there are two such writes, so ~17% of all writes)

          There are two of each MV update (delete + insert), so we have 27%+17%=44% of writes failing to reach quorum, and 13% of writes failing to reach anyone. Vs 20% of writes we would expect to not reach QUORUM, and 0% of writes to fail to reach anyone.

          Either way, at the very least we need to ensure we repair our MV portion from our base replicas before we exit the JOINING state (or whatever the state is where we do not serve reads). Otherwise QUORUM will move backwards in time once a node comes back online.

          Now, I'll grant I'm very rusty on these maths, so there are no doubt errors even in my simplification, but I think they represent the main thrust of the problem.
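          The individual fractions can at least be checked with exact arithmetic (a sketch that only reproduces the stated estimates F(1)=1/5 and F(2)=4/9 for N=6, R=3; it does not derive them):

```python
from fractions import Fraction

N = 6                       # nodes in the cluster
F1 = Fraction(1, 5)         # lower bound: ranges where both failed nodes are base replicas
F2 = Fraction(4, 9)         # ranges with exactly one failed base replica

# Writes reaching no MV node at all: the lone surviving base replica
# routes 2/N of its MV updates to the two dead nodes.
no_mv_node = 2 * F1 / N                  # 1/15

# Writes whose MV update targets a dead node.
dead_target = F2 * 2 / N                 # 4/27

# Remaining writes that still cannot reach QUORUM at the MV.
no_quorum = (1 - F1 - F2) * F1           # 16/225

print(no_mv_node, dead_target, no_quorum)
```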

          T Jake Luciani added a comment -

          so, if we lose any two nodes, one of those nodes will be a base replica, and the other will be an MV replica that is paired with one of the other base replicas for one of the token ranges in the cluster

          Keep in mind: if you lose any 2 nodes with vnodes you can't achieve quorum anyway. In any case, this leads back to the whole availability story I mentioned earlier (why doing sync writes won't work). So we have batchlogs that I do believe get us back to consistency once they replay.

          T Jake Luciani added a comment -

          Talking of that pairing, do we properly take care of topology changes?

          Yes, the SP write performers include pending nodes. I may have broken this with my performance patch, so lemme double-check.

          Benedict added a comment - edited

          I must admit, I'm becoming less and less convinced by the idea (admittedly my own) of proxying on to only one node. From an availability perspective, it seems very likely to induce a persistent mismatch between the base and MV replicas. We only need two node failures anywhere in the cluster, and we pretty much guarantee that portions of the base table and the MV begin to diverge at QUORUM with vnodes (even without vnodes, the chance is 1/RF).

          Since every node is a base replica and an MV replica, with vnodes we are likely to be replicating portions of our share of the base table on to every other node in the cluster as an MV replica:

          • for any single token range we will have picked a single node A to share with;
          • however, we cannot be certain that for every token range that replicates to A we will select A, as this would require a great deal of cooperation and forward planning (if it is even possible at all; I'm not sure we can guarantee it without incorporating it into a token allocation strategy, but I haven't thought about it extensively)
          • as such, for any node we must assume it replicates to as many distinct nodes as it has vnodes (actually a little fewer, à la the birthday paradox, but for simplicity let's assume the worst), which typically will mean the whole cluster
          • so, if we lose any two nodes, one of those nodes will be a base replica, and the other will be an MV replica that is paired with one of the other base replicas for one of the token ranges in the cluster
          • so, we will reach QUORUM for the base replica, but not for the affected MV replica token range

          edit: this is because the base replica that's failed obviously won't proxy on the operation, but the failed MV replica will obviously fail to receive it; they're disjoint, so we only have one MV replica receiving the data. One more failure in the cluster (of vnodes) and I think we're actually pretty darn likely (I haven't thought the maths through exactly) to have around 1/RF^2 of the token ranges fail to receive any of their updates, despite reaching QUORUM on the base table.
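This availability argument can be sanity-checked numerically. Below is a rough Monte Carlo sketch (not based on the actual patch; the cluster size, range count, and random base→MV pairing strategy are all assumptions made for illustration) that estimates how often losing two random nodes leaves some token range with only one MV replica receiving updates — i.e. one dead node is a base replica, and the other is the MV partner of a different, live base replica of the same range:

```python
import random

def fraction_of_failures_affecting_some_range(nodes=12, ranges=384, rf=3,
                                              trials=400, seed=42):
    """Estimate how often a random 2-node failure hits a (base replica,
    paired MV replica) pair for at least one token range."""
    random.seed(seed)
    cluster = list(range(nodes))
    # For each token range: rf base replicas, each paired with one MV replica.
    pairings = []
    for _ in range(ranges):
        base = random.sample(cluster, rf)
        pairings.append({b: random.choice([n for n in cluster if n != b])
                         for b in base})

    def range_broken(p, dead_a, dead_b):
        # dead_a is a base replica (its own proxied update is never sent) and
        # dead_b is the MV partner of a *different* base replica (so it never
        # receives that replica's proxied update), or vice versa.
        return any(
            x in p and any(c != x and mv == y for c, mv in p.items())
            for x, y in ((dead_a, dead_b), (dead_b, dead_a)))

    hits = 0
    for _ in range(trials):
        a, b = random.sample(cluster, 2)
        if any(range_broken(p, a, b) for p in pairings):
            hits += 1
    return hits / trials
```

Under these assumptions the estimate comes out at essentially 1.0: with a few hundred token ranges, virtually every two-node failure affects at least one range, which matches the "likely to be replicating on to every other node" intuition above.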

          slebresne Sylvain Lebresne added a comment -

          Also we should expose the base -> view replica pairs so operators know which ones need to be rebuilt.

          Talking of that pairing, do we properly take care of topology changes? That is, when the ownership of some of the range is moving from an MV replica A to some other node B (that is bootstrapping, for instance), do we make sure to send the MV update to both node A and node B (and we probably need to make such a check on each replay of the local batch log)?

          benedict Benedict added a comment -

          We'll need a very special kind of truncation though, as they're only paired 1:1 for a given token range.

          tjake T Jake Luciani added a comment -

          Since the base and view replicas are paired 1:1 then you only need to rebuild a single replica on corruption of the base or the view. Yes, this would need a local truncate of the view on the one node you plan to rebuild. Also we should expose the base -> view replica pairs so operators know which ones need to be rebuilt.

          benedict Benedict added a comment -

          I don't think it's that simple. The MV data is spread across the whole cluster. How do you know where the ghost records are, to delete them? You don't know what the prior state was, so you can only send the current records, not remove the old ones. Do you introduce a special "truncate" record for primary key portions that match to our tokens? Do you rebuild the whole MV? Or am I missing an obvious trick?

          tjake T Jake Luciani added a comment -

          If you lose data on the base or the view replica you will need to rebuild the corresponding MV for just that node (much like secondary indexes)

          This isn't hooked up in the patch. But the builder exists so I think it's a minor issue.

          benedict Benedict added a comment -

          The MVs are effectively split up into RF groups, each group being maintained by a single base replica. That base replica determines what the current state of those MV replicas are, so if it is lost, it isn't clear what deletes need to be sent to those nodes to bring them inline with whatever state we stream to its replacement. I don't suggest this isn't solvable, but it is another aspect we haven't yet considered AFAICT.

          jbellis Jonathan Ellis added a comment -

          The majority of use cases are going to be denormalizing what are today query tables, i.e., I want to give the client what it needs by scanning a single partition. Doing extra queries to save disk space may occasionally be necessary but it is not the norm.

          jbellis Jonathan Ellis added a comment -

          Where do we rely on a single node?

          jbellis Jonathan Ellis added a comment -

          Why not just apply MV maintenance to streamed rows the way we do 2i maintenance?

          jkrupan Jack Krupansky added a comment -

          Are there any significant advantages or disadvantages of using an MV as a pure global index - no data columns other than the primary key columns?

          Consider the use case of large customer data rows with customer id as the primary key, and you wish to log in by any of customer id, user id, email address, social security number, full name and age or birth date, and name alone, but you really want to simply immediately map any of those alternative logins to the customer id so that the main customer data tables can be accessed directly rather than having all of the data replicated in a bunch of MVs.

          So, each of the four MVs would not need any non-PK data columns per se, since the base table PK is (must be, right?) in the MV PK, I think. Does this make sense? Would there be any special efficiency (or inefficiency) to having essentially empty partitions? For example:

          CREATE TABLE cust (id text, email text, ssn text, name text, address text, zip text, birth timestamp, data map<text,text>, pwd text, PRIMARY KEY (id));
          CREATE MATERIALIZED VIEW email AS SELECT id,email FROM cust PRIMARY KEY (email, id);
          CREATE MATERIALIZED VIEW ssn AS SELECT id,ssn FROM cust PRIMARY KEY (ssn, id);
          CREATE MATERIALIZED VIEW name AS SELECT id,name FROM cust PRIMARY KEY (name, id);
          CREATE MATERIALIZED VIEW name_zip_birth AS SELECT id,name,zip,birth FROM cust PRIMARY KEY ((name,zip,birth), id);
          

          Incidentally, the lookup by name alone would not necessarily be unique - it might not be for an end-user login per se but for a customer service agent who would view the list and then ask the customer some questions to narrow down which specific customer they are.

          Does this specific use case represent what might be considered a best practice use of MVs? If not, why not or what improvements could be made?

          benedict Benedict added a comment -

          ... Except, we also need to consider recovery from entire node failure. If a single node is the record of what deletes need to be sent to the other nodes, we're a bit stuffed actually. Mutation based repair won't save us there.

          benedict Benedict added a comment -

          It looks like mutation based repair might well solve the repair problem. Although we might have to disable repair for MV replicas.

          Question is: can we wait for it?

          benedict Benedict added a comment -

          To clarify: my comment was meant to address both correctness and performance. From a correctness point of view it is approximately equivalent to what we have now, it just merges the batchlog work with other work to remove synchronous round trips.

          tjake T Jake Luciani added a comment -

          The repair bypassing the mutation stage is something we need to address with this patch. But it makes me think we need CASSANDRA-8911

          tjake T Jake Luciani added a comment -

          If the base table update on a replica has been updated, but its paired MV has not, it doesn't matter what the coordinator does with replay, as the base replica will not apply any delta.

          I think you two are missing something about the current implementation. This problem of updating the base but not the MV is why we added the second local batchlog for the Base to MV mutations.

          When the mutation comes into the base replica:

          • Get lock for partition
          • Generate the delta mutations for the paired view replica
          • Update the base replica commit log
          • Write the delta mutations for the paired MV replica on the local base BL
          • Fire the mutations to the paired MV replica async
          • Update the base memtable

          We know the data will eventually get to the paired MV because of the batchlog. That's why we can send the MV updates async.
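          The steps above can be sketched roughly as follows. This is a toy illustration with invented class and method names, not the actual StorageProxy/Keyspace code; in particular the delta generation is simplified to "tombstone the old view entry, upsert the new one", and the batchlog here is an in-memory list where the real one is durable:

```python
import threading
from collections import defaultdict

class MVReplica:
    """Stand-in for the paired view replica; just records what it receives."""
    def __init__(self):
        self.received = []
    def send_async(self, deltas):
        self.received.extend(deltas)  # in reality an async message, fire-and-forget

class BaseReplica:
    def __init__(self, paired_mv):
        self.locks = defaultdict(threading.Lock)
        self.commit_log = []
        self.batchlog = []        # local batchlog of MV deltas, replayed on failure
        self.memtable = {}
        self.paired_mv = paired_mv

    def view_deltas(self, key, new_row):
        """Simplified delta generation: remove the stale view entry, add the new one."""
        old_row = self.memtable.get(key)
        deltas = []
        if old_row is not None and old_row != new_row:
            deltas.append(('tombstone', key, old_row))
        deltas.append(('upsert', key, new_row))
        return deltas

    def apply(self, key, new_row):
        with self.locks[key]:                        # 1. get lock for partition
            deltas = self.view_deltas(key, new_row)  # 2. generate MV deltas
            self.commit_log.append((key, new_row))   # 3. base replica commit log
            self.batchlog.append(deltas)             # 4. local batchlog entry
            self.paired_mv.send_async(deltas)        # 5. fire to paired MV async
            self.memtable[key] = new_row             # 6. update base memtable
```

          The ordering is the point: because the (durable, in the real system) batchlog write in step 4 precedes the async send in step 5, a failure between those steps still leaves the deltas available for batchlog replay — which is exactly the property the comment relies on for sending MV updates async.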

          benedict Benedict added a comment - edited

          +Inf to what Sylvain just said. We should definitely step back and reassess all the options.

          Now, I think it is probably sufficient for correctness to perform our delta generation on reconciliation of repaired with unrepaired data. Or, more simply: to treat all repaired data we receive as needing delta-generation performed on it. I don't think at first blush this is more or less onerous than would be required for any other approach, but I may be wrong. This definitely needs a lot of thought.

          slebresne Sylvain Lebresne added a comment -

          We have designed it as async

          Unless I've missed some part of how "we designed it", that's actually the opposite. In fact, the reasoning I expressed in my previous comment, which to the best of my knowledge is the reasoning behind the design currently implemented, was relying on sync MV updates for correctness.

          I say "was" because all this probably doesn't really matter since Benedict has a point: with or without sync MV updates, there are holes in this design. Both because, to repeat what he says, replaying mutations from the coordinator batch log doesn't work if the mutation was applied on the base replica but not the paired MV, but also because repair can screw us up by propagating base-table updates behind our backs without going through the MV update code path. In other words, the design as implemented is not eventually consistent (and in the case of the repair problem, you don't even need any failure to get permanent inconsistencies, just unlucky timing).

          And I don't see an easy tweak to the implemented design that would fix both problems. Benedict's suggestion above of "batch-logging deltas" might work for the first problem, but we'd need to look at the details and convince ourselves it works in all cases before even thinking of implementing it (at which point it will unlikely be a minor change to the current patch), and that still leaves the repair problem to solve. And so it kind of puts us back at the drawing board, and I think ideas around using cell reconciliation to generate fixes to MVs (as was initially considered) would also be worth reconsidering (I'm not saying I have anything close to a complete design in mind, and there are for sure a number of problems to solve in that case too, but it does feel like it would handle the repair problem natively if we can make it work, which is a plus).

          Anyway, I would humbly suggest we focus on that problem first, because unless we're willing to release MVs with "hopeful consistency" (which, to put it mildly, I'm not a fan of), this might well compromise a 3.0 release.

          jshook Jonathan Shook added a comment -

          The comment about adding a hop was with respect to what users would currently be doing to maintain multiple views of data. They don't expect that there is a proxy for their writes, no matter whether they are using async or not, batches or not.

          jshook Jonathan Shook added a comment -

          This goes directly to my point. It would be ideal if we simply allow users to simplify what they are already doing with the least amount of "special" handling we can add to the mix. In terms of solving the problem in a way that users understand, we must strive to compose a solution from the already established primitives that we teach users about all the time. Any failure modes should be explained in those terms as well. Other approaches are likely to create more "special cases", which I think we all can agree are not good for anybody.

          jbellis Jonathan Ellis added a comment -

          No. If we do add a sync option it will not be the default so there are no forward compatibility issues.

          jbellis Jonathan Ellis added a comment -

          That's because that's the special case where the user knows there is no earlier value to delete out of the MV. We can't assume that here until we have CASSANDRA-9779.

          If you're hand-rolling an MV that needs to deal with updated values you'll hit the exact same issues we're discussing here.

          That is if you think it through enough. I would guess that most people do not.

          benedict Benedict added a comment -

          As well, it does add potentially another hop,

          No, it removes a hop. Several hops: Currently we write to the batch log synchronously before we do anything else. In fact we do it twice. These are all double-hops. I'm proposing removing all of them, by exploiting the fact that typically the coordinator is an owner, and so as soon as two owners have the mutations in their local batch logs we're about as good as we'll get that these mutations will be applied, and if they aren't going to be applied they also won't partially apply.

          To make a more general statement: I think there is perhaps a failure to appreciate the difficulty of the problem Sylvain was raising, around ensuring we do not end up with a corrupted MV that we have no idea how to repair. In fact, I very much doubt we are going to achieve this no matter how hard we try, and this is going to end up making us look pretty bad. If a user fails to keep their views in sync, that's one thing, but if we do it's another. I'm talking about even eventually consistent here. There are a lot of nook and cranny failure scenarios and that's before we consider repair. Which, as I say, I suspect completely ruins us. Ironically, of course

          brianmhess Brian Hess added a comment -

          Let's go back to the basic use case that this is supposed to replace/help/make better. The case where we want two query tables for the same data. That is, they have the same primary keys, but different partition keys (and clustering column orders).

          Today, I would do this by having a logged batch for the insert and that batch would insert into each of the two query tables. With this I get some data consistency guarantees. For example, if the client returns "success", I know that both of the inserts were accepted at the desired consistency level. So, if I did 2 writes at CL_QUORUM, and I receive a "success", then I know I can then do a CL_QUORUM read of either table and see the most recent data.

          However, with this "asynchronous" MV approach, I no longer get this behavior. I write to the base table at CL_QUORUM and get the "success" return. At that point, I can do a CL_QUORUM read from the base table and see the most recent insert. However, if I do a CL_QUORUM read from the MV, I have no guarantees at all.

          This approach does not address the basic situation that we are trying to cover.
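          For reference, the hand-rolled pattern being compared against looks something like the following (table and column names invented for illustration) — one logged batch maintaining two query tables over the same data:

```cql
-- Two query tables for the same data: same primary key columns,
-- different partition keys.
CREATE TABLE users_by_id    (id text PRIMARY KEY, email text, name text);
CREATE TABLE users_by_email (email text PRIMARY KEY, id text, name text);

-- One logged batch per insert. If this returns success at QUORUM,
-- a subsequent QUORUM read of either table sees the new row.
BEGIN BATCH
  INSERT INTO users_by_id    (id, email, name) VALUES ('u1', 'a@b.c', 'Ann');
  INSERT INTO users_by_email (email, id, name) VALUES ('a@b.c', 'u1', 'Ann');
APPLY BATCH;
```

          It's that read-your-write guarantee on the second table that the asynchronous MV propagation gives up.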

          rustyrazorblade Jon Haddad added a comment -

          May I then suggest that we add that as a clause to the CREATE MATERIALIZED VIEW statement, so it's 1) explicit and 2) is future proofed?

          jshook Jonathan Shook added a comment - edited

          If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic:

          • Logged batches are not commonly used to wrap a primary table with its query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case.
          • The same CL is generally used for all related tables.
          • Savvy users will do this with async with the same CL for all of these operations.

          So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because they are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults.

          It doesn't really seem necessary that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. As well, it does add potentially another hop, which I do have concerns about with respect to the above.
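The manual pattern described above can be sketched as follows. This is a minimal illustration (not real driver code); the table names and `denormalized_writes` helper are hypothetical, but they show the key point: the base write and every hand-rolled query-table write share the caller's single consistency level and are issued together.

```python
# Minimal sketch of the current manual practice MVs aim to replace:
# one base write plus one write per query table, all at the same CL,
# fired asynchronously by the client. Names here are illustrative.
from dataclasses import dataclass

@dataclass
class Write:
    table: str
    key: str
    consistency: str

def denormalized_writes(user_id, email, consistency="QUORUM"):
    """Build the base write plus one write per hand-rolled query table."""
    return [
        Write("users", user_id, consistency),
        Write("users_by_email", email, consistency),  # query table keyed by email
    ]

writes = denormalized_writes("u1", "u1@example.com")
# A real client would submit these with execute_async and wait on the
# futures together; every write carries the same consistency level.
assert all(w.consistency == "QUORUM" for w in writes)
```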

          jbellis Jonathan Ellis added a comment - - edited

          I'm still skeptical that sync MV updates are useful. I get that people would like to have them, but it's a CAP problem. You only get sync updates if you are willing to throw out availability very very quickly. Async replication to MV may be surprising, but UnavailableException for CL.ONE writes when a single machine goes down is more surprising.

          (FWIW I have consistently told people that our design goal is for MV to be eventually consistent with the base data and everyone has been fine with that.)

          So: we can keep sync as an option for the future, but for 3.0 we should not scope creep this. We have designed it as async which is the right thing for 90% of use cases and that is what we should ship.

          rustyrazorblade Jon Haddad added a comment -

          Personally I think it would be fine as a setting on the view itself.

          benedict Benedict added a comment -

          OK, so I think we have a lot of competing goals that are not being discussed in a clearly delineated fashion:

          • How do we ensure the MV is not corrupted
          • How do we maintain AP
          • How do we best honour user consistency level
          • How do we do it quickly?

          Unfortunately, goals (1) and (2) are directly opposed to each other. I hadn't originally envisaged having 20+ MV, but if that's on the cards, (1) really must reign supreme. Otherwise we will end up with a CL.LITERALLY_EVERYONE scenario.

          The problem here is that the coordinator-level batch log is almost useless, and in fact I think absolutely guaranteeing no-MV corruption is nigh impossible. If the base table update on a replica has been updated, but its paired MV has not, it doesn't matter what the coordinator does with replay, as the base replica will not apply any delta.

          So, I'm pretty sure we can drop the coordinator batch log. What I propose instead, is to always ensure an owner is the coordinator: if a non-owner coordinates, it just proxies it on to one (or more, if a response is slow) owning coordinators.

          Loosely speaking, it can then:

          1. write the mutation to the other base-replicas;
          2. perform its local read-before-write, and write to the local batchlog the total set of MV deltas it will apply (along with the base mutation)
          3. write to its paired replica for one of its MVs both the MV mutation and the whole batchlog of mutations, including mutations for the other base replicas
          4. once any of these respond (the other base replicas, or the MV replica), we're GTG as we have confidence that two or more base replicas will receive the mutation, and so we write to the remaining MV replicas

          The non-coordinator base-replicas just perform all of their work at once: they write their deltas (and self mutation) to the local batchlog, then write this batchlog to one of their MV replicas simultaneously to sending all of their updates. This is safe because we already know the coordinator has written this to their batchlog, and so our doing so is enough to reach QUORUM for commit.

          The main advantage of this is that we have no synchronous operations, but we still reasonably guarantee we eventually reach consistency - at least as well as we do currently (I'm not 100% familiar on how we write to the local batch log currently, but ensuring we write deltas at-once for all MVs is critical to correctness). The reason we do not need to perform any synchronous batchlog writes is that if, for whatever reason, we lose all of the nodes involved, then it does not matter that the batchlog records never made it: nor can the base mutations, nor any follow-on MV mutations. The slate is wiped clean.

          I have to say, though: I'm more than a little worried about how repair factors into this equation. I kind of suspect we need to guarantee all of these logs are empty before we run repair, or we need to produce deltas on receipt of repaired data. This is true of all of the outlined approaches.
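The coordinator-side flow in steps 1-3 above can be sketched as a toy model. This is not Cassandra code; it assumes, as proposed, that the coordinator is itself a base replica, and all names (`coordinate`, `paired_view_replica`) are illustrative.

```python
# Toy model of the proposed flow: the coordinating base replica logs the
# base mutation together with every MV delta, forwards the base mutation
# to the other base replicas, and sends one paired MV replica both its
# delta and the whole batchlog. Any single ack then means two nodes hold
# the full intent (step 4).
def coordinate(mutation, other_base_replicas, view_deltas, paired_view_replica):
    """Return (local batchlog entry, outbound messages) for one write."""
    # Step 2: the local batchlog entry records all MV deltas at once.
    batchlog = {"base": mutation, "view_deltas": list(view_deltas)}
    # Step 1: forward the base mutation to the other base replicas.
    messages = [(replica, mutation) for replica in other_base_replicas]
    # Step 3: the paired MV replica also receives the whole batchlog.
    messages.append((paired_view_replica, ("mv-delta", batchlog)))
    return batchlog, messages

log, msgs = coordinate("m1", ["n2", "n3"], ["d1", "d2"], "mv1")
assert len(msgs) == 3  # two base replicas plus the paired MV replica
```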

          tupshin Tupshin Harper added a comment - - edited

          OK, so let me summarize my view of the conflicting viewpoints here

          1. If the MV shares the same partition key (and only reorders the partition based on different clustering columns), then the problem is relatively easy. Unfortunately the general consensus is that a common case will be to have different partition keys in the MV than the base table, so we can't support only that easy case.
          2. If the MV has a different partition key than the base table, then there are inherently more nodes involved in fulfilling the entire request, and we have to address that case.
          3. As T Jake Luciani and Jonathan Ellis say, the more nodes involved in a query, the higher the risk of unavailability if the MV is updated synchronously.
          4. Some use cases expect synchronous updates (as argued by Jon Haddad and Brian Hess).
          5. But other use cases definitely do not. I think it is absurd to say that just because a table has a MV, every write should care about the MV. Even more absurd to say that adding an MV to a table will reduce the availability of all writes to the base table.

          Given all of those, the conclusion that both sync and async forms are necessary seems totally unavoidable.

          Ideally, I'd like to see an extension of what Aleksey Yeschenko proposed above but be much more thorough and flexible about it.

          If each request were able to pass multiple consistency-level contracts to the coordinator, each one could represent the expectation for a separate callback at the driver level.
          e.g. A query to a table with a MV could express the following compound consistency levels.

           {LQ, LOCAL_ONE{DC3,DC4}, LQ{MV1,MV2}} 

          That would tell the coordinator to deliver three separate notifications back to the client. One when LQ in the local dc was fulfilled. Another when at least one copy was delivered to each of DC3 and DC4, and another when LQ was fulfilled in the local dc for MV1 and MV2.

          and yes, you would need more flexible syntax that could express per-dc per table consistency, e.g.

          LQ{DCs:DC3,DC4,VIEWS:MV1,MV2}

          I realize that this is a very far-fetched proposal, but I wanted to throw it out there as, imo, it reflects the theoretically best option that fulfills everybody's requirements. (and is also a very general mechanism that could be used in other scenarios).

          Short of that, I don't think there is any choice but to support both sync and async forms of writes to tables with MVs.

          One more point (not to distract from the above). With the current design of MVs, there will always be risk of inconsistent reads (timeouts leaving data queryable in the primary table but not in one or more MVs) until the data is eventually propagated to the MV. While it would be at a high cost, RAMP would still be useful to provide read isolation in that scenario.
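The compound consistency contracts proposed above could be modeled as simple data. This is only an illustrative data model; the levels, DC names, and view names (`LQ`, `DC3`, `MV1`) are the comment's hypothetical syntax, not existing CQL.

```python
# Data model for the compound-CL idea: one request carries several
# consistency contracts, and the coordinator fires one client callback
# per fulfilled contract. All names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    level: str          # e.g. "LQ", "LOCAL_ONE"
    scope: tuple = ()   # remote DCs or views; empty means local-DC base table

# {LQ, LOCAL_ONE{DC3,DC4}, LQ{MV1,MV2}} from the example above:
contracts = [
    Contract("LQ"),                          # local-quorum base write done
    Contract("LOCAL_ONE", ("DC3", "DC4")),   # one copy in each remote DC
    Contract("LQ", ("MV1", "MV2")),          # views reach local quorum
]
assert len(contracts) == 3  # three separate notifications to the client
```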

          iamaleksey Aleksey Yeschenko added a comment -

          As an AP database that prides itself on scalability, high availability, and decent write throughput, I don't see how we can ship with sync - a requirement that obliterates both availability and throughput - at least by default.

          Maybe there is a place for a second per-query CL parameter (the way we have with LWT queries), that would specify a CL for MV mutations (that would default to ANY), maybe not.

          rustyrazorblade Jon Haddad added a comment -

          It seems odd to me that as a user I'd ask for a certain consistency level and I would get a successful response that was essentially footnoted with "not really".

          Perhaps I was using batching incorrectly, but in my experience I found it useful to keep multiple views of my data up to date, and I was doing so at QUORUM because I needed to be strongly consistent. Not having it even as an option kills a lot of use cases.

          This next part is up for debate, but is based on the conversations and questions I've had with people at Cassandra Day. This feature is a big deal for people coming from the RDBMS world - probably more so than existing users. There's usually quite a bit of discussion around this topic. At my last talk, I brought up materialized views and people bit onto it like a dog with a bone. I feel like mandatory async is a strange caveat that would be unexpected for these people.

          slebresne Sylvain Lebresne added a comment -

          Well, I'm gonna guess that the absence of synchronous MV updates will surprise and make more than one user sad, but current time constraints force me to choose my battles carefully so I'm gonna leave it at that.

          tjake T Jake Luciani added a comment -

          The batchlog is to guarantee eventual consistency of the MV. The user CL is mostly to guarantee that you can see your writes. I don't understand why you'd think the former would prevent the latter.

          Sylvain Lebresne the inline reply method broke

          Sorry, I was referring to the batchlog feature in general from a user perspective.

          jbellis Jonathan Ellis added a comment -

          I disagree about making synchronous the default. As Jake points out that can kill your availability even on a single MV if you are unlucky with replica placement, and it's virtually guaranteed to kill it with many MV. I would go so far as to say that synchronous MV updates are not useful and we should not bother adding it.

          slebresne Sylvain Lebresne added a comment -

          The batchlog is to guarantee eventual consistency of the MV. The user CL is mostly to get guarantee that you can see your writes. I don't understand why you'd think the former would prevent the latter.

          But anyway, I'll retract part of my previous comment. I'm not really that opposed to having the option of doing asynchronous MV updates. Rather I feel that 1) doing synchronous updates (which is equivalent to respecting the CL for the MV update too) should be the default because that's what people will expect (as exemplified by Jon's and Brian's responses above) and 2) that we should discuss the addition of such an option in a followup ticket. The reason for the latter is that I think there are multiple ways to do this, and I typically don't like the idea of making it a table option at all, so I don't see the point of discussing this here.

          tjake T Jake Luciani added a comment -

          Frankly, I'm pretty negative on adding such option.

          But then why do we even offer the batchlog at all? Hand-rolled materialized views use them. And if you feel we should guarantee a consistency level then you would never use a batchlog, since any timeout would mean you didn't achieve your consistency level and you must retry.

          If you are talking about just the UE then we could check the MV replica UP/Down status in the coordinator as well as the base.

          carlyeks Carl Yeksigian added a comment -

          Tupshin Harper Because we are no longer implementing this as a non-denormalized global index, we don't have multiple partitions to read, so RAMP unfortunately won't solve problems in a materialized view.

          tupshin Tupshin Harper added a comment -

          Just a reminder (since it was a loong time ago in this ticket), that we were going to target immediate consistency once we could leverage RAMP, and not before.

          jkrupan Jack Krupansky added a comment -

          multiple MVs being updated

          It would be good to get a handle on what the scalability of MVs per base table is in terms of recommended best practice. Hundreds? Thousands? A few dozen? Maybe just a handful, like 5 or 10 or a dozen?

          I hate it when a feature like this gets implemented without scalability in mind and then some poor/idiot user comes along and tries a use case which is way out of line with the implemented architecture but we provide no guidance as to what the practical limits really are (e.g., number of tables - thousands vs. hundreds.)

          It seems to me that the primary use case is for query tables, where an app might typically have a handful of queries and probably not more than a small number of dozens in even extreme cases.

          In any case, it would be great to be clear about the design limit for number of MVs per base table - and to make sure some testing gets done to assure that the number is practical.

          And by design limit I don't mean a hard limit where more will cause an explicit error, but where performance is considered acceptable.

          Are the MV updates occurring in parallel with each other, or are they serial? How many MVs could a base table have before the MV updates effectively become serialized?

          carlyeks Carl Yeksigian added a comment -

          There will be one BL entry on the coordinator, and one on each base replica; all MV updates will be batched together on those replicas.
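The batching described above can be sketched in a few lines. This is an illustrative sketch, not Cassandra's actual batchlog format: deltas for every MV of a base table share a single batchlog entry per replica, rather than one entry per view.

```python
# Sketch: one batchlog entry per replica carries the base mutation plus
# the deltas for all of its MVs, so adding views does not add entries.
def batchlog_entry(base_mutation, mv_deltas):
    """mv_deltas maps view name -> delta; one entry carries them all."""
    return {"base": base_mutation, "views": dict(mv_deltas)}

entry = batchlog_entry("m1", {"mv_by_email": "d1", "mv_by_age": "d2"})
assert set(entry["views"]) == {"mv_by_email", "mv_by_age"}  # both views, one entry
```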

          slebresne Sylvain Lebresne added a comment -

          So it makes sense to have replication paired.

          Sure, didn't imply otherwise, I just wasn't aware we were doing it.

          I do think waiting for the MV updates to be synchronous will cause a lot more timeouts and write latency (on top of what we have now). But if it's optional then people can choose.

          Frankly, I'm pretty negative on adding such an option. I think there are some basic guarantees that shouldn't be optional, and the CL ones are amongst those. Making it optional will have people shoot themselves in the foot all the time. At the very least, I would ask that we don't include such an option on this ticket (there is enough stuff to deal with) and open a separate ticket to discuss it (one on which we'd actually benchmark things before assuming there will be timeouts).

          jbellis Jonathan Ellis added a comment -

          If there are multiple MVs being updated, do they get merged into a single set of batchlogs? (I.e. Just one on coordinator, one on each base replica, instead of one per MV.)

          jbellis Jonathan Ellis added a comment -

          No, you're right. Synchronous MV updates is a terrible idea, which is more obvious when considering the case of more than one MV. In the extreme case you could touch every node in the cluster.

          tjake T Jake Luciani added a comment -

          We'll only replay it if less than a quorum of nodes answer a particular query, which should be pretty rare unless you have bigger problems with your cluster.

          That said, I hadn't seen that we'd decided to go with pairing of base replica to MV replica.

          If we replicate to every MV replica from every base replica the write amplification gets much worse causing more timeouts. So it makes sense to have replication paired.

          I do think waiting for the MV updates to be synchronous will cause a lot more timeouts and write latency (on top of what we have now). But if it's optional then people can choose.

          slebresne Sylvain Lebresne added a comment -

          why not also pair it with RF-2 (or 1, and only support RF=3 for now) partners, to whom it requires the first write to be propagated, without which it does not acknowledge? This could be done with a specialised batchlog write, that goes to the local node and the paired MV node.

          I think I get a vague idea of what you mean but I'm not fully sure (and I'm not fully sure it's practical).

          So let's first make sure I understand. Is the suggestion that, to guarantee that if a base-table replica applies an update then RF/2 other ones also do, we'd send the update to all base table replicas "normally" (without coordinator batchlog), but each replica would 1) write the update to a local-only batchlog and 2) forward the update to RF/2 other base table replicas?

          slebresne Sylvain Lebresne added a comment -

          I don't really think the cost of replaying the coordinator BL matters that much. We'll only replay if less than a quorum of nodes answer a particular query, which should be pretty rare unless you have bigger problems with your cluster. And given the local BL has a cost on every write, even if small, I don't think that from a performance perspective a local BL is a win.

          That said, I hadn't seen we'd decided to go with pairing of base replica to MV replica. Doing so does justify a local BL (another option has always been to fan out to every MV replica, and since this ticket desperately misses a good description of what exact algorithm is actually implemented, I wasn't sure which option we went with).

          slebresne Sylvain Lebresne added a comment -

          Pedantically you are correct. Which is why I said "effectively" and not "literally."

          Well, I mean, CL is always more about "when do we answer the client" than "what amount of work we do internally". Every write is always written to every replica, for instance; the CL is just a matter of how long we wait before answering the client. I'm arguing this is exactly the case here too. Anyway, your "the other side of that coin" made it sound like we were doing something unusual regarding the CL, something that may not be desirable. I don't understand what that is, if that's the case.

          tjake T Jake Luciani added a comment -

          1. This goes way back to Benedict's main concept: https://issues.apache.org/jira/browse/CASSANDRA-6477?focusedCommentId=14039757&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039757

          We have each base-table replica send the MV update to a single MV replica, so replicas are "paired" 1:1.

          2. Since the coordinator BL is against a QUORUM of all base replicas, which will always send to MV replicas, we have a lot more work to do than only sending a failed base-to-view update.
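
          The 1:1 pairing in point 1 can be modeled minimally as follows. This is only a sketch (Cassandra actually derives the pairing from positions in the ring, not from list indices), but the key property is the same: each base replica sends one view message to exactly one view replica.

```python
def paired_view_replica(base_replicas, view_replicas, local_node):
    """Pair the i-th base replica with the i-th view replica (1:1).

    Toy model only: the real pairing is derived from ring positions,
    but either way each base replica targets a single view replica,
    so write amplification stays at one view message per base write.
    """
    return view_replicas[base_replicas.index(local_node)]

base = ["n1", "n2", "n3"]   # base-table replicas for some token (made up)
view = ["n4", "n5", "n6"]   # view-table replicas for the derived key (made up)
print(paired_view_replica(base, view, "n2"))  # n5
```

          Fanning out from every base replica to every view replica would instead send RF messages per base replica, which is the write amplification the pairing avoids.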

          jbellis Jonathan Ellis added a comment -

          Pedantically you are correct. Which is why I said "effectively" and not "literally."

          jbellis Jonathan Ellis added a comment -

          1. Paired replica? What?

          2. Under what conditions does replica BL save you from replaying coordinator BL?

          slebresne Sylvain Lebresne added a comment -

          But the other side of that coin is, we're effectively promoting all operations to at least QUORUM regardless of what the user asked for...

          We're not. In the description I made above, we need to wait on QUORUM response to remove from the batchlog, but we don't need to wait on QUORUM to respond to the user. Unless my reasoning is broken, we do respect the CL levels exactly as we should.

          jbellis Jonathan Ellis added a comment -

          From a user's perspective, I agree with Sylvain that the MV should respect the CL. I wouldn't expect to do a write at ALL, then do a read and get an old record back.

          But the other side of that coin is, we're effectively promoting all operations to at least QUORUM regardless of what the user asked for...

          benedict Benedict added a comment -

          We could support both approaches in terms of a new flag?

          I think permitting faster operation with only "eventual consistency" guarantees is a good thing, since most users doing their own denormalisation probably get no better than that. A flag on construction?

          tjake T Jake Luciani added a comment -

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it? If my reasoning above is correct, the coordinator batchlog is enough to guarantee durability and eventual consistency because we will replay the whole mutation until a QUORUM of replica acknowledges success.

          Yes: if we error out when the base is unable to replicate to the view, then the second BL is redundant. However, there are a few reasons why we did what we did.

          1. Your availability is cut in half when you use an MV with these guarantees. Say I have a 5 node cluster with RF=3 and I want to write at CL.ONE. If I have an MV, I can no longer handle two down nodes, since the paired view replica of the one remaining base node might be down.
          2. The cost of replaying the coordinator BL is much higher than replaying the base to replica BL since the latter is 1:1.

          I do agree there is a disconnect in terms of consistency level when using the MV, but the batchlog feature was written to handle this. We could support both approaches with a new flag? Or are we willing to take a hit on availability?
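
          Point 1 can be checked by brute force. The toy model below (node names, replica placement, and the pairing are all made up) enumerates two-node failures in a 5-node, RF=3 cluster: without the MV guarantee every 2-down combination still accepts a CL.ONE write, while with a synchronous paired-view requirement 3 of the 10 combinations block it.

```python
from itertools import combinations

nodes = ["n1", "n2", "n3", "n4", "n5"]
base_replicas = ["n1", "n2", "n3"]           # RF=3, hypothetical placement
pair = {"n1": "n2", "n2": "n3", "n3": "n4"}  # illustrative 1:1 base->view pairing

def survives(down, with_mv):
    # A CL.ONE write succeeds if some base replica is up; with a
    # synchronous MV guarantee its paired view replica must be up too.
    return any(b not in down and (not with_mv or pair[b] not in down)
               for b in base_replicas)

def ok_count(k, with_mv):
    # How many of the C(5, k) k-node failure combinations the write survives.
    return sum(survives(set(d), with_mv) for d in combinations(nodes, k))

print(ok_count(2, with_mv=False), ok_count(2, with_mv=True))  # 10 7
```

          The exact numbers depend on the chosen pairing, but the shape of the result is the point: the synchronous guarantee turns some previously survivable failure combinations into write failures.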

          benedict Benedict added a comment -

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it?

          The thing is, the coordinator-level batchlog write is quite expensive. It seems we've paired each node with one MV node, but here's an idea: why not also pair it with RF-2 (or 1, and only support RF=3 for now) partners, to whom it requires the first write to be propagated, without which it does not acknowledge? This could be done with a specialised batchlog write, that goes to the local node and the paired MV node. That way, most importantly, we do not have to wait synchronously for the batchlog records to be written: if they're lost, then the corruption caused by their loss is also lost.

          brianmhess Brian Hess added a comment -

          +1 I think that is the "promise" of the MV.

          rustyrazorblade Jon Haddad added a comment -

          From a user's perspective, I agree with Sylvain that the MV should respect the CL. I wouldn't expect to do a write at ALL, then do a read and get an old record back.

          slebresne Sylvain Lebresne added a comment -

          Made the base -> view mutations async. Once we write to the local batchlog, we don't care if the actual mutations are sent, it's best effort. So we can fire and forget these and update the base memtable.

          That's correct from an eventual consistency point of view, but I'm pretty sure this breaks the CL guarantees for the user. What we want is that if I write to the base table at CL.QUORUM, and then read my MV at CL.QUORUM, then I'm guaranteed to see my previous update. But that requires that each replica does synchronous updates to the MV, and with the user's CL. Writing a local batchlog is not enough, in particular since it doesn't give any kind of guarantee of the visibility of the update. See my next comment though on that local batchlog.

          Made the Base -> View batchlog update local only

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it? If my reasoning above is correct, the coordinator batchlog is enough to guarantee durability and eventual consistency because we will replay the whole mutation until a QUORUM of replica acknowledges success.

          slebresne Sylvain Lebresne added a comment -

          Why do we need this at all? Since replicas are in charge of updating MVs, normal hints should perform the same function as batchlog, except without the performance hit in the normal case.

          Allow me to sum up how we deal with consistency guarantees, why we do it this way, and why I don't think hints work. I'm sorry if this response is a bit verbose, but as this is the most important thing in this ticket imo, I think it bears repeating and making sure we're all on the same page.

          The main guarantee we have to provide here is that MVs are eventually consistent with their base table. In other words, whatever failure scenarios we run into, we should never have an inconsistency that never gets resolved. The canonical example of why this is not a given: we have a column c = 2 in the base table that is also in an MV PK, and we have 2 concurrent updates A (sets c = 3) and B (sets c = 4). Without any kind of protection, we could end up with the MV permanently having 2 entries, one for A and one for B, which is incorrect (it should eventually converge to the update that has the biggest timestamp, since that's what the base table will keep). To the best of my knowledge, there are 2 fundamental components to avoiding such permanent inconsistency in the currently written patch/approach:

          1. On each replica, we synchronize/serialize the read-before-write done on the base table. This guarantees that we won't have A and B racing on a single base-table replica. Or, in other words, if the same replica sees both updates (where "sees" means "do the read-before-write-and-update-MV-accordingly" dance), then it will properly update the MV. And since each base-table replica updates all MV-table replicas, it's enough that a single base-table replica sees both updates to guarantee eventual consistency of the MV. But we do need to guarantee at least one such base-table replica sees both updates, and that's the 2nd component.
          2. To provide that latter guarantee, we first put each base-table update that includes MV updates in the batchlog on the coordinator, and we only remove it from the batchlog once a QUORUM of replicas have acknowledged the write (this is importantly not dependent on the CL; eventual consistency must be guaranteed whatever CL you use). That guarantees us that until a QUORUM of replicas have seen the update, we'll keep replaying it, which in turn guarantees us that for any 2 updates, at least one replica will have "seen" them both.

          Now, the latter guarantee cannot be provided by hints because we can't guarantee hint delivery in the face of failures. Typically, if I write a hint on a node and that node dies in a fire before the hint is delivered, it will never be delivered. We need a distributed hint mechanism if you will, and that's what the batchlog gives us.
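
          The rule in point 2 can be sketched in a few lines; everything below (function names, node names, the send callback) is illustrative, not the Cassandra API. The only property the sketch demonstrates is that the batchlog entry survives until a quorum has acknowledged the mutation.

```python
def coordinator_write(mutation, replicas, batchlog, send):
    """Sketch of point 2: the mutation stays in the coordinator-side
    (replicated) batchlog until a QUORUM of base replicas acknowledges
    it, regardless of the client-requested CL. Names are illustrative."""
    quorum = len(replicas) // 2 + 1
    batchlog.append(mutation)                # durable before replica writes
    acks = sum(bool(send(r, mutation)) for r in replicas)
    if acks >= quorum:
        batchlog.remove(mutation)            # safe to forget: a quorum saw it
    # Otherwise the entry stays and gets replayed later, so for any two
    # updates at least one replica eventually read-before-writes both.

batchlog = []
coordinator_write("m1", ["n1", "n2", "n3"], batchlog, send=lambda r, m: r != "n3")
print(batchlog)  # [] : 2/3 acks is a quorum, entry removed
coordinator_write("m2", ["n1", "n2", "n3"], batchlog, send=lambda r, m: r == "n1")
print(batchlog)  # ['m2'] : only 1/3 acked, kept for replay
```

          Because replay keeps retrying until a quorum acknowledges, any two updates to the same row are guaranteed to intersect on at least one replica, which is the precondition point 1 relies on.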

          jbellis Jonathan Ellis added a comment -

          Removed the need for a Coordinator batchlog when running RF=1

          Why do we need this at all? Since replicas are in charge of updating MVs, normal hints should perform the same function as batchlog, except without the performance hit in the normal case.

          tjake T Jake Luciani added a comment -

          I think a better test would be a BL of size 5 vs no BL with 5 MVs; will run that next.

          tjake T Jake Luciani added a comment - - edited

          We've made some performance related changes to the write path. I'll walk through the results.
          (thanks to Alan Boudreault for running these benchmarks)

          • Fixed timeout bug (this caused all the timeout exceptions we were seeing)
          • Removed the need for a Coordinator batchlog when running RF=1
          • Made the Base -> View batchlog update local only. Since each base replica is paired to a view replica, we only need to guarantee durability of the batchlog on the local base node.
          • Made the base -> view mutations async. Once we write to the local batchlog, we don't care if the actual mutations are sent, it's best effort. So we can fire and forget these and update the base memtable.
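
          Taken together, the last two bullets give a replica-side write path roughly like the sketch below. Every name in it is illustrative (this is not the actual Cassandra code), but the ordering is the point: serialize the read-before-write per partition, persist the view deltas to a local-only batchlog, fire-and-forget them to the paired view replica, then apply the base update without waiting on the view.

```python
import threading

partition_locks = {}   # per-partition locks serialize the read-before-write

def apply_base_update(key, update, base_table, local_batchlog, send_async, paired_view):
    """Sketch of the replica-side write path; all names are made up."""
    lock = partition_locks.setdefault(key, threading.Lock())
    with lock:                                   # serialize read-before-write
        existing = base_table.get(key)           # read the current base row
        deltas = ([("tombstone", existing)] if existing is not None else []) \
                 + [("insert", update)]          # view deltas: remove old, add new
        local_batchlog.append((key, deltas))     # local-only durability for deltas
        send_async(paired_view, deltas)          # best effort, no waiting
        base_table[key] = update                 # apply base and ack immediately

sent, bl, table = [], [], {}
apply_base_update("k", "v1", table, bl, lambda dst, d: sent.append((dst, d)), "view1")
apply_base_update("k", "v2", table, bl, lambda dst, d: sent.append((dst, d)), "view1")
print(table["k"], len(bl), sent[1][1][0][0])  # v2 2 tombstone
```

          The second call shows why the read-before-write is needed at all: the view must tombstone the entry derived from the old value before inserting the one derived from the new value.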

          The performance hit is caused roughly 50% by the batchlog and 50% by the read-before-write.

          TL;DR: anything we can do to make the BL or reads faster in trunk will translate into improvements on MV writes.

          I'll start with the graph that I think is most relevant. RF=3 Batchlog vs RF=3 Materialized View

          http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=mv-2015-07-15%2Fcassandra-mv-vs-si-benchmark-rf3-batchlog.log&metric=op_rate&operation=3_user_logged&smoothing=1&show_aggregates=true&xmin=0&xmax=1092.08&ymin=0&ymax=20973.7

          50% slower than 2i. >60% slower than regular writes


          This next graph shows RF=1 results (These are not a realistic benchmark IMO but these are the numbers)

          http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=mv-2015-07-15%2Fcassandra-mv-vs-si-benchmark-rf1.log&metric=op_rate&operation=3_user&smoothing=1&show_aggregates=true&xmin=0&xmax=621.28&ymin=0&ymax=131803.1

          50% slower than 2i. >60% slower than regular writes

          slebresne Sylvain Lebresne added a comment -

          [brianmhess] Yes typically.

          And to clarify, I'm not saying lifting that constraint is impossible or unthinkable, just that we shouldn't lift it before having made sure it's 1) actually useful and 2) not gonna limit us in the future.

          brianmhess Brian Hess added a comment -

          Just for clarity Sylvain Lebresne - you'd have to make the first name the Partition Key and the rest of the Primary Keys of the base table as Clustering Columns, correct?

          slebresne Sylvain Lebresne added a comment -

          For example, if my base table was id, name, and address, with id as the PK, I couldn't have MV with just name or address as the PK key according to this requirement, right?

          That's correct. We have no way of guaranteeing that your "just name or address" uniquely identifies a row in the base table (and most probably, it doesn't). So what would we do when multiple entries in the base table conflict in the MV? The only option would be to keep the most recent entry, but that would be confusing as hell and we don't want to do that (at the very least, not in this initial ticket). And while we could allow some form of aggregation in that case, that is also definitely at best a future feature.

          But more importantly, it's not really limiting at all. If you want an MV to search users by name, for instance, just put the name first in the MV PK. Having the rest of the columns in the PK won't prevent you from doing your search by name; it'll just guarantee that you see all your users that have that name, not just the last one updated.

          jjordan Jeremiah Jordan added a comment -

          I don't follow the rationale and that seems over-limiting. For example, if my base table was id, name, and address, with id as the PK, I couldn't have MV with just name or address as the PK key according to this requirement, right?

          You have to have all the PK columns from the base table in the clustering of the MV so that you can track deletions/updates correctly.

          jkrupan Jack Krupansky added a comment -

          force users to include all columns from the original PK in the MV PK.

          I don't follow the rationale and that seems over-limiting. For example, if my base table was id, name, and address, with id as the PK, I couldn't have MV with just name or address as the PK key according to this requirement, right?

          carlyeks Carl Yeksigian added a comment -

          +1, but I will address all of the comments from Josh first before making these changes.

          jbellis Jonathan Ellis added a comment -

          +1

          slebresne Sylvain Lebresne added a comment -

          The discussion was on how, after a TTL expires and we are persisting null values, it should be handled.

I'd argue there has been more than one discussion.

All I want is to make sure we agree that the following changes should be made to the patch of this ticket (if they haven't been done already):

          1. lift the limitation of only "a single column that was not part of your original primary key".
          2. force users to include all columns from the original PK in the MV PK. In other words, I agree with Jonathan's remark above that "we shouldn't silently change the definition from what was given".
          3. require WHERE x IS NOT NULL for every column in the MV primary key.
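Taken together, those three rules would make a definition like the following valid (a sketch only; the `users` table and its columns are the hypothetical example discussed earlier in this thread):

```cql
-- Hypothetical base table: user_id is the only PK column.
CREATE TABLE users (
    user_id uuid,
    name text,
    address text,
    PRIMARY KEY (user_id)
);

-- Rule 2: the MV PK includes every base PK column (user_id).
-- Rule 3: every MV PK column carries an IS NOT NULL restriction.
CREATE MATERIALIZED VIEW users_by_name AS
    SELECT * FROM users
    WHERE name IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (name, user_id);
```

Rule 1 (lifting the single-column limit) would additionally allow, e.g., `PRIMARY KEY (name, address, user_id)` with `address IS NOT NULL` as well.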
          carlyeks Carl Yeksigian added a comment -

          Right now, it is doing the max instead of min, but I think that I had gotten them mixed up, so I'll make sure it is correct.

TTL expiration is automatic, as the TTLs are known from the update and are just included with the MV update. The discussion was on how, after a TTL expires and we are persisting null values, it should be handled.

          jkrupan Jack Krupansky added a comment -

          What CL will apply when MV rows are deleted on TTL expiration?

          Presumably each of the replicas of the base table will have its TTL expiration triggering roughly at the same time, each local change presumably triggering a delete of the MV, but the MV has replicas as well.

          Maybe ANY is reasonable for CL for MV update on TTL since the app is not performing an explicit operation with explicit expectations.

          slebresne Sylvain Lebresne added a comment -

          I was trying to explain the issue that you are here as well – not how it currently works.

Cool then, sorry for the confusion. I guess the one small thing that is not implemented, though, is that you currently only allow a single column that is not part of the base table PK in the MV PK, while we could allow as many as we want as long as we take the min TTL. Or is that also already implemented?

          carlyeks Carl Yeksigian added a comment -

Carl hinted at some problem with that suggestion, but I haven't understood his point. So we should discuss that objection but 'a priori' I think that suggestion works

          I was trying to explain the issue that you are here as well – not how it currently works. It currently is implemented the way that you have described it (TTL on MV PK).

          jbellis Jonathan Ellis added a comment -

          Makes sense.

          slebresne Sylvain Lebresne added a comment -

          So you're saying that IS NOT NULL should treat TTL'd columns as if they were already null?

I'm not, that would be terribly confusing (I personally couldn't live with it). I'm saying that if initially we only support the case where all MV PK columns have IS NOT NULL, then in the presence of TTL, a row should disappear from the MV as soon as the first MV PK column expires. So, as I said earlier, I'm pretty sure we can manage by TTLing the whole MV row with the smallest TTL of any MV PK column (Carl hinted at some problem with that suggestion, but I haven't understood his point. So we should discuss that objection, but 'a priori' I think that suggestion works).

However, I'm also saying that I don't know how to handle that expiring case as soon as we don't have the IS NOT NULL condition on some of the columns, because in that case what we should do when a column expires is replace the existing MV entry with a new one that has a different PK (one with a null for the newly expired column), and that's hard to do.
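A sketch of the min-TTL suggestion (hypothetical schema, assuming the rule above is adopted):

```cql
CREATE TABLE t (k int PRIMARY KEY, a int, b int);

CREATE MATERIALIZED VIEW mv AS
    SELECT * FROM t
    WHERE k IS NOT NULL AND a IS NOT NULL AND b IS NOT NULL
    PRIMARY KEY (a, b, k);

UPDATE t USING TTL 600 SET a = 1 WHERE k = 0;
UPDATE t USING TTL 60  SET b = 2 WHERE k = 0;

-- Under the suggestion, the MV row (a=1, b=2, k=0) would be written
-- with the smallest TTL of its PK columns (60s), so it disappears
-- from the view as soon as the first PK column (b) expires.
```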

          jbellis Jonathan Ellis added a comment -

          So you're saying that IS NOT NULL should treat TTL'd columns as if they were already null? That is a little confusing but I can live with it.

          slebresne Sylvain Lebresne added a comment -

          When we actually add support for null partition keys we can lift this requirement.

I do want to note that there is a technical problem with allowing null in the MV primary keys (partition key or clustering columns alike), and that is TTLs. More precisely, if a column c in the MV PK has a TTL, we would need to update the MV once c expires, replacing the entry with one where c is now null. And a priori, I don't see an even remotely efficient way to deal with that (keeping in mind that we collect expired columns lazily, not when they expire).

          That is, I agree that requiring IS NOT NULL for all primary key columns is the safe option for 3.0, but just wanted to note that the TTL problem above seems to me like a big obstacle in removing that requirement.

          mbroecheler Matthias Broecheler added a comment -

          +1

          jbellis Jonathan Ellis added a comment -

          (Created 9809 for full WHERE support.)

          jbellis Jonathan Ellis added a comment -

          ... actually I think that is how we should deal with this for 3.0.

          For forwards compatibility, we should require WHERE x IS NOT NULL for all partition key columns.

          When we actually add support for null partition keys we can lift this requirement. And then we can extend WHERE to be more general as well.

          jbellis Jonathan Ellis added a comment -

          It needs to be supported that rows are ignored in the view if their column values would lead to null values in the primary key of the view. Otherwise, view maintenance becomes impractical for sparsely populated tables.

          This is easily supported syntactically without special directives:

          CREATE MATERIALIZED VIEW users_by_ssn AS
          SELECT ...
          FROM users
          WHERE ssn IS NOT NULL
          PRIMARY KEY (ssn, user_id)
          

          Support for WHERE clause in MV is on the roadmap.

          mbroecheler Matthias Broecheler added a comment -

          My 2 cents on the null discussion:
          It needs to be supported that rows are ignored in the view if their column values would lead to null values in the primary key of the view. Otherwise, view maintenance becomes impractical for sparsely populated tables. Imagine you have a large table of users but only 10% or so of those have a SSN. If you create a view by SSN you are creating a hotspot which makes it impractical.
          However, there are situations where including null values is perfectly reasonable under the assumption that the combination of the remaining primary key column values leads to a sufficiently high selectivity. Problem is, without the user giving us some sort of information the two cases won't be distinguishable.
Hence, it may make sense to introduce a WITH NULL or WITHOUT NULL qualifier in the view definition to give the user control over the behavior, with WITHOUT NULL being the default option since it's safer (i.e. lower likelihood of hotspots).

          jbellis Jonathan Ellis added a comment - - edited

          I don't think this is the right solution, because sometimes denormalizing the null as partition key is the right thing to do. Which means we do need to support using columns that are not marked NOT NULL in MV PK. Adding a warning in that case is appealing but I think history shows that this won't actually help the people who need it. (See: everyone shooting themselves in the foot with ALLOW FILTERING queries.)

          So ultimately, if we're not going to disallow it entirely (and we shouldn't), then we really haven't gained very much. Especially if the band aid is as ugly as this one.

          tupshin Tupshin Harper added a comment -

I find myself disagreeing with the hard requirement that all rows in the table must show up in the materialized views. While it would be nice, I believe that clearly documenting the limitation and providing a couple of reasonable choices is far preferable to handing the user enough rope to hang themselves.

          My suggestion:

          • Create a formal notion of NOT NULL columns in the schema that can be applied to a table, irrespective of any MV usage.
          • Columns that are NOT NULL would have the exact same restrictions as PK columns, namely that they need to be included in all inserts and updates (with the possible exception of LWT updates)
          • Document (and warn in cqlsh) the fact that if you create a MV with a PK using a nullable column from the table, then those values will not be in the view

It seems to me that this is far less dangerous (and in many ways less surprising) than automatically creating a hotspot in the MV because lots of data with NULLs gets added.

          Now with 8099 supporting NULLs for clustering columns, this might only apply to columns that would be a partition key in the MV, and that seems appealing. But I can't talk myself into liking inserting nulls into a MV partition key.
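Under that proposal, the flow might look like this (hypothetical syntax only: table-level NOT NULL constraints do not exist in CQL; this is just a sketch of the suggestion):

```cql
-- Hypothetical NOT NULL constraint on a regular column.
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name text,
    ssn text NOT NULL
);

-- Every insert/update must now supply ssn, so a view keyed on it
-- can contain every base row without nulls in its partition key.
CREATE MATERIALIZED VIEW users_by_ssn AS
    SELECT * FROM users
    PRIMARY KEY (ssn, user_id);
```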

          slebresne Sylvain Lebresne added a comment -

          However, if we have the view also have a null for the first row, so it would be (0, null, 0), wouldn't we also expect to see a row for (1, null, 1)?
          The fact that a cell was non-null and is now null shouldn't really have any different behavior than if it was null from the start.

Certainly. In CQL, a column being null is by definition equivalent to the column not having a value. I'm not in any way suggesting we should make a difference here.

          That is, the MV will have the same columns in its primary key as the base table's, but with different partition keys and different orders for clustering columns. I could be wrong, but that seems like the biggest use case. And in that use case the keys are all non-null.

          The question is if this should be the only supported case. If you read the comments above, some people strongly disagree. So we're debating how it should work when that's not the case and how to support it. Everybody agrees the case you're describing is the easy case.

          brianmhess Brian Hess added a comment -

A question about behavior here. If we do the following:

          INSERT INTO t (k, t, v) VALUES (0, 0, 0) USING TTL 1000;
          UPDATE t SET v = null WHERE k = 0 AND t = 0;
          INSERT INTO t (k, t) VALUES (1, 1) USING TTL 1000;
          

          Then, doing a

          SELECT * FROM t;
          

          would result in getting both rows: (0, 0, null) and (1, 1, null)

          However, if we have the view also have a null for the first row, so it would be (0, null, 0), wouldn't we also expect to see a row for (1, null, 1)?
          The fact that a cell was non-null and is now null shouldn't really have any different behavior than if it was null from the start.

          Another point - I think the primary use case will be to mimic query tables. That is, the MV will have the same columns in its primary key as the base table's, but with different partition keys and different orders for clustering columns. I could be wrong, but that seems like the biggest use case. And in that use case the keys are all non-null.

          jbellis Jonathan Ellis added a comment -

          To have every row in the base table having a counterpart in the MV is something I would have expected.

          +10, this is the fundamental proposition that users will expect.

          slebresne Sylvain Lebresne added a comment -

          Changing this would also imply that

INSERT INTO t (k, t) VALUES (0, 0);
          

          would create a row in the view as well for (0, null, 0).

          Absolutely. And to be clear, all I'm saying is, we should make sure we're all in agreement and comfortable with the semantic we implement (and we should be clear on what that semantic is exactly). To have every row in the base table having a counterpart in the MV is something I would have expected. Are we sure we are fine with giving up that property? Or to put it another way, if we do give up that property, I'd like to make sure we understand why we are doing so, and I'm not all that clear on that. Is everyone except me convinced that a null for a column that is part of the MV primary key should mean the row is not in the MV?

          carlyeks Carl Yeksigian added a comment -

          Aleksey Yeschenko: I added support to allow for changing the table options with CREATE MATERIALIZED VIEW, so it is possible to set those options when creating the view.

          Sylvain Lebresne:

Is this how this currently works?

Yes, this is how it currently works; if we have any NULLs in the view primary key, we do not insert the row into the view.

          Changing this would also imply that

INSERT INTO t (k, t) VALUES (0, 0);
          

          would create a row in the view as well for (0, null, 0).

Assuming we are fine with the semantic of "the view only has a row if the base table has a row and none of the non-PK columns of the base table that are in the MV PK are null" (which I'm not sure I am, and it is an important decision for sure), then I think we'd be fine with more than one column on that front by simply using the min TTL of all the columns.

It works this way under the assumption that we don't have to represent the null columns which are introduced when the currently TTL'd columns expire; if that changes, we'll have to redo all of the current TTL work, which is fine as long as we have both a clear idea of the work involved to introduce this and a real use case we are trying to support.

Overall, I'd be more surprised if a materialized view included rows for values that were never set, or which were specifically deleted, than if it didn't include those values.

          slebresne Sylvain Lebresne added a comment -

Ok, let me back up a minute here. If I understand correctly, the current semantic is that if a view includes a non-PK column from the base table in its primary key, then the MV only has an entry if that non-PK column is not null? In other words, given:

          CREATE TABLE t (k int, t int, v int, PRIMARY KEY (k, t));
          CREATE MATERIALIZED VIEW v AS SELECT * FROM t PRIMARY KEY (t, v, k);
          
          INSERT INTO t (k, t, v) VALUES (0, 0, 0) USING TTL 1000;
          UPDATE t SET v = null WHERE k = 0 AND t = 0;
          

          If you do:

          SELECT * FROM v;
          

you get no results, even though the view was created with a SELECT * and there is a row in the base table. Is this how this currently works?

          Perhaps more importantly, is it how we want it to work?

That is, doesn't it feel a bit inconsistent not to have any entry in the view? In particular, it means that if you query the view to find how many rows have t == 0, you won't get an accurate answer. Wouldn't it be better to get:

          SELECT * FROM v;
          t | v    | k
          -------------
          0 | null | 0
          

          Because that is the only column which is supported, we TTL the row in the view so that it is taken care of the same as the base TTL.

Assuming we are fine with the semantic of "the view only has a row if the base table has a row and none of the non-PK columns of the base table that are in the MV PK are null" (which I'm not sure I am, and it is an important decision for sure), then I think we'd be fine with more than one column on that front by simply using the min TTL of all the columns.

          iamaleksey Aleksey Yeschenko added a comment -

          I've got an issue with how MVs are represented in schema, but, more user-visibly, a minor issue on the DDL side. Or, rather, two issues:

          1. We have CREATE MATERIALIZED VIEW, DROP MATERIALIZED VIEW, but we reuse ALTER TABLE to modify it. We should have ALTER MATERIALIZED VIEW instead.
          2. It's currently impossible to override things like compaction and compression parameters in CREATE MATERIALIZED VIEW. Users shouldn't need to issue a follow up ALTER MATERIALIZED VIEW to set the parameters. CREATE MATERIALIZED VIEW should support WITH.
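Concretely, the two changes would allow something like the following (a sketch of the proposed syntax; the option values are illustrative):

```cql
CREATE MATERIALIZED VIEW users_by_name AS
    SELECT * FROM users
    WHERE name IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (name, user_id)
    WITH compaction = { 'class': 'LeveledCompactionStrategy' };

ALTER MATERIALIZED VIEW users_by_name
    WITH compaction = { 'class': 'SizeTieredCompactionStrategy' };
```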

          carlyeks Carl Yeksigian added a comment -

          That said, as far as I can tell that's a problem even with a single column that is not in the original PK. Yet, it's supposed to be supported by the patch. How does it work? What happens once that column expires?

          Because that is the only column which is supported, we TTL the row in the view so that it is taken care of the same as the base TTL. The difference here is that a new value would also need to be created: the NULL that the TTL will result in once it has expired. We would either need special logic to handle a TTL on compaction (and lose some eventual consistency), or generate a future-timestamped value for the NULL whenever we create a TTL, one that becomes valid when the TTL expires on the base and view columns.

          That part I don't follow.

          We don't have semantics for regular columns being set to null; instead, that indicates they are tombstones. We can't distinguish between an explicitly set NULL and one that happened because we either tombstoned the columns or never set the included columns, so we will have to use the NULL value for all unset and tombstoned columns.

          slebresne Sylvain Lebresne added a comment -

          Because one isn't implicitly a hot spot

          Ok, gotcha, an MV table where the partition key is a single column that is not a PK from the base table will create implicit hot spots. I'm personally happy to disallow that case. That's still a very specific case: MV tables with multiple partition key columns where one of them is a PK in the base table (but not all of them are) don't have this problem, so the underlying question of how we support that is still open.
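          For concreteness, a hypothetical view of the kind described (all names invented for this sketch): the view's composite partition key combines a base PK column with a non-PK column, so there is no implicit single-token hotspot, even though the non-PK component can still be null:

```sql
CREATE TABLE users (id int PRIMARY KEY, name text, state text);

-- 'id' is a base PK column, 'state' is not; together they form the
-- view partition key, so rows with a null 'state' do not all collapse
-- onto one token.
CREATE MATERIALIZED VIEW users_by_state AS
    SELECT * FROM users
    PRIMARY KEY ((state, id));
```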

          we would have to insert new NULL values after a TTL expired

          Good point. I can definitely see how TTLs are a problem for including non-PK columns from the base table into an MV PK. That said, as far as I can tell that's a problem even with a single column that is not in the original PK. Yet, it's supposed to be supported by the patch. How does it work? What happens once that column expires?

          This means that we will have to insert a NULL value for all unset columns, as well as all tombstone'd columns, since regular columns are not explicitly null.

          That part I don't follow.

          tjake T Jake Luciani added a comment -

          How is what you say different from writing a new value at RF=3 and only one replica getting the new value?

          Because one isn't implicitly a hot spot

          slebresne Sylvain Lebresne added a comment -

          NULL, however, is not something people think about since it's implicitly the default value for Cassandra.

          Again, I'm not trying to say that it's not something we should consider. I'm mainly saying that imo we should do CASSANDRA-9796 anyway so we don't have the internal limitation. Maybe we can impose limitations for the sake of protecting against foot-shooting, like requiring that at least one column of the MV partition key is a primary key column of the base table (without requiring all of them to be).

          If you are writing to a table with RF=3 and only one replica has a value but the other two do not then we will make one MV replica have a defined partition key and the other two replicas will have a NULL

          Not sure I follow. Nulls should just be "just another value" and there shouldn't really be any special casing for them. How is what you say different from writing a new value at RF=3 and only one replica getting the new value?

          carlyeks Carl Yeksigian added a comment -

          Currently, NULL is used to insert tombstones into the table:

                  // insert exactly the amount of tombstones that shouldn't trigger an exception
                  for (int i = 0; i < THRESHOLD; i++)
                      execute("INSERT INTO %s (a, b, c) VALUES ('key', 'column" + i + "', null);");
          

          This means that we will have to insert a NULL value for all unset columns, as well as all tombstone'd columns, since regular columns are not explicitly null. In addition, we would have to insert new NULL values after a TTL expired.

          tjake T Jake Luciani added a comment -

          Creating a MV where the partition key is for instance the gender of your users is equally going to create massive hotspots, but it's not like we can forbid people to do it.

          This is an obvious anti-pattern for a materialized view. NULL, however, is not something people think about since it's implicitly the default value for Cassandra. Especially when considering consistency issues in a dynamo system. If you are writing to a table with RF=3 and only one replica has a value but the other two do not then we will make one MV replica have a defined partition key and the other two replicas will have a NULL. Currently we don't allow this so the NULLs are left out.

          brianmhess Brian Hess added a comment -

          Moreover, if you were to handle NULLs in the application tier (say by the application using the string "NULL" or the INT -1 or whatever) you end up with exactly the same hot spot. This is a known issue for distributed systems in general (not just Cassandra), so I would note it and move on - just like we note that GENDER is a bad partition key today.

          slebresne Sylvain Lebresne added a comment -

          I'm still not sure how handling a null partition key would work though in terms of tokens. If you have a sparse column across millions of rows in the base table you would end up with a HUGE hotspot since all NULL keys would end up on the same partition (token X)

          We're saying it could work technically. Your point that it may not be a good idea is valid, but I could argue that creating hotspots by choosing poor partition keys is possible with or without nulls. Creating a MV where the partition key is for instance the gender of your users is equally going to create massive hotspots, but it's not like we can forbid people to do it (and if we were to somehow be able to enforce NOT NULL (which we can't) and force people to use it, you can be sure some people would just end up using default values and thus generate the exact same problems. There is no such thing as stupid-proofing).

          If we fear nulls make it easier to shoot yourself in the foot, then maybe we can find some kind of limitation that is not too restrictive but makes that foot-shooting less likely, but I suspect we'll just have to rely on having good best practice docs.

          tjake T Jake Luciani added a comment -

          I'll first note that post-8099 we do support null for clustering keys internally, so we can (and should) use that.

          Oh good! That'll be easy then for clustering keys. (we should note that in 3.0 docs)

          Which is why I would much prefer a general solution à la CASSANDRA-9796.

          Yeah our concern was trying to support something only allowed for MV tables would end up biting us.

          I'm still not sure how handling a null partition key would work though in terms of tokens. If you have a sparse column across millions of rows in the base table you would end up with a HUGE hotspot since all NULL keys would end up on the same partition (token X)
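          A toy illustration of that hotspot (Python standing in for the partitioner; the fixed null token and the md5-based hash are invented for the sketch and are nothing like Cassandra's actual Murmur3Partitioner):

```python
import hashlib
from collections import Counter

NULL_TOKEN = 0  # some fixed token reserved for null partition keys

def token(key):
    """Toy stand-in for the partitioner: hash non-null keys,
    send every null key to the single reserved token."""
    if key is None:
        return NULL_TOKEN
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')

# A sparse column: most base rows have no value for the view's partition key.
keys = [None] * 1_000_000 + ['a', 'b', 'c']
load = Counter(token(k) for k in keys)

# The null token owns nearly all of the data: one enormous partition.
print(load[NULL_TOKEN])  # 1000000
```

          With a sparse column, virtually every base row lands on the one reserved token, which is exactly the single huge partition described above.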

          iamaleksey Aleksey Yeschenko added a comment -

          And I can guarantee we'll regret that for the years to come. Which is why I would much prefer a general solution à la CASSANDRA-9796.

          This is my strong preference as well.

          slebresne Sylvain Lebresne added a comment -

          On the subject of null in primary keys, I'll first note that post-8099 we do support null for clustering keys internally, so we can (and should) use that.

          Partition keys are arguably a bit more of a problem. My own preference would be to actually do CASSANDRA-9796, as I think we should do that sooner or later anyway and we could make null supported there in a somewhat reasonable way.

          I'm proposing that the encoding should be internal, not exposed to the user.

          But MVs are currently directly exposed as read-only tables. So if we start doing something that is internal only and not exposed to the user, this will get hacky pretty quickly, forcing us to intercept the queries in SelectStatement and do whatever translation is necessary. And I can guarantee we'll regret that for the years to come. Which is why I would much prefer a general solution à la CASSANDRA-9796.

          jbellis Jonathan Ellis added a comment -

          how do you take the token on a null partition key

          There are a number of options. Make it the first value in the token space, or the same as the token of empty blob, or the same as the token for a zero-element tuple, for instance. It doesn't really matter since token collisions are acceptable now.

          if you used it for clustering keys the query syntax would be non-standard

          I'm proposing that the encoding should be internal, not exposed to the user. Fundamentally we need to distinguish that (A, null) and (null, B) are different things. Whether we use a tuple or some custom encoding to do that I don't care.
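          A minimal sketch of why a tagged encoding disambiguates where flat concatenation cannot (Python, purely illustrative; the marker bytes and length prefixes are invented here and are not Cassandra's serialization format):

```python
def encode(components):
    """Tagged, length-prefixed encoding: null gets its own marker byte,
    so (A, null) and (null, B) serialize to different byte strings."""
    out = bytearray()
    for c in components:
        if c is None:
            out.append(0x00)                      # null marker
        else:
            b = c.encode('utf-8')
            out.append(0x01)                      # present marker
            out += len(b).to_bytes(2, 'big') + b  # length prefix + value
    return bytes(out)

# The values the thread worries about conflating are now all distinct.
assert encode(('A', None)) != encode((None, 'B'))
assert encode(('A', None)) != encode(('A',))
assert encode((None, None)) != encode((None,))
```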

          It wouldn't be any different than how we enforce it for PK today

          Which is exactly the problem, the syntax itself enforces that the PK is specified because that is how the row being updated is identified. But with this we'd have to require the user to add , not_null_column=X as part of the SET clause for every UPDATE.

          tjake T Jake Luciani added a comment -

          just leaving out data for a null value is broken.

          I disagree. For example how do you take the token on a null partition key? (assuming your partition key is a non-PK column)

          What if we decided that MV partition key will always be encoded as a tuple?

          It doesn't solve the above problem and if you used it for clustering keys the query syntax would be non-standard (you can't query by base name anymore)

          I don't see how NOT NULL works except as an extension of INSERTS ONLY

          It wouldn't be any different than how we enforce it for PK today (if a column is a PK column it can't be null).
          The only restriction is you can't alter the table later to make a column not NOT NULL. But I'm not sure how tombstones would work; you could only update to a new value.

          jbellis Jonathan Ellis added a comment -

          Re null handling: as I said initially, just leaving out data for a null value is broken. It's not good for a single column, and it's worse for multiple columns. [It sounds like you are saying it treats (A, null) == (null, B) == null which is not what we want either.]

          What if we decided that MV partition key will always be encoded as a tuple? Does that get around the "partition key cannot be null" problem if we have tuple(null) instead?

          (I don't see how NOT NULL works except as an extension of INSERTS ONLY, because otherwise you either have to introduce possibly racy read-before-write or you have to require specifying the NOT NULL column on every UPDATE.)

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Hate to interrupt this lively debate about null handling, but I have more feedback for Carl.

          Last set of style/lower level feedback here. Going to start tracing the various paths and confirm that the current logic affords the guarantees we think they do as well as look into the schema a bit more.

          A couple of questions:

          1. How clear to users is it that CAS operations are disabled by adding a MV to a CF? Is this going to be documentation-specific or is this going to be something we pop up in CQL as well?
          2. In MaterializedViewLongTest, why are you ignoring all Throwables from the BatchLogManager.instance.startBatchlogReplay().get() call?
          MaterializedView
          • Consider renaming cfModifiesSelectedColumn. Maybe "updateAffectsView(AbstractPartitionData apd)"?
            • Same w/MaterializedViewManager.cfModifiesSelectedColumn. "cfModifies" is unclear - what is the "cf" being referred to here? The legacy "ColumnFamily" nomenclature is no longer used; this is actually a PartitionUpdate.
          DeletionTimeArray
          • Looks like an accidental change to isLive snuck in
          Keyspace
          • I have some concerns about the garbage/performance impact of a call to MaterializedViewManager.touchesSelectedColumn on each Keyspace.apply. While I'm fine letting MV performance optimization efforts be a follow-on effort (largely due to the fact that the MV code is well abstracted), I'd like some more testing / assurances that the places where we've modified the critical path don't introduce performance regressions.
          RowUpdateBuilder
          • Need to rephrase the following error message: "since no clustering hasn't been provided". I realize you just copied the one that was already in addListEntry, but may as well fix the double-negatives while we're here.
          SystemKeyspace
          • Consider renaming the schema values for MATERIALIZEDVIEW_BUILDS and BUILT_MATERIALIZEDVIEWS to be consistent w/other schema entries. Maybe follow COMPACTIONS_IN_PROGRESS w/MATERIALIZED_VIEW_BUILDS_IN_PROGRESS = "materializedviews_builds_in_progress"? The current values don't immediately differentiate themselves from one another to me.
          TruncateVerbHandler
          • We should probably move MV truncation into cfs.truncateBlocking as MV are children of the CFS, rather than have it being a part of the VerbHandler.
          StorageProxy
          • Need clearer differentiation between the 2 wrapBatchResponseHandler methods rather than just a comment. If initiating writes is a side-effect of the method, its name should reflect that.
          • I'm also not sure that comment is accurate as the difference between the 2 methods currently appears to be only around naturalEndpoint determination
          MaterializedViewLongTest
          • ttlTest looks like it should be a regular unit test rather than a long
          MaterializedViewTest
          • Comment on why you're swallowing exceptions
          • Comment tests to indicate the use-cases they're testing
          • Do we really need to sleep in 1000ms increments on a CREATE MATERIALIZED VIEW statement? (This is more a question on expected performance w/a base CF w/100 records than anything)
          • testAllTypes is going to atrophy when new types are introduced
          MaterializedViewUtilsTest
          • Needs comments
          nits
          • Unused imports in Keyspace.java
          • Double-whitespace in Keyspace.apply
          • Unused import in RowUpdateBuilder.java
          • Double-whitespace in StorageProxy.mutateAtomically
          • Remove left over debug logging in StorageProxy.mutateAtomically
          • MaterializedViewLongTest: double whitespacing, unused import
          • MaterializedViewTest: unused imports, double whitespace
          • MaterializedViewUtilsTest: unused imports
          Things I'd like to see performance tested
          • Pre-MV vs. Post-MV patch, regular read/write/mixed throughput and latency
          • Pre-MV vs. Post-MV patch, batchlog stress

          Given the changes to non-MV queries on the critical path, I think it's important we at least confirm we're not introducing any obvious performance regressions to those query types when excluding the MV code-paths from apply calls.

          tjake T Jake Luciani added a comment -

          and not all of the rows have that column present (it's null) - what happens to rebuild then?

          Well by definition then the MV row is null because part of the PK is null. But in the existing case this is a 1:1 relationship since we only allow one nullable column in the PK of the MV (when base is null then view is null). If we applied the same idea to multiple nullable fields then we would end up with a much more confusing situation.
          Consider building from inconsistent replicas:
          Let's say you put non-PK columns A and B into a materialized view. Replica 1 has all of column A, Replica 2 has all of column B. The build would end up with no data in the MV. You would need to subsequently repair the data to build the MV.

          Also, in general I'm not sure ATM how to support things like multiple conflicting TTLs across non-PK columns.
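          The "min TTL" resolution floated earlier in the thread could be sketched like this (Python, purely illustrative; None stands for a column with no TTL):

```python
def view_row_ttl(column_ttls):
    """Resolve a view row's TTL from the base columns it depends on.

    A view row is only meaningful while all of the base columns it was
    built from are live, so it must expire with the shortest-lived one.
    None means 'no TTL' (the column never expires).
    """
    finite = [t for t in column_ttls if t is not None]
    return min(finite) if finite else None

assert view_row_ttl([1000, 300, None]) == 300  # shortest TTL wins
assert view_row_ttl([None, None]) is None      # nothing expires
```

          This still leaves the question above open: expiry of any one column would have to null out or tombstone the whole view entry, which is the compaction-time complication discussed earlier in the thread.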

          iamaleksey Aleksey Yeschenko added a comment -

          I must be missing something.

          If you have just one base-table regular column in MV's partition key (the current restriction), and issue a CREATE MATERIALIZED VIEW, and not all of the rows have that column present (it's null) - what happens to rebuild then?

          tjake T Jake Luciani added a comment -

          and have the operator correct it.

          How can you correct it? Re-insert all the data? That doesn't sound like a good plan, especially considering you'd have a half-broken MV sitting there.

          iamaleksey Aleksey Yeschenko added a comment -

          We can disallow such updates (ones that don't have all the columns that are in a PK in one of the views). For data that's already there, we'll error out during the build, and have the operator correct it.

          tjake T Jake Luciani added a comment - - edited

          What I'm saying is we can't insert nulls into clustering or partition keys; we don't support it. So we can't put multiple non-PK columns into a materialized view PK (of course they can be in the non-PK columns of the view):

          cqlsh:test> create table test(foo text, bar text, baz text, PRIMARY KEY(foo,bar));
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( 'a', 'b', 'c');
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( 'a', null, 'c');
          InvalidRequest: code=2200 [Invalid query] message="Invalid null value for clustering key part bar"
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( null, 'b', 'c');
          InvalidRequest: code=2200 [Invalid query] message="Invalid null value for partition key part foo"
          cqlsh:test> 
          
          
          jjordan Jeremiah Jordan added a comment - - edited

          But what about data inserted before the MV existed? Also what about UPDATEs/INSERTs that don't include all columns? Would you have to include it? What if it already has a value on disk? Why can't MV just deal with null as a valid value?

          iamaleksey Aleksey Yeschenko added a comment -

          But if we know there is an MV with certain columns, we can perform the validation even without NOT NULL, just inferring the constraint from the presence of the MV?
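As a rough illustration of that inferred constraint, a minimal Python sketch (hypothetical names; not the actual validation code path):

```python
# Columns that appear in some view's primary key but are regular (non-PK)
# columns in the base table; the presence of the MV implies they must be
# non-null on every base-table write.
view_key_regular_cols = {"first_name"}

def validate_base_write(row):
    """Reject base-table writes that leave a view-key column null."""
    missing = [c for c in view_key_regular_cols if row.get(c) is None]
    if missing:
        raise ValueError("null value for materialized view key column(s): %s" % missing)

validate_base_write({"id": 1, "first_name": "alice"})   # accepted

try:
    validate_base_write({"id": 2, "first_name": None})  # rejected
    assert False, "should have raised"
except ValueError:
    pass
```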

          tjake T Jake Luciani added a comment -

          You can include any part of the original primary key in your MV primary key and a single column that was not part of your original primary key

          The reason is we don't have a way to enforce NOT NULL for non-PK keys. If you have two non-PK cols as part of the view PK then how can you update the view if the other is NULL.

          I was thinking if we added a NOT NULL syntax to CQL column definitions, like exists in SQL, we could probably support this.

          jbellis Jonathan Ellis added a comment -

          A couple points from Christopher Batey's blog post:

          You can include any part of the original primary key in your MV primary key and a single column that was not part of your original primary key

          I think this is the same thing Jack is pointing out above. Is there a compelling technical reason behind this? We lose a lot of flexibility if we can't have multiple non-pk columns in either partition key or clustering.

          Any part of the original primary key you don't use will be added to the end of your clustering columns to keep it a one to one mapping

          IMO we should return an error and let them correct it rather than silently changing the definition from what was given.

          If the part of your primary key is NULL then it won't appear in the materialised view

          This also sounds like something we should fix or it will bite people in unpleasant ways.
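The "unused primary key parts appended to the clustering columns" rule quoted above can be modelled roughly as follows (a simplified sketch that ignores the partition-key/clustering split):

```python
def view_primary_key(base_pk, declared):
    """View PK = the declared columns, followed by any base PK columns not
    already used, appended at the end to preserve the 1:1 base-row/view-row
    mapping."""
    leftover = [c for c in base_pk if c not in declared]
    return declared + leftover

# Base table PRIMARY KEY (foo, bar); the view declares PRIMARY KEY (baz).
assert view_primary_key(["foo", "bar"], ["baz"]) == ["baz", "foo", "bar"]

# If the view already names part of the base PK, only the rest is appended.
assert view_primary_key(["foo", "bar"], ["baz", "bar"]) == ["baz", "bar", "foo"]
```

Jonathan's point is that this silent appending should instead be an error returned to the user.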

          jkrupan Jack Krupansky added a comment -

          we allow <= 1 non-pk column in a MV partition key however the error message we log on multiple attempts reads "Cannot include non-primary key column '%s' in materialized view partition key". We should log that <= 1 are allowed instead.

          Wow, is that really true? It sounds like a crippling restriction. Is that simply a short-term expediency for the initial release, or a hard-core long-term restriction?

          Just as an example if I had a table with name, address and id, with id as the primary key, I couldn't have an MV with just name or just address or just name and address as the partition key, right?

          In particular, this restriction seems to preclude pure inverted index MVs - where the non-key content of the row is used to index the key for the row.

          Still waiting to read an updated CQL spec - especially any such limitations.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Thus far, the implementation looks pretty solid. Fairly self-documenting (pending some of the naming issues we discussed above). I have more code to go through but have some more feedback.

          Questions:

          1. Currently we're submitting the MaterializedViewBuilder to the standard executor in CompactionManager. Is it logical to have these 2 operations share a thread pool and resources?
          2. Why the removal of updatePKIndexes from Cells.java.reconcile?

          Feedback:

          General
          • Why are concurrent_batchlog_writes and concurrent_materialized_view_writes both hard-coded to 32 and not in the yaml?
          AlterTableStatement
          • mv.included.isEmpty() as a check to see if all columns are included is counter-intuitive. Add a helper function mv.included.containsAll() (mentioned prior, may be addressed in subsequent commits)
          • announceMigration: materializedViewDrops is never initialized or used
          CreateMaterializedViewStatement
          • in CreateMaterializedViewStatement.announceMigration, while turning ColumnIdentifier.Raw into ColumnIdentifier, we allow <= 1 non-pk column in a MV partition key however the error message we log on multiple attempts reads "Cannot include non-primary key column '%s' in materialized view partition key". We should log that <= 1 are allowed instead. We should also document why this restriction is in-place in the code.
          • refactor out duplication w/building targetPartitionKeys and targetPartitionColumns w/nonPKTarget
          DropMaterializedViewStatement
          • In findMaterializedView, you don't need to iterate across all the members of cfm.getMaterializedViews().values() as it's a Map, you can just check for whether or not it contains a member at columnFamily() index.
          SSTableIterator
          • l252. Was this addressing a bug you uncovered during development or an accidental change? Your update is changing what we're comparing to indexes.size() by pre-incrementing currentIndexIdx before said comparison.
          MaterializedViewBuilder
          • MaterializedViewBuilder.getCompactionInfo isn't giving us particularly good information about the # Tokens built vs. total. We discussed this offline - needs some comments in the code as to why it's currently limited in this fashion.
          MaterializedViewUtils
          • Should probably add a comment explaining why baseNaturalEndpoints and viewNaturalEndpoints always have the same # of entries, so .get on baseIdx is safe. Otherwise it takes a lot of knowledge about the MV RF implementation to understand it (thinking about future developers here)
          AbstractReadCommandBuilder
          • In .makeColumnFilter() - Why the change to the temporary stack ptr for CFMetaData?
          • It's not immediately clear to me why you changed from the allColumnsBuilder to the selectionBuilder - could use some clarification on that (for me here, not necessarily comments on code)
          ColumnFamilyStore
          • Pull contents of initRowCache into init() -> rather than breaking into 2 separate methods, just have the 1 renamed w/MVM init in it
          Nits:
          • SingleColumnRestriction: unnecessary whitespace changes
          • AlterTableStatement:
            • extraneous whitespace in announceMigration
            • tab in announceMigration modified to break 120 char
          • CreateMaterializedViewStatement:
            • unused imports
            • Double whitespace between methods
          • DropMaterializedViewStatement: "Cannot drop non existing" should be "Cannot drop non-existent"
          • CompactionManager: Need space after submitMaterializedViewBuilder method close
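The findMaterializedView point (a map lookup instead of iterating all values) in Python terms, with hypothetical names:

```python
# cfm.getMaterializedViews() is a map keyed by view name, so a direct lookup
# replaces scanning all of values().
views = {"users_by_first_name": {"name": "users_by_first_name"}}

# Instead of iterating over every view definition:
found = next((v for v in views.values() if v["name"] == "users_by_first_name"), None)

# ...a direct map lookup gives the same result in O(1):
assert views.get("users_by_first_name") == found
assert views.get("no_such_view") is None
```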
          aboudreault Alan Boudreault added a comment -

          Using Carl's branch 6477-rebase

          aboudreault Alan Boudreault added a comment -

          Okay, I've been working on these comparisons but haven't been able to provide useful results due to an issue I hit. I am doing my benchmarks on EC2 with a cluster of 3 nodes. Basically, I get realistic and useful results with C* stock (no MV) and C* with a secondary index (between 70000 and 85000 op/s). When it comes to testing C* with 1 MV, I get many WriteTimeoutExceptions, which results in a performance of 100 operations per second. I have been able to reproduce that 100 op/s locally using a 3-node cluster. The issue doesn't seem to be present when using a single-node cluster.

          I've profiled one of the nodes, and it looks like most of the time is spent in io.netty.channel.epoll.EpollEventLoop.epollWait() (around 75% of the time).

          Here's a yourkit snapshot of the first node of the cluster.
          http://dl.alanb.ca/CassandraDaemon-cluster-3-nodes-2015-07-10.snapshot.zip

          I've attached my users.yaml profile that I am using for testing: users.yaml

          Here's the materialized view creation statement:

          CREATE MATERIALIZED VIEW perftesting.users_by_first_name AS SELECT * FROM perftesting.users PRIMARY KEY (first_name);
          

          Here's the stress command I've been using:

          cassandra-stress user profile=/path/to/users.yaml ops\(insert=1\) n=5000000 no-warmup -pop seq=1..200M  no-wrap -rate threads=200 -node 127.0.0.1,127.0.0.2,127.0.0.3
          

          Let me know if I am doing anything wrong or if I can provide anything else to help. I'll provide the benchmarks as soon as I have a workaround for this issue.

          iamaleksey Aleksey Yeschenko added a comment -

          And, if possible, a comparison with manual denormalisation (in case we don't need read before write and have all the columns known to perform both writes).

          aboudreault Alan Boudreault added a comment -

          Sure, will do today.

          tjake T Jake Luciani added a comment -

          Alan Boudreault can you compare with counters and secondary indexes for some relative performance numbers?

          aboudreault Alan Boudreault added a comment -

          This morning, I did some initial performance testing to get a first overview. I tested the following scenarios:

          • Writing without MV
          • Writing with 1 MV
          • Writing with 3 MV
          • Writing with 5 MV

          Results: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=cassandra-mv-benchmark-2015-07-08.log&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=341.77&ymin=0&ymax=89079.1

          Briefly, with 1 MV created, which produces more mutations internally, we got 33% of the initial performance. Surprisingly, I don't see any further performance decrease with 3 and 5 MVs, as I would have expected. This leads us to think that it might be caused by a bottleneck (CPU? IO? some internal locks?).

          My next step is to profile what's going on to understand these results better.

          jkrupan Jack Krupansky added a comment - - edited

          I don't see updated CQL.textile for CREATE MV on the branch. Coming soon?

          Also, the comment for CREATE MV in CQL.g does not quite match the actual syntax:

          1. Missing the IF NOT EXISTS clause.
          2. Has parentheses around the <columns> list, but SELECT does not have that.
          3. Unclear whether AS or functions are supported in the column name list, but selectStatement would certainly allow that.
          4. Has FROM (<columnName>), which should be FROM <CF>, I think.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          it's also nice to keep the discussion in Jira so that it's easier to understand past design decisions when you look at tickets

          I was thinking about that. We absolutely don't want to lose discussion history; I was thinking of naming these types of branches <ticketnum>_review and having a policy of never deleting _review branches so they could be referenced for historical purposes. Not a particularly official or formal approach, however, with no safeguards beyond people's behavior, and it's also dependent on GitHub not archiving/removing old branches, space constraints there, etc.

          The difficulty of transposing review information into a Jira comment, and the convenience factor, are secondary to the fact that this approach takes my comments out of the immediate context of what I'm thinking, requiring a translation from me to here and from here back to whomever; my worry is that there's more chance of something getting lost in translation.

          But yeah - dev ML seems the way to go for this topic.

          thobbs Tyler Hobbs added a comment -

          I'm considering doing subsequent reviews as a branch I push to github w/comments inline (similar to what Benedict's been doing lately) with feedback as putting this quantity of feedback in Jira comments is a burden for both you and I - thoughts?

          We should probably discuss this on the dev ML, but here are my thoughts. It's way more convenient to comment on GitHub, but it's also nice to keep the discussion in Jira so that it's easier to understand past design decisions when you look at tickets. Maybe there's a nice middle ground?

          JoshuaMcKenzie Joshua McKenzie added a comment -
          Overall:

          Needs comments. Class level, javadoc where appropriate, etc.

          CFMetaData
          • duplicate entry for comparison if triggers added in CFMetaData.triggers
          • nit: space after CFMetaData.getMaterializedViews function end
          MaterializedViewManager
          • buildIfRequired: scope of whether it's required or skipped isn't in this method or within view.build(), so it looks like it's just building (or at least submitting the builder to the CompactionManager)
          • You're using SystemKeyspace.setIndexRemoved to mark the MV removed in removeMaterializedView but not the parallel setIndexBuilt to set them as built.
          • allViews has unnecessary genericity
          • nits:
            • Some unused imports
            • Consider renaming reload to indicate what it's reloading (e.g. reloadViews, reloadChickens, reloadDolphins)
          MaterializedView
          • createForDeletionInfo:
            • Consider caching/precomputing the results of CFMetaData.hasComplexColumns rather than iterating through them twice on each call (once in CFMetaData.hasComplexColumns, again in the for loop in createForDeletionInfo)
              • Perhaps hooking into addOrReplaceColumnDefinition, removeColumnDefinition and updating MV's with the data, keep a set inside MaterializedView and reference that
          • Don't need MaterializedView.reload() as it's just a passthrough to build and used in 1 place
          • Tighten up scoping on member variables - some package private that can be explicitly private
          • ctor: Duplication in the ctor in treatment of MVDefinition partition and clustering columns - could use a slight refactor to a function.
          • createTombstone / createComplexTombstone: refactor out building viewClusteringValues into method to remove duplication
          • targetPartitionKey: collapse spacing on:
            return viewCfs
               .partitioner
               .decorateKey(CFMetaData
                              .serializePartitionKey(viewCfs
                                                   .metadata
                                                   .getKeyValidatorAsClusteringComparator()
                                                   .make(partitionKey)));
            

            to something like:

            return viewCfs.partitioner.decorateKey(CFMetaData.serializePartitionKey(viewCfs.metadata
                                                                                    .getKeyValidatorAsClusteringComparator()
                                                                                    .make(partitionKey)));
            
          • createPartitionTombstonesForUpdates: Can simply return mutation @ end of function
          • createForDeletionInfo: unused parameter consistency
          • Inconsistent naming - using mutationUnits scattered throughout vs. LiveRowState
          • Rename query to queryMaterializedViewData or something similar that denotes what you're querying. Perhaps buildMVData
          • nits:
            • Some unused imports
            • ctor: Change nonPrimaryKeyCol to allPrimaryKeyCol as true, flip to false when none is found, so we're not assigning a double-negative to targetHasAllPrimaryKeyColumns
            • extra space in MaterializedView.createForDeletionInfo
            • ctor declaration fits on 1 line < 120 char, same for some other lines that have been wrapped
            • typo "hte" in javadoc for targetPartitionKey
          MaterializedViewDefinition
          • Consider caching sets of columns in the MV rather than pulling from CFMetaData and iterating. This would allow for faster lookup and simpler code than having to iterate across all members (MaterializedViewDefinition.selects for instance).
          • Consider using Sets of columns rather than Lists internally, would remove a lot of O(N) lookups and duplicate code logic scattered throughout the class (selects, renameColumn, etc)
          • selects(): if included.isEmpty() is apparently an indicator that it includes all. We should 1) comment that and 2) wrap a method around that behavior, e.g. 'boolean isAllInclusive()' that documents that behavior in the code.
          • nits:
            • I prefer copy constructors to a .copy() method and I think I've seen us err on the side of the copy constructor. I could be wrong here.
            • assert included != null and !isEmpty on ctor
            • extra whitespace in renameColumn @ top of method, @ end of method
          LiveRowState
          • The name conflates with LivenessInfo (at least for me) which we discussed on IRC. My biggest hurdle to grokking this class was un-plugging that and re-plugging the idea that it's really about materializing the data for the MaterializedView. Perhaps rename to something like TemporalRow, with LRSCell becoming TemporalCell? While LRSCell doesn't really have any temporal data above and beyond a general Cell, it does contain its own reconcile that takes temporality into account which makes sense.
          • Instead of addUnit / getExistingUnit in LiveRowState.Set, if going with the above these could be named addRow, getExistingRow
          • Think we can get rid of getInternedUnit entirely and just have a section in addUnit to initialize empty entries
          • Consider having LRSCell implement db.rows.Cell and adding a Conflicts.resolveCells(Cell left, Cell right, int nowInSec) method that passes through to Conflicts.resolveRegular. Simplifies a few different users of the method throughout the code-base.
          • Can do away w/PrivateResolver vs. Resolver. LRSCell being private should prevent non-class anonymous instantiations of the interface.
          • Various unused methods and constructors
          • Comment on whether or not retention of ordering is important in baseSlice as you could do away w/interim ByteBuffer array if not. I believe it is, but worth clarifying.
          • nit: whitespace inconsistencies in values(...)
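
The resolveCells idea above - a single entry point that applies last-write-wins by timestamp - might look roughly like this (simplified, hypothetical types; the real Conflicts.resolveRegular also has to consider tombstones and expiring cells):

```java
// Hypothetical sketch of timestamp-based cell reconciliation: the cell with
// the higher timestamp wins, and equal timestamps break the tie on the value
// so that every replica resolves to the same cell.
final class TemporalCell
{
    final long timestamp;
    final String value;

    TemporalCell(long timestamp, String value)
    {
        this.timestamp = timestamp;
        this.value = value;
    }

    static TemporalCell resolveCells(TemporalCell left, TemporalCell right)
    {
        if (left.timestamp != right.timestamp)
            return left.timestamp > right.timestamp ? left : right;
        // Identical timestamps: deterministic tie-break on the value.
        return left.value.compareTo(right.value) >= 0 ? left : right;
    }
}
```
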

          I definitely have more to review but figured I'd get the feedback I have thus far in here. I'm considering doing subsequent reviews as a branch I push to github w/comments inline (similar to what Benedict's been doing lately), since putting this quantity of feedback in Jira comments is a burden for both you and me - thoughts?

          sebastian.estevez@datastax.com Sebastian Estevez added a comment - edited

          2i's are a pain point during bootstraps and repairs because they are rebuilt after streaming. Will this be the case for MVs as well?
          Ideally, they could be streamed too - a much less intensive operation, both CPU- and I/O-wise.

          carlyeks Carl Yeksigian added a comment -

          That just ensures that the values which are currently in the base table will also be in the MV, but not that extraneous values present in the MV are removed. We can also repair the MV using the current repair process, which should address the same inconsistencies that streaming the updates to the MV would.

          jbellis Jonathan Ellis added a comment -

          Can't we wire up streamed updates to MV replication to make that unnecessary, the way we do with 2i?

          carlyeks Carl Yeksigian added a comment -

          It should at the very least be a follow-up ticket; it deserves a discussion, including what the overhead will be, and whether it is worth it.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          In the patch, there is no way to do a repair between the MV and the base table other than dropping and recreating the view.

          Would this be worth a follow-up ticket, or would the repair overhead / time frame be long enough that it makes sense just to recommend people drop/recreate MV after repair? If so, would it make sense to bolt that functionality as an option onto the repair process?

          carlyeks Carl Yeksigian added a comment -
          1. MVs use the batchlog in order to provide eventual consistency
          2. MV updates follow the same CL as the base update
          3. No, repairs to the base table only assure that the base table is consistent; the same holds for a repair on an MV

          In the patch, there is no way to do a repair between the MV and the base table other than dropping and recreating the view.

          jkrupan Jack Krupansky added a comment -

          1. Are MV updates still eventually consistent (not guaranteed)?

          2. Is there any way for the app to assure that the MV updates have been completed to some desired CL?

          3. Will a repair to the base table assure that all MV are consistent?

          4. Can a single MV be repaired to assure that it is consistent? (Especially since the data for a MV on a node will be derived from data on other nodes due to differences in the partition keys.)

          Great to see such an exciting new feature take shape!

          carlyeks Carl Yeksigian added a comment -

          Alan Boudreault: This was an unrelated issue - it was related to keeping track of whether the MV was built or not; I'm still looking at what is causing the inconsistency when a new node has been added. I have only seen it occur once in my testing, but it would be good to catch.

          Aleksey Yeschenko: I haven't done any cstar testing yet; have only done local stress. Would be good to run that, though I don't think we can run it through the web interface as we'll need to create the MV. /cc Ryan McGuire

          aboudreault Alan Boudreault added a comment -

          I haven't done any performance tests at the moment. Carl Yeksigian, have you done some? If not, I guess I can do that this week.

          aboudreault Alan Boudreault added a comment -

          Is it related to the issue I found when adding a node?

          iamaleksey Aleksey Yeschenko added a comment -

          Do we have any performance (cstar) numbers yet?

          carlyeks Carl Yeksigian added a comment -

          Christopher Batey Thanks for trying out the materialized views branch! I've pushed up a rebased branch against trunk (after 8099 was integrated) and have fixed the issue you found.

          chbatey Christopher Batey added a comment - edited

          I got a lot of these in my log when playing with the branch, however the views I created worked fine.

          WARN  [CompactionExecutor:111] 2015-07-07 10:57:09,023 MaterializedViewBuilder.java:189 - Materialized View failed to complete, sleeping 5 minutes before restarting
          org.apache.cassandra.exceptions.InvalidRequestException: Missing mandatory PRIMARY KEY part host_id
                  at org.apache.cassandra.cql3.statements.RequestValidations.invalidRequest(RequestValidations.java:199) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.RequestValidations.checkTrue(RequestValidations.java:63) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.RequestValidations.checkNotNull(RequestValidations.java:140) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.buildPartitionKeyNames(ModificationStatement.java:395) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.getMutations(ModificationStatement.java:752) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:727) ~[main/:na]
                  at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:293) ~[main/:na]
                  at org.apache.cassandra.db.SystemKeyspace.setMaterializedViewBuilt(SystemKeyspace.java:453) ~[main/:na]
                  at org.apache.cassandra.db.SystemKeyspace.finishMaterializedViewBuildStatus(SystemKeyspace.java:472) ~[main/:na]
                  at org.apache.cassandra.db.view.MaterializedViewBuilder.run(MaterializedViewBuilder.java:174) ~[main/:na]
                  at org.apache.cassandra.db.compaction.CompactionManager$13.run(CompactionManager.java:1364) [main/:na]
          
          carlyeks Carl Yeksigian added a comment -

          There is no ALTER MV in the patch, but I've created an issue for it (CASSANDRA-9736). I'll make the changes so that you can't drop a column or a table depended on by a MV.

          iamaleksey Aleksey Yeschenko added a comment -

          Do we have ALTER MV in this patch? I guess if we cannot (yet) remove columns from an MV, then the restriction on the base table makes sense, temporarily.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          I think the problem comes in when you have a complex data model with transparent views of a root data source; I could easily see someone dropping a column without realizing there was a specific MV w/that data in it that really needed it.

          A decent alternative may be introducing the CASCADES keyword to indicate that you're ok with it taking out the MV columns, similar to postgres and others.

          iamaleksey Aleksey Yeschenko added a comment -

          That seems a bit too strict. I don't see why we wouldn't allow dropping a column that's not in a primary key in any of the MVs.

          aboudreault Alan Boudreault added a comment -

          +1 to the drop table too. Makes sense.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          +1 on the "don't allow dropping columns that are part of a MV". While we're not an RDBMS, the parallel to treatment of FK's on a relational schema and dropping tables makes sense and should be the least surprising thing for users.

          carlyeks Carl Yeksigian added a comment -

          This is mainly historic from when it was going to be a global index, since 2i doesn't require dropping the indexes first. I'd be in favor of changing it back to disallow dropping columns which are part of the materialized view. The same would apply to dropping a table which has materialized views on it.

          Looking at the new node issue now.

          aboudreault Alan Boudreault added a comment -

          While testing, I noticed that if we drop a column that is used by a materialized view (PK), the view is dropped silently. It looks like this is the desired behavior, since a log entry is written saying: "MigrationManager.java:381 - Drop table 'ks/users_by_state'".

          I just wanted to raise a suggestion: would it be better (and possible) to force the user to delete the materialized view explicitly before doing this operation, rather than dropping the MV silently? Not a strong opinion here; I'm just thinking that it could be annoying for users to discover that an MV has disappeared.

          aboudreault Alan Boudreault added a comment - edited

          Carl Yeksigian Here is a small script to reproduce the issue discussed on hipchat: test-view-data.sh. Give that script a try and let me know if you can also reproduce the issue. It is possible that you will have to run the script 2-3 times, since I noticed that sometimes the count is OK.

          For the record, the issue is that my materialized view is not synchronized properly with the data in my main table. There are no hints or batchlog entries recorded.

          Output of the issue:

          MAIN TABLE COUNT
          Consistency level set to ALL.
          
           count
          -------
               8
          
          (1 rows)
          
          MV TABLE COUNT - node1
          Consistency level set to ALL.
          
           count
          -------
               7
          
          (1 rows)
          MAIN TABLE DATA
          Consistency level set to ALL.
          
           id | v
          ----+---
            5 | 5
            1 | 1
            8 | 8
            2 | 2
            4 | 4
            7 | 7
            6 | 6
            3 | 3
          
          (8 rows)
          
          MV TABLE DATA
          Consistency level set to ALL.
          
           v | id
          ---+----
           1 |  1
           8 |  8
           2 |  2
           4 |  4
           7 |  7
           6 |  6
           3 |  3
          
          (7 rows)
          
          
          carlyeks Carl Yeksigian added a comment -

          Branch which is based off 8099 branch.

          T Jake Luciani has also added support for composite partition keys in the MV table.

          Joshua McKenzie - this is now ready to start review.

          tjake T Jake Luciani added a comment -

          One minor requirement that wasn't spelled out is support for composite primary keys in the MV table.

          tjake T Jake Luciani added a comment -

          Do you mean "so the stage can make progress on other updates while waiting" or do you mean actual deadlock?

          Correct, so other mutations can make progress. If you filled up the mutation queue with mutations on the same key, you would virtually deadlock all others.

          jbellis Jonathan Ellis added a comment -

          > We also made the tasks reschedule themselves in the stage if they are waiting for a lock and cannot acquire it, so that the stage doesn't deadlock.

          Do you mean "so the stage can make progress on other updates while waiting" or do you mean actual deadlock?

          carlyeks Carl Yeksigian added a comment -

          Update on this ticket:

          The consistency issue was caused by not using the batch log for the materialized view replica mutations. This meant that we weren't guaranteed that the materialized view would get all of the updates, including tombstones, which is why there were additional entries in the MV.

          Adding this fixed the consistency issue, but lengthened the time to execute a materialized view mutation, which exacerbated another issue: lock contention for the partition. We need to make sure that neither the MV nor the base table is updated until we have finished computing the new mutations, but trying to acquire that lock also backs up the mutation stage and causes the batchlog mutations to fail to run. Much like counters have their own stage, we've moved MV mutations to their own stage, as well as batchlog mutations. We also made the tasks reschedule themselves in the stage if they are waiting for a lock and cannot acquire it, so that the stage doesn't deadlock.
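
The "reschedule instead of block" pattern described above can be sketched as follows (a simplified, hypothetical illustration, not the actual Cassandra code): a task that fails to take the partition lock resubmits itself to its stage rather than parking a worker thread, so other mutations keep flowing.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: tryLock() never blocks; on failure the task goes to
// the back of the stage's queue instead of tying up a worker thread.
final class RescheduleOnContention
{
    static void runOrReschedule(Executor stage, ReentrantLock partitionLock, Runnable mutation)
    {
        stage.execute(() -> {
            if (partitionLock.tryLock())
            {
                try { mutation.run(); }
                finally { partitionLock.unlock(); }
            }
            else
            {
                // Another mutation holds the lock: re-queue ourselves.
                // (A real implementation would also want some back-off
                // to avoid busy re-queueing under heavy contention.)
                runOrReschedule(stage, partitionLock, mutation);
            }
        });
    }
}
```
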

          These changes have improved MV so that we can overwrite a single key many times and get a consistent result back at the end of the run. We have a dtest which tests that scenario.

          Collections included in the view are properly handled now as well, so the range tombstones are created for the view properly; all of the types have now been verified by a unit test.

          The next steps are to rebase this on top of 8099 for 3.0.

          Many thanks to T Jake Luciani for all of his help looking into these problems.

          carlyeks Carl Yeksigian added a comment -

          1. No decision has been made on whether other refresh modes will be added, but the focus has only been on an eventually consistent mode.
          2. The build happens in the background. In the under development branch, there is no way to monitor the progress other than looking at the table on each node which stores the MV build progress.

          jkrupan Jack Krupansky added a comment -

          1. Has a decision been made on refresh modes? It sounds like the focus is on "always consistent", as opposed to manual refresh or one-time without refresh or on some time interval, but is that simply the default, preferred refresh mode, or the only mode that will be available (initially)?

          2. What happens if an MV is created for a base table that is already populated? Will the operation block while all existing data is propagated to the MV, or will that propagation happen in the background (in which case, is there a way to monitor its status and completion?), or is that not supported (initially)?

          carlyeks Carl Yeksigian added a comment -

          Can someone clarify whether, based on the current implementation, the Materialized View (or some of the Materialized Views, if there are more than one) can contain the mutation but the base table does not - or vice versa?

          That's something that is still under development, hence the comments above that it is an outstanding issue. We're trying very hard to ensure that the materialized view that we ship is very unlikely to have a mutation in the base table which isn't in the view.

          brianmhess Brian Hess added a comment -

          Can someone clarify whether, based on the current implementation, the Materialized View (or some of the Materialized Views, if there are more than one) can contain the mutation but the base table does not - or vice versa? Can you clarify what happens in each of the failure scenarios? It will be very important for us to know for sure when we discuss this with users/customers.

          carlyeks Carl Yeksigian added a comment -

          Because of a race on index maintenance, or because we're load shedding?

          It's a race - the base mutation shouldn't be applied if the view mutations (and tombstones) aren't. It is supposed to block until the view mutation is applied before applying the base mutation; a lock is being held by the mutating thread, so we shouldn't be doing other mutations at the same time.

          jbellis Jonathan Ellis added a comment -

          When overloaded with inserts to a single partition key in the base, tombstones aren't always properly generated for the view

          Because of a race on index maintenance, or because we're load shedding?

          carlyeks Carl Yeksigian added a comment -

          Pushed a new branch which includes the materialized view syntax Jonathan Ellis proposed:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, x, y, z FROM users
          PRIMARY KEY (age, user_id)
          

The branch also includes the consistency updates proposed earlier in this ticket, as well as initial support for collections.

          Main outstanding issues:

          • When overloaded with inserts to a single partition key in the base, tombstones aren't always properly generated for the view; this inconsistency is not corrected over time, since repair doesn't compare the view against the base
          • An insert to a collection produces a range tombstone which isn't translated properly against the view; only range tombstones covering the whole row are handled
          rustyrazorblade Jon Haddad added a comment - - edited

          Not a fan of materialized query tables personally. We frequently refer to the various tables that people need to create to query their data as materialized views, and people seem to understand immediately.

          brianmhess Brian Hess added a comment -

          I really don't want to re-ignite the nomenclature debate, but another name that might fit better with how we've talked about these tables in the past is DB2's name: Materialized Query Tables
          Anyone like or hate that name more?

          jkrupan Jack Krupansky added a comment -

          When exactly would population of the MV occur? What refresh options would initially be supported? Would population/refresh begin instantly when the MV is created, by default, or would an explicit command be required to begin population? Earlier I linked to the Oracle doc on MV, so a comparison to Oracle for refresh options might be nice, especially for users migrating from Oracle. Where would the state of refresh be stored, and how can a user monitor it? On each node of the base table?

          PostgreSQL doesn't seem to have as many options:
          http://www.postgresql.org/docs/9.3/static/sql-creatematerializedview.html

          With RF>1, which of the nodes containing a given token would push an update to the MV? All of them? Presumably the push can be token-aware, so that each push only goes to RF=n nodes based on the PK of the MV insert row. Would a consistency level be warranted for the push? Would there be hints as well? And repair of an MV if the rate of updates of the base table overwhelms the update bandwidth of the (many) MVs for the base table?

          Any thoughts on throttling of the flow of updates from other nodes so that population of a MV does not overwhelm or interfere with normal cluster operation? What default, and what override? What would be a reasonable default, and what would be best practice advice for a maximum?

          carlyeks Carl Yeksigian added a comment -

          MVs allow us to route queries on high-cardinality non-partition columns so that they are single-node queries rather than requiring a fan-out on read. The reason this was renamed is that we want to prevent a fan-out after we get back the set of keys, so we store the data that we want to query alongside the keys, thus creating something closer to an MV than a global index. In addition, we want to make sure that users understand why they can't select arbitrary columns from the table; they will only be able to use the columns they specify when creating the MV.

          We also want to take something that is done by many different users currently, which is denormalizing the data into new tables, and bring that in so that we can make sure that the edge cases are handled properly, and make it easier for users to model their data.
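The hand-rolled denormalization Carl describes — which MVs are intended to automate — can be sketched in plain Python (purely illustrative; table names and columns are made up):

```python
# A base table keyed by user_id.
users = {
    "u1": {"age": 33, "name": "sam"},
    "u2": {"age": 33, "name": "kim"},
    "u3": {"age": 41, "name": "lee"},
}

# Without a view, finding users by age means touching every partition.
scan_result = sorted(uid for uid, row in users.items() if row["age"] == 33)

# A denormalized "view" keyed by age turns that into a single lookup;
# note you can only read back the columns you chose to denormalize.
users_by_age = {}
for uid, row in users.items():
    users_by_age.setdefault(row["age"], {})[uid] = {"name": row["name"]}

lookup_result = sorted(users_by_age[33])
```

Both paths return the same users; the difference is that the view answers from one "partition" instead of scanning them all.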

          Is it fair to say that the primary technique for using this feature is to have one base table and n views of that table

          Yes, as many tables as the users currently create and use themselves, they should be able to create materialized views for.

          Would it also be sensible to select a subset of rows?

          This won't be in the initial version, but it is something that we will probably add on later, along with richer select statement support

          jkrupan Jack Krupansky added a comment -

          Back to the original description, will the revised MV purpose address the high cardinality issue? That may depend on what guidance the spec offers for how data modelers should set up the primary key columns in terms of partition (or routing!) columns vs. clustering columns.

          Is the basic concept that although the selected rows might be scattered across multiple nodes in the base table, the goal is that they would cluster together on a single node for the MV table based on careful specification of partition key columns in the MV?

          jkrupan Jack Krupansky added a comment -

          Is it fair to say that the primary technique for using this feature is to have one base table and n views of that table, each with a different selection of the base columns as the primary key of the view, with all rows selected, possibly projected differently but with different keys?

          Would it also be sensible to select a subset of rows? Although that might confuse some users who might think it would give them sophisticated ad hoc queries when in fact the query column values are fixed. For example, select all rows for a specific state. In this way, it doesn't offer what a global index would offer.

          philipthompson Philip Thompson added a comment -

          We have some basic test cases written, but now that the plan is to do full MV instead of GI, those are all pretty stale, and probably won't help you.

          jkrupan Jack Krupansky added a comment -

          Still waiting for an updated description for the ticket. In particular, what specific use cases is this feature designed to handle well, and is it definitively expert-only, or will there be use cases that are safe for normal users. The key thing (ha ha!) is whether this feature will provide capabilities to make it much easier for people to migrate from SQL to Cassandra in terms of the denormalization process, and do it in a way that people can pick up easily in Data Modeling 101 training. A couple of examples would help a lot - like test cases.

          iamaleksey Aleksey Yeschenko added a comment -

          having the underlying table directly exposed makes it clear that you can only access what you denormalize (this is obvious for an MV, less so with a GI) and simplifies the read path (the CQL parts of the read path).

          That's one of the main reasons I want this to be fully MV, too. So, reiterating my +1.

          jbellis Jonathan Ellis added a comment -

          It also means you don't give up token-aware routing (CASSANDRA-8517).

          slebresne Sylvain Lebresne added a comment -

          I'm warming up to the idea of calling it MV but only IF we're committed to fleshing it out to match SELECT.

          +1. On top of having a clear path for added flexibility, having the underlying table directly exposed makes it clear that you can only access what you denormalize (this is obvious for an MV, less so with a GI) and simplifies the read path (the CQL parts of the read path).

          carlyeks Carl Yeksigian added a comment - - edited
          1. It is going to be the same mechanism, but we don't want to use the same consistency level as the insert itself. This way, we can ensure that at least one node has seen all of the updates, and thus we can generate the correct tombstone based on the previous values; we are trying to make the dependency between the data table and the index table redundant, so we need to make sure a quorum is involved in the write
          2. Each replica makes a GI update independently, based on the data that it has, which means that we might issue updates for an older update that hasn't made it to all of the replicas yet. To cut down on the amount of work that the indexes do, a pretty easy optimization is to just send the index mutation to the index replica that the data node will wait on instead of sending them to all of the index replicas
          3. If we ever get into a situation where we have data loss in either the base table or the index table (both would likely go together), we would really need to run a rebuild, since there is no guarantee that extra data wouldn't be present in the index which isn't in the data table. Otherwise, we can repair the data and index tables independently, so that a repair issued on the data table should also repair all of the global index tables
          jbellis Jonathan Ellis added a comment -

          A couple questions:

          1. What is the thinking behind using a customized batchlog vs the existing one?
          2. Is each replica making GI updates, or just one?
          3. What do we do to GI on repair of the base data table?
          carlyeks Carl Yeksigian added a comment -

          At NGCC, there was some discussion about how we are going to handle eventual consistency in this design.

          Benedict's proposal earlier in this ticket, where each data replica performs a local read-before-write and pushes out the value to the index replicas, is the one which provides eventual consistency guarantees in line with users' expectations. In order to make sure that the values are consistent in the GI, it will also need to use a batch log at quorum to ensure that the update eventually gets applied.

          Here is the process that GI will take on mutation:

          • If the mutations will be affected by global indexes (either the target or an included column), coordinator creates batch log which will wait for a quorum to ack mutations before being deleted
          • Coordinator pushes mutations out to data replicas
          • Data replica checks if mutation touches global index
          • Data replica takes lock on global index row in order to make sure that the generated mutations are consistent
          • Data replica creates index mutation, blocks until that is complete
          • Data replica applies data mutation, acks mutation

          The ordering of the index mutation before the data mutation means that if we fail at the index mutation, we will generate the same values the next time this mutation is applied.
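The per-replica ordering in the steps above might look like the following sketch (hypothetical helper names; not the actual implementation):

```python
import threading

_partition_locks = {}

def apply_with_views(partition_key, base_mutation, make_view_mutations, applied):
    """Hold the partition lock while deriving and applying the view
    mutations, and only then apply the base mutation. If the node fails
    before the base write, replaying the base mutation regenerates the
    same view mutations, so nothing is lost. Illustrative only."""
    lock = _partition_locks.setdefault(partition_key, threading.Lock())
    with lock:
        for vm in make_view_mutations(base_mutation):
            applied.append(("view", vm))   # stand-in for applying a view mutation
        applied.append(("base", base_mutation))  # base applied last

applied = []
apply_with_views("pk1", "INSERT x=1",
                 lambda m: ["view-tombstone", "view-insert"], applied)
```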

          We will send to exactly one replica: the one whose position in the ring relative to the index token corresponds to our own position relative to the base token.
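That pairing rule — each base replica forwarding to the single view replica in the corresponding ring position — could be expressed as (hypothetical sketch, not the real replica-selection code):

```python
def paired_view_replica(base_replicas, view_replicas, self_addr):
    """Pick the one view replica this node is responsible for: the
    replica at the same offset in the view's replica list as this
    node's offset in the base replica list. Illustrative only."""
    offset = base_replicas.index(self_addr)
    return view_replicas[offset % len(view_replicas)]

# Node n2 (second base replica) pairs with the second view replica.
target = paired_view_replica(["n1", "n2", "n3"], ["n4", "n5", "n6"], "n2")
```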

          iamaleksey Aleksey Yeschenko added a comment -

          Do you mean that from an implementation perspective?

          Yes.

          mbroecheler Matthias Broecheler added a comment -

          Sorry for a short reply. Just wanted to note that I see this being the other way around. Full-on materialized views are the infrastructure for 'global indexes', not the other way around.

          Do you mean that from an implementation perspective? Conceptually, I would argue it the other way around. When you build support for full-on materialized views, you still have to deal with the decision: is the primary key high or low selectivity, i.e. do I maintain this view locally or globally. And then, depending on that decision, you would utilize the 2i infrastructure (or something like it) or some other infrastructure for global maintenance of derived records.

          Plus, with full-on materialized views you are mixing in other challenges, like dealing with multiple columns (e.g. CASSANDRA-5402).
          So, conceptually at least, the initial atomic problem seems to be how to consistently maintain derived row records on remote nodes. That would be enough to build global indexes on a single column. From there, this could be extended to full-on materialized view maintenance.

          iamaleksey Aleksey Yeschenko added a comment -

          Much like with 2i, once that is in place one can utilize the infrastructure to build other things on top - including materialized views which will need this to begin with (if the primary key of your materialized view has high selectivity).

          Sorry for a short reply. Just wanted to note that I see this being the other way around. Full-on materialized views are the infrastructure for 'global indexes', not the other way around.

          mbroecheler Matthias Broecheler added a comment -

          I think the discussion around materialized views (which I would love to see in C* at some point) is distracting from what this ticket is really about: closing a hole in the indexing story for C*.

          In RDBMS (and pretty much all other database systems), indexes are used to efficiently retrieve a set of rows identified by their columns values in a particular order at the expense of write performance. By design, C* builds a 100% selectivity index on the primary key. In addition, one can install secondary indexes. Those secondary indexes are useful up to a certain selectivity %. Beyond that threshold, it becomes increasingly more efficient to maintain the index as a global distributed hash map rather than a local index on each node. And that's the hole in the indexing story, because those types of indexes must currently be maintained by the application.

          I am stating the obvious here to point out that the first problem is to provide the infrastructure to create that second class of indexes while ensuring some form of (eventual) consistency. Much like with 2i, once that is in place one can utilize the infrastructure to build other things on top - including materialized views which will need this to begin with (if the primary key of your materialized view has high selectivity).

          As for nomenclature, I agree that "global vs local" index is a technical distinction that has little to no meaning to the user. After all, they want to build an index to get to their data quickly. How that happens is highly secondary. Initially, it might make sense to ask the user to specify the selectivity estimate for the index (defaulting to low) and for C* to pick the best indexing approach based on that. In the future, one could utilize sampled histograms to help the user with that decision.
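Matthias's local-vs-global trade-off can be caricatured with a toy cost model (illustrative numbers and names only, not Cassandra's actual planner): a local index query fans out to every node, so its cost grows with cluster size, while a global index contacts only the nodes owning the indexed value.

```python
def preferred_index(num_nodes, matching_rows):
    """Toy cost model: a local index costs one lookup per node
    (fan-out), a global index costs one index lookup plus one
    targeted fetch per matching row. High-cardinality data (few
    matches) favours the global index on large clusters."""
    local_cost = num_nodes           # one index lookup on every node
    global_cost = 1 + matching_rows  # index lookup + targeted fetches
    return "global" if global_cost < local_cost else "local"
```

For example, a 50-node cluster returning a handful of rows favours a global index, while a low-cardinality query matching many rows on a small cluster favours a local one.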

          jkrupan Jack Krupansky added a comment -

          It would be helpful if someone were to update the description and primary use case(s) for this feature.

          My understanding of the original use case was to avoid the fan out from the coordinator node on an indexed query - the global index would contain the partition keys for matched rows so that only the node(s) containing those partition key(s) would be needed. So, my question at this stage is whether the intention is that the initial cut of MV would include a focus on that performance optimization use case, or merely focus on the increased general flexibility of MV instead. Would the initial implementation of MV even necessarily use a GI? Would local vs. global index be an option to be specified?

          Also, whether it is GI or MV, what guidance will the spec, doc, and training give users as to its performance and scalability? My concern with GI was that it works well for small to medium-sized clusters, but not with very large clusters. So, what would the largest cluster that a user could use a GI for? And also how many GI's make sense. For example, with 1 billion rows per node, and 50 nodes, and a GI on 10 columns, that would be... 1B * 50 * 10 = 500 billion index entries on each node, right? Seems like a bit much for a JVM heap or even off-heap memory. Maybe 500M * 20 * 4 = 40 billion index entries per node would be a wiser upper limit, and even that may be a bit extreme.

          jkrupan Jack Krupansky added a comment -

          Oracle has lots of options for the REFRESH clause of the CREATE MATERIALIZED VIEW statement:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm

          Notes on that syntax:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm#i2064161

          Full MV syntax:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm

          You can request that a materialized view be automatically refreshed when the base tables are updated using the "REFRESH ON COMMIT" option. The update transaction pauses while the views are updated - "Specify ON COMMIT to indicate that a fast refresh is to occur whenever the database commits a transaction that operates on a master table of the materialized view. This clause may increase the time taken to complete the commit, because the database performs the refresh operation as part of the commit process."

          You can also refresh on time intervals, on demand, or no refresh ever. Originally MV was known as SNAPSHOT - a one-time snapshot of a view of the base tables/query.

          Oracle has a FAST refresh, which depends on a MATERIALIZED VIEW LOG, which must be created for the base table(s). Otherwise a COMPLETE refresh is required.
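The log-plus-fast-refresh mechanism described above looks roughly like this in Oracle SQL (a sketch only; the table and columns — orders, order_id, status — are hypothetical, and fast refresh has further restrictions documented by Oracle):

```sql
-- A materialized view log on the base table records changes,
-- which is what makes incremental (FAST) refresh possible.
CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY;

-- REFRESH FAST ON COMMIT: the view is updated incrementally as part of
-- each committing transaction against the base table, at the cost of
-- a longer commit.
CREATE MATERIALIZED VIEW open_orders
  REFRESH FAST ON COMMIT
  AS SELECT order_id, status FROM orders WHERE status = 'OPEN';
```

Without the log, only a COMPLETE refresh (a full rebuild of the view) is available.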

          jjirsa Jeff Jirsa added a comment -

          As an end user, MV + matching SELECT described above looks appealing, and would match what I would expect from hearing the name and being familiar with VIEWs in other databases.

          jbellis Jonathan Ellis added a comment -

One advantage to MV is that people are somewhat more used to the MV lagging the underlying data. PG goes so far as to require you to manually issue refresh commands. I think that makes them unusable, so I only mention that to illustrate an extreme – with that precedent, having us say "MV are eventually consistent" sounds quite reasonable!

          (Oracle has had self-updating MV forever. I'm not actually sure what kind of transactionality guarantees you get there.)

          iamaleksey Aleksey Yeschenko added a comment -

          Aggregation is probably unfeasible. Having the rest of it would be amazing. +1 to that.

          jbellis Jonathan Ellis added a comment -

          I'm warming up to the idea of calling it MV but only IF we're committed to fleshing it out to match SELECT.

To start, that can match our currently envisioned functionality:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, x, y, z FROM users
          PRIMARY KEY (age, user_id)

          but next we need to add support for WHERE and UDF:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, substring(phone_number, 3) AS area_code, y, z FROM users
          WHERE area_code in (512, 513, 514)
          PRIMARY KEY (age, user_id)

          Building the view should take advantage of local indexes where applicable.

          Ideally we would support aggregation in MV as well. Not sure if that is feasible.
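For illustration, under the first proposal above the view would be queried like any ordinary table, with the view's partition key driving a single-partition read instead of a cluster-wide fan-out (a sketch of the intended usage, not committed syntax):

```sql
-- Hypothetical query against the proposed users_by_age view:
-- age is the view's partition key, so this hits one partition.
SELECT user_id, x, y, z
FROM users_by_age
WHERE age = 33;
```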

          jbellis Jonathan Ellis added a comment -

          While I agree that a VIEW in the SQL world has more utility, so does SELECT

Correct, but a view should still be able to represent whatever SELECT can! It's not reasonable to expect it to do more, but it's absolutely reasonable to expect it to match, because that's the definition.

          My personal preference would be to have a "cardinality" option clause with option values like "low", "medium", "high", and "unique".

          I don't think we're at the point where we can afford that high a level of abstraction.

          iamaleksey Aleksey Yeschenko added a comment -

          Why not call the feature "high cardinality index" since that's the use case it is focused on, right?

          As I see it, the focus is on making denormalization trivial, not on indexing - which is why I agree with Sylvain that we should drop the KEYS-only part, and why 'index' in the name is misleading.

          jkrupan Jack Krupansky added a comment -

          Why not call the feature "high cardinality index" since that's the use case it is focused on, right?

          My personal preference would be to have a "cardinality" option clause with option values like "low", "medium", "high", and "unique". The default being "low". A global index would be implied for "high" and "unique" cardinality.

          jjordan Jeremiah Jordan added a comment -

My main issue is that "Global Index" means nothing to me, and I work in this space. So I think using the word materialized or denormalized in the name would do a better job of making clear what they are. I don't really care what other words are used in the name.

That being said, I don't see a problem with VIEW. While I agree that a "VIEW" in the SQL world has more utility, so does "SELECT". Seems to me we give you a subset of VIEW, just like we give you a subset of SELECT. I don't think it is confusing to not be able to materialize an arbitrary query that has joins and stuff in it, as we don't let you do that stuff in normal queries. And if we flesh this out over time with functions, composites, partial, etc., you get closer and closer to what you can do with a traditional VIEW.

          iamaleksey Aleksey Yeschenko added a comment -

          As a data point, my own preference would be to call them "denormalized views". As my preference would go to never do the "only keys", I can agree that calling them "index" is a bit misleading. But I can also agree that "materialized views" might confuse SQL users that might expect more from them than what we'll offer. So granted "denormalized views" doesn't reference an existing term but I see that as a feature. Plus denormalization is really what we're doing there so ...

          Fair enough. I'd +1 that.

          jbellis Jonathan Ellis added a comment -

          Denormalized vs materialized is just using an unusual synonym. The problematic term is the "view" part.

          (Call them materialized indexes or denormalized indexes if you like, but don't call them views.)

          slebresne Sylvain Lebresne added a comment -

          As a data point, my own preference would be to call them "denormalized views". As my preference would go to never do the "only keys", I can agree that calling them "index" is a bit misleading. But I can also agree that "materialized views" might confuse SQL users that might expect more from them than what we'll offer. So granted "denormalized views" doesn't reference an existing term but I see that as a feature. Plus denormalization is really what we're doing there so ...

          jbellis Jonathan Ellis added a comment -

          Here is the difference: the definition of MV is to materialize an arbitrary query. That is not what we offer here.
