Details

      Description

      Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned.
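
      For illustration, a minimal sketch of the pattern a global index would bake in (the table and column names below are hypothetical, not part of this ticket): the index is itself a table partitioned by the indexed value, so a high-cardinality lookup becomes a single-partition read instead of a cluster-wide fan-out.

        -- hypothetical base table
        CREATE TABLE users (
            user_id uuid PRIMARY KEY,
            email   text,
            age     int
        );

        -- hypothetical application-maintained "global index": partitioned by
        -- the indexed value rather than by the base table's key
        CREATE TABLE users_by_email (
            email   text,
            user_id uuid,
            PRIMARY KEY (email, user_id)
        );

        -- the lookup touches one partition (and its replicas), not most nodes
        SELECT user_id FROM users_by_email WHERE email = 'alice@example.com';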

      1. test-view-data.sh
        3 kB
        Alan Boudreault
      2. users.yaml
        0.8 kB
        Alan Boudreault

        Issue Links

          Activity

          jbellis Jonathan Ellis added a comment -

          Most application-maintained indexes solve this problem by denormalizing the base table row into the index entry. The problem is that this means we can't do lazy updates of the index; we need to keep the index perfectly (or, "eventually perfectly") in sync with the base table. Which in turn means we need to linearize updates to an indexed table. That was a performance hit but otherwise reasonable when we did it for local indexes; for partitioned indexes it's not feasible.

          I suppose we could punt and say "we'll give you a denormalized index but you have to swear that only one client will update any given row in that table at a time" which is actually a fairly common use case... but it does seem like the sort of thing that will bite the incautious user. Worse, it will appear to work but give subtly incorrect results.

          jbellis Jonathan Ellis added a comment -

          The most straightforward option is to take an approach similar to our local indexes:

          1. At insert/update time, add a new index entry (as part of an atomic batch with the original update; sketched below), with the timestamp of the data cell
          2. At read time, fetch the rows indicated by the index and remove stale index entries. Since we delete with the same timestamp as the index entry, this is safe wrt concurrent updates
          3. We can still use compaction of the base table to clean out stale records, but this will now generate updates or hints to the index partition
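
          As a CQL-level sketch of step 1 (the users / users_by_age tables here are hypothetical, purely for illustration): the index entry is written in the same atomic (logged) batch as the base update, reusing the data cell's timestamp so that the read-time cleanup in step 2 can safely delete stale entries at that same timestamp.

            -- hypothetical index table, partitioned by the indexed value
            CREATE TABLE users_by_age (
                age     int,
                user_id uuid,
                PRIMARY KEY (age, user_id)
            );

            -- base update and index entry in one atomic batch, sharing an
            -- explicit timestamp
            BEGIN BATCH USING TIMESTAMP 1418400000000000
                UPDATE users SET age = 25
                 WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
                INSERT INTO users_by_age (age, user_id)
                VALUES (25, 62c36092-82a1-3a00-93d1-46196ee77204);
            APPLY BATCH;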

          The big drawback is that reads require an O(N) multiget in the coordinator: reading the index entries is a single request, but then each row to fetch may be on a different replica.

          Put another way, this will give us indexes that are good at very high cardinality – ideally a single row for each indexed value – to go with our existing low-cardinality indexes, but we still have a hole for "medium cardinality" data.

          iamaleksey Aleksey Yeschenko added a comment -

          For the record, I think we should leave it to people's client code. We don't need more complexity on our read/write paths when this can be done client-side.

          jbellis Jonathan Ellis added a comment -

          The counterpoint is that we shouldn't require ~12 client codebases (if done by the driver) or 1000s (if done by app code) to invent this instead of the server.

          jbellis Jonathan Ellis added a comment - - edited

          The problem is that this means we can't do lazy updates of the index; we need to keep the index perfectly (or, "eventually perfectly") in sync with the base table.

          To clarify: Suppose you have an index on the age of users, and we have an entry for 24: user1 in the index table. Now two threads update user1's age; one to 25, and one to 26. Each thread will

          1. Read existing age
          2. Delete index entry for existing age
          3. Update user record and insert index entry for new age

          The problem is that if each thread reads the existing age of 24, then we'll end up with both 25: user1 and 26: user1 index entries. (Atomic batches do not help with this.) With normal indexes, we clean up stale entries at compaction + read time; we could still do this here but the performance penalty is a lot higher.
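
          Spelled out against the hypothetical users / users_by_age tables from the sketch above (illustration only), each thread, having read age = 24, issues something like:

            -- thread A (thread B runs the same batch with 26 instead of 25)
            BEGIN BATCH
                DELETE FROM users_by_age
                 WHERE age = 24 AND user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
                INSERT INTO users_by_age (age, user_id)
                VALUES (25, 62c36092-82a1-3a00-93d1-46196ee77204);
                UPDATE users SET age = 25
                 WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
            APPLY BATCH;
            -- both threads delete the 24 entry they read, so the index ends up
            -- with both (25, user1) and (26, user1); the batch is atomic but
            -- cannot see the concurrent update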

          jbellis Jonathan Ellis added a comment - - edited

          Sylvain had a different idea:

          Instead of just writing a 24, user1 tombstone, write a tombstone that indicates what the value changed to: 24, user1 -> 25 for one thread, and 24, user1 -> 26 for the other.

          When the tombstones are merged for compaction or read you can say, "Wait! 2 people tried to erase that, one with 25 the other with 26, let's check which one has a higher timestamp and delete any obsolete entries."

          Note that this requires reading the existing data row to get the old indexed value, but in exchange for introducing read-before-write we also get to add the desired denormalization into the index and no longer have to check the original data row for each read.

          jbellis Jonathan Ellis added a comment -

          This does mean that a tombstone is not "just a tombstone," i.e., we will have to keep all tombstones of this type for gcgs or a similar period, not just "the most recent post-merge tombstone" as currently.

          But it should be relatively rare to have racing tombstones, so the penalty vs the status quo is not actually large in practice.

          /cc Matt Stump

          slebresne Sylvain Lebresne added a comment -

          I'll note that the idea above has the downside of being only eventually consistent, but with no good user control about how eventual (we're dependent on when read/compaction happen to "heal" the "denormalized index").

          benedict Benedict added a comment -

          I may be being dim here, but it seems to me that with this scheme you would need to write a reverse record of 25, user1->replaced 24, so when you lookup on 25, you can then read 24 and check there were no competing updates? Either that or read the original record, which sort of defeats the point of denormalisation...

          jjordan Jeremiah Jordan added a comment - - edited

          I'll note that the idea above has the downside of being only eventually consistent, but with no good user control about how eventual (we're dependent on when read/compaction happen to "heal" the "denormalized index").

          I think this might be OK, as this is really only an issue in the case of a race, so both tombstones will end up in memtables and be resolved immediately, or in sstables written near each other in time (which should hopefully compact together fairly quickly). In both cases resolving the conflict should happen fairly quickly, though there are probably edge cases.

          The issue I see here is that compaction now has to issue queries, and we need to make sure those deletes issued by compaction MUST happen, or else the index will get out of whack, and we will have already thrown out the extra tombstone.

          jjordan Jeremiah Jordan added a comment -

          I may be being dim here, but it seems to me that with this scheme you would need to write a reverse record of 25, user1->replaced 24, so when you lookup on 25, you can then read 24 and check there were no competing updates? Either that or read the original record, which sort of defeats the point of denormalisation...

          No, you resolve it in compaction or on lookup of "24". Compaction sees the two different tombstones for 24 and then resolves them to the correct new value, deleting the wrong new value. Or a look up of "24" pulls in the two tombstones, resolves them to the correct one, deletes the wrong one, and returns none to the user.

          benedict Benedict added a comment -

          No, you resolve it in compaction or on lookup of "24".

          That only resolves deletes. How do you resolve seeing the wrong data?

          jbellis Jonathan Ellis added a comment -

          That's why Sylvain said it's "eventually consistent, but with no good user control about how eventual."

          jjordan Jeremiah Jordan added a comment -

          If you have the race, you may briefly see the other value, but it's a race, and it would be just like you read before update #2 happened, so as long as the period of time where you can get the "wrong" data is small, it is ok.

          benedict Benedict added a comment -

          Jeremiah Jordan, is that in response to me? Because I don't see how this would work: if both deleted 24 and inserted 25 and 26, then we now have a record of both 25 and 26 mapping to user1, despite only one of them being true, and no means of tidying it up. So people can indefinitely look up on both values. This is only resolved if we look up the original record after every 2i result, which maybe was always the plan. I'm not sure.

          jjordan Jeremiah Jordan added a comment -

          Benedict: say two threads update age = null. They generate two 24, user1->null tombstones; those are OK and not a problem since both updated to the same value. We also need to generate null: user1 as an append to the index. Then an update to age=25 generates tombstone null, user1->25 and age=26 generates tombstone null, user1->26. Those two tombstones will be resolved on compaction/memtable clash, or when someone asks for age=null in a query. This will require keeping track of null columns in the index. Something similar would need to be done for a full delete of the row.

          benedict Benedict added a comment -

          New suggestion:

          Since we're performing read-before-write anyway with this suggestion, why not simply perform a local only read-before-write on each of the nodes that owns the main record whilst writing the update - instead of issuing a complex tombstone, we simply issue a delete for whichever value is older on reconcile. Since we always CAS local updates, we will never get missed deletes, however we will issue redundant/duplicate deletes (RF many) - but they should be coalesced in memtable almost always, so it's a network cost only. There are probably tricks we can do to mitigate this cost, though, e.g. having each node (deterministically) pick two of the possible owners of the 2i entry to send the deletes it encounters to, to minimise replication of effort but also ensure message delivery to all nodes.

          Result is we keep compaction logic exactly the same, and we retain approximately the same consistency guarantees we currently have.

          jbellis Jonathan Ellis added a comment -

          (Renaming this to Global Indexes, borrowing DynamoDB's term.)

          rstml Rustam Aliyev added a comment - - edited

          In addition to performance, one of the key advantages of application-maintained global indexes is flexibility. I think it's important to preserve that in built-in global indexes. A few cases I think are important to consider:

          1. Composite index. Global index can be based on more than one column.
          2. Range query on indexed elements. With a high-cardinality global index, it would be useful to allow range queries on the indexed elements to make the consecutive multiget efficient. For example, indexing time-series data by type and then looking up with ... TYPE="type1" and ID > minTimeuuid('2013-02-02 10:00+0000')
          3. Reverse key index. Should be able to define index clustering key (i.e. indexed elements) order (ASC, DESC). Helpful when used with range queries above. E.g.
            CREATE GLOBAL INDEX dpt_emp_idx ON employee(department_id DESC, name ASC); 
          4. Function based index. In this case, index is defined by transformation function. For example, lowercase(value) or arithmetic function like (field1 * field2).
          5. Storing data in index. Typically, global indexes have the following structure, where the values are nulls:
            "idx_table" {
               "index_value1" : {
                   "el_id1" : null,
                   "el_id5" : null,
                   ...
               }
            }
            

            However, sometimes it's efficient and convenient to keep some information in the values. For example, let's assume that the elements above contain tens of fields, but in 90% of cases the application uses only one of them, e.g. hash. In that case, it's efficient to scan the index and retrieve hash values directly from the index instead of doing an additional lookup to the original table. The above table would then look like:

            "idx_table" {
               "index_value1" : {
                   "el_id1" : "74335a7c9229...",
                   "el_id5" : "28b986fa29eb...",
                   ...
               }
            }
            

          Traditional RDBMSs support most of these index types. For function-based indexes we could create a bunch of functions in CQL3 (e.g. Math.*, LOWERCASE(), etc.) similar to other RDBMSs.

          Alternatively, we can achieve greater flexibility by storing optional Java 8 lambda functions. The lambda function will take the mutated row as input and return 2 values:

          1. non-empty set of indexes (required)
          2. map of id -> value which will be used to look up stored index values (optional). If an element is not found, null is stored.

          The CREATE INDEX statement has to define the produced index's CQL type and, optionally, the stored index values:

          CREATE GLOBAL INDEX account_by_email_idx ON accounts ( LAMBDA("row -> { return row.email.toLowerCase(); }") ) WITH INDEX_TYPE = {'text'};
          

          More examples:

          1. Lowercase email:
             row -> { return row.email.toLowerCase(); } 
          2. Distance between coordinates:
             row -> { return Math.sqrt((row.x1-row.x2)*(row.x1-row.x2) + (row.y1-row.y2)*(row.y1-row.y2)); } 
          3. Conditional index:
             row -> { return row.price > 0 ? "paid" : "free"; } 
          4. Indexes with values (item 5 above) may require some special return type (e.g. IndexWithValues). In the example below, the message length will be stored in the index:
             row -> { return new IndexWithValues(row.type, row.message.length()); } 

          Querying these indexes is another caveat. Consider the distance-between-coordinates example above - what would the SELECT statement for this index look like? With application-maintained global indexes, the application can just look up the index using a given value. The same applies to indexes with stored values.

          Without these, built-in global indexes will be very limited and, once again, application-maintained global indexes would remain the go-to solution.

          jbellis Jonathan Ellis added a comment -

          We have separate tickets for functional, partial, and compound indexes. Let's not boil the ocean all at once.

          mstump Matt Stump added a comment -

          After walking through the various options I can't see a hole in the logic that Benedict outlined in his April 01 comment, and I'm drawn to its simplicity. I'm starting on rev1 of an implementation. I'm going to leave out the delete traffic optimization he mentioned and instead focus on correctness. It might take me slightly longer than normal as I come up to speed on the codebase.

          benedict Benedict added a comment -

          There's a possible simplification of my proposed algorithm: if we start by forcing writes/reads to quorum if they touch a global index, we could simply proxy each write deterministically to one other replica, so long as we wait for both writes to complete before a single source/table/underlying replica reports success.

          We need to bikeshed nomenclature a bit as well: we already have the term "primary replica" reserved for regular replication. It will help if we can disambiguate between this and the source/underlying table replica(s). So we'd have primary table replica, secondary table replica, primary index replica, secondary index replica? Or something along those lines.

          tjake T Jake Luciani added a comment -

          Is the plan to have a single partition per index or will we split this across X buckets?

          jbellis Jonathan Ellis added a comment -

          Start with a single partition. We can look into configurable bucketing later, if necessary.

          pbailis Peter Bailis added a comment -

          Just to follow up regarding our conversations at the NGCC: is there any interest at this point in delivering any form of consistent index updates (e.g., beyond "eventually consistent" index entries), or is the primary goal right now simply to get basic global index functionality working?

          Also, FWIW, though I have yet to think through Benedict's second proposal, the "local CAS" approach makes a lot of sense from my perspective insofar as you're willing to tolerate RF-1 redundant (but, importantly, idempotent!) index invalidations. I think this will work especially well when, per T Jake Luciani's proposal, you start partitioning the secondary index servers.

          tjake T Jake Luciani added a comment -

          Peter Bailis I think initially we are looking to start with EC and once we have RAMP we will incorporate it.

          mfiguiere Michaël Figuière added a comment -

          Something that hasn't been mentioned so far in this thread is client-side Token Aware Balancing. So far, when a query relies on a Secondary Index, it doesn't have any partition key specified in its WHERE clause, which means the driver will fall back on the underlying balancing policy, which is DC-aware round robin by default in the DS Java Driver. That's the appropriate behavior, as there's no node that can be better than another as a coordinator in this situation.

          With Global Indexes, in order for the Driver to still be able to perform Token Aware Balancing, it'll need to be able to figure out which index will be used, which doesn't always seem trivial in the above examples, especially if several columns and/or indexes are involved in the WHERE clause. So here we might need to include extra information about it in the Prepare response message of the Native Protocol, if the indexes to be involved can be figured out at Prepare time.

          jbellis Jonathan Ellis added a comment -

          We can create a followup ticket for that, but I don't want the core functionality to block on it.

          mfiguiere Michaël Figuière added a comment - - edited

          That sounds reasonable indeed, as this is just an optimization that doesn't alter the overall semantics of the queries. I've created CASSANDRA-8517 to follow up on this point.

          carlyeks Carl Yeksigian added a comment -

          The work I have so far is on this branch; dtests are here.

          Syntax for creating a global index:

          CREATE GLOBAL INDEX ON foo (bar) DENORMALIZED (baz)
          

          A query against foo can now also be against bar, just like a normal 2i query would. Internally, that is translated into a new query on the index table. Only denormalized values can currently be selected.
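
          For instance (hypothetical foo schema, just to illustrate the syntax and query translation described above):

            CREATE TABLE foo (k int PRIMARY KEY, bar text, baz text);
            CREATE GLOBAL INDEX ON foo (bar) DENORMALIZED (baz);

            -- written like a 2i query, but internally rewritten into a
            -- single-partition read on the index table; only the denormalized
            -- column (baz here) can currently be selected
            SELECT baz FROM foo WHERE bar = 'some-value';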

          Things still left to do before this is ready:

          • When creating an index, need to build the index from the existing data.
            This is going to be very similar to CASSANDRA-8234
          • DeletionInfo is not handled
            This involves reading out all of the data which is going to be deleted and creating tombstones for the index
          • Collections, TTL’d columns are not handled
          • Denormalizing all the columns (DENORMALIZED (*))
            This is in the spec, but hasn't been implemented yet

          Right now, it is being handled in SP.mutateWithTriggers, instead of doing this in Mutation.apply. This was mainly so that the global index update code is separate from the rest of the write path, hopefully making it more concise and preventing interference with the write path when not in use.

          iamaleksey Aleksey Yeschenko added a comment -

          Oleg Anastasyev, given that you've done something similar before, can you have a look at the branch/share the common issues you had with your implementation?

          jbellis Jonathan Ellis added a comment -

          This is going to be very similar to CASSANDRA-8234

          Similar conceptually, but much simpler in practice since 8234 envisions arbitrary joins and subqueries that you need something like hadoop or spark for. Here we're just scanning a single table. Easy enough to send a message out to the cluster to emit inserts on the 2i table for their local data. (And yes, we're fine with taking the hit of doing this inefficiently for now. To optimize you could pick just one replica to do the repaired parts, and then everyone does the unrepaired.)

          Denormalizing all the columns (DENORMALIZED (*))

          IMO this should be the default, and WITH DENORMALIZED (...) syntax could optionally restrict it.

          slebresne Sylvain Lebresne added a comment -

          IMO this should be the default, and WITH DENORMALIZED (...) syntax could optionally restrict it.

          If you mean that CREATE GLOBAL INDEX on foo(bar) would be equivalent to denormalizing everything, then I might prefer being careful, in the sense that we may (emphasis on "may") want this to later mean creating the global index as a true index, without anything denormalized. So my suggestion would be to simply reject a CREATE GLOBAL INDEX that doesn't have a DENORMALIZED clause (so no default, in a way). Though I would agree that DENORMALIZED * should be there, as it's probably what we want to recommend as the de-facto default.

          jbellis Jonathan Ellis added a comment -

          Good point.

          I still like the extra WITH for consistency w/ our CREATE TABLE syntax tho.

          slebresne Sylvain Lebresne added a comment -

          I can't say no to consistency

          carlyeks Carl Yeksigian added a comment -

          Similar conceptually, but much simpler in practice since 8234 envisions arbitrary joins and subqueries that you need something like hadoop or spark for.

          I wanted to get the idea out that this initial implementation will be simpler and possibly less efficient because it isn't solving 8234, but that we can always change it later.

          I think adding the global index to the schema should be enough for all of the nodes to insert the values that they have. Either strategy will have some problems when a replica fails that was supposed to take care of a part of the range.

          m0nstermind Oleg Anastasyev added a comment - - edited

          Sorry for a slight delay. Here are my thoughts, hope you'll find them useful:

          1. Composite indexes are the most useful feature of GI. The majority of our GIs are composite, some with a different clustering order defined. And it is not much more work to implement composite indexes, by the way. Composite partition keys on the GI CF are also used to split otherwise wide partitions of a global index with popular and frequently changing values. These otherwise wide partitions suffer from too many range tombstones (obviously, on modification of an indexed value a range tombstone is generated in the global index CF; still, it surprises ppl).

          2. As I can see from the changes to CQL syntax and CFMetaData.java, compaction and compression props are copied from the base CF to the global index CF. GI CFs could have row sizes, update behaviour, reads and writes very different from their base CF, so specifying compaction and compression properties, as well as the others available in the CREATE TABLE WITH clause, could be useful.

          The syntax for GI with composite keys then could be eg:

          CREATE GLOBAL INDEX indexname ON baseCF( (partk1,partk2),clustkey1,... ) DENORMALIZED ... WITH <all the with properties of normal table>
          

          (I'd also suggest replacing the keyword "DENORMALIZED" with something more familiar to SQL ppl, like "INCLUDE", e.g. as in https://msdn.microsoft.com/en-us/library/ms190806.aspx )
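
          A purely illustrative instantiation of that syntax, adopting the INCLUDE spelling suggested above (the events table, columns, and option values are made up; none of this exists in the patch):

            CREATE GLOBAL INDEX events_by_type_idx
                ON events ( (event_type, bucket), created_at )
                INCLUDE (payload_hash)
                WITH compaction = {'class': 'LeveledCompactionStrategy'}
                 AND compression = {'sstable_compression': 'LZ4Compressor'};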

          3. I am not sure that forbidding creation of a global index on a column with an existing 2i is a good idea. We use 2 modes for a global index: right after the global index is created, only writes to the new index are activated. No reads from it are allowed while the base CF dataset is scanned and its data copied to the global index. If there is another global or 2i index available on the same columns which could be used for reads - it is used. After the build is complete, the operator can enable the just-built index for reads using an ALTER GLOBAL INDEX ENABLE statement, and disable the old indexes. This makes the transition to GI and changing the structure of a GI smoother from an operational perspective. In case something goes wrong, the operator just disables the new and re-enables the old indexes in no time. Applying the same write-only/read-write mode switch here could make the transition from 2i to GI easier for ppl. This feature also makes an on-the-fly rebuild of a GI possible, which could be useful until all bugs with inconsistent global index updates are fixed.

          4. The scan of the base CF's old data to fill the new global index consistently with the base CF is another tricky process, which I arrived at after several rounds of trial and error. It has no external dependencies and most work is performed locally on C* nodes. You may find it useful as well.
          It breaks into 6 stages:
          1. First of all, a new empty table to hold the index data is created.
          2. Index writes are started on all nodes, so new modifications start to fill the index. At this moment the new index is disabled for reading.
          3. C* nodes launch the primary range repair procedure on the base table to make sure all replicas of it are the same.
          4. C* nodes each scan their primary ranges of the data locally in the base table, filling the index memtable and preparing data to stream to other nodes in parallel.
          5. Then they stream the necessary data to other nodes, according to the partitioning schema and replica count.
          6. When streaming completes, the index is ready and is enabled for clients to read as a final step, either automatically or by operator command.

          Some smaller issues I found in Carl's branch are:

          1. Not sure how base table schema evolution is supported on a fully denormalized global index. If a column is added to the base table it must be added to the GI. Same with dropping a column from the base table, or dropping the base table itself.
          2. It looks like, due to the order of schema modifications in CreateGlobalIndexStatement.announceMigration (define the global index on the base CF and then create the GI CF itself) and DropGlobalIndex (drop the GI CF and then deregister it from the base table), there will be mutations to an unknown CF when the schema modifications have not been fully applied to all nodes of the cluster. I'd suggest creating the GI CF first, then registering it as a global index in the base table metadata.
          3. Not sure whether deletion of a row from the base table is implemented right. It seems like the current implementation of MutationUnit.oldValueIfUpdated interprets deletion as no change to the global index.
          4. StorageProxy.mutateAtomic, in the case of concurrent modifications of the base row, will produce inconsistent records in the global index, but as I understood it this is to be resolved later.

          benedict Benedict added a comment -

          Sorry for a slight delay.

          Unrelated to the topic at hand, but 4 days including the weekend is pretty prompt for such a thorough reply, beating the average for us core devs handily

          rstml Rustam Aliyev added a comment -

          +1 for replacing DENORMALISED with INCLUDE

          DynamoDB has similar capability which is referred as Projection (http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Projection.html) and there are 3 supported projection types: ALL, KEYS_ONLY and INCLUDE.

          carlyeks Carl Yeksigian added a comment -

          I pushed an updated branch here. From the last list, the only remaining outstanding issue is collection cells. I'm inclined to exclude those from the initial implementation.

          Oleg Anastasyev: these are all great suggestions, but we should add some of these as follow on tickets. It makes sense to change to INCLUDE, so I've replaced DENORMALIZED already, and I changed the order of the mutation statements (good catch!).
          Issues we should address in new tickets:

          1. Composite Indexes
          2. Changing index CF properties
          3. 2i and GI mixed mode
            I'm still not convinced that you should have 2i and GI on the same column, since they are for very different use cases, even in the transitional period
          4. For the initial build, I think there are probably a lot of ways to make it faster than what I have, but this is just a first cut
            An idea I've had is to create mutations from repaired sstables on a single replica; for unrepaired data, all the replicas would need to create them
          5. Currently, base table evolution is not handled, and any such modification will actually cause an IRE, but should be addressed
          jbellis Jonathan Ellis added a comment -

          Currently, base table evolution is not handled, and any such modification will actually cause an IRE, but should be addressed

          If this means I can't add new columns once I create a GI, I think we need to address this here and not in another ticket.

          (But I also think it's fine by saying that new columns in the base table are NOT automatically indexed. Or put another way, INCLUDE ALL means "all columns currently present" and not "all columns that may be added in the future.")

          carlyeks Carl Yeksigian added a comment -

          Looking at the changes I made, it is currently only disallowing dropping/altering/renaming the target column.

          Here's a list of the actions and how the GI could handle them (concrete statements are sketched after this list):

          • DROP
            • If the target column, either error and say to drop the GI first, or drop the GI along with the column
            • If an included column, the GI drops that column
          • ALTER should be handled the same way for the GI as for the base table
          • RENAME of the target is allowed, since we are storing the CF ID separately and don't have the same problem as 2i in CASSANDRA-5705; everything else is handled if it would be handled under the normal RENAME
          • ADD a column to a CF with a GI INCLUDE (*) should add the column to the GI as well
          • WITH should change the parameters of the GI CF as well
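
          To make that list concrete, a hedged sketch of the alterations in question, assuming a hypothetical base table t with a GI targeting t.v and including t.w:

          ALTER TABLE t DROP w;               -- included column: the GI drops w as well
          ALTER TABLE t ALTER v TYPE blob;    -- applied to the GI the same way as to t
          ALTER TABLE t ADD x int;            -- added to the GI only if it was created with INCLUDE (*)
          ALTER TABLE t WITH gc_grace_seconds = 864000;   -- property change also applied to the GI CF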
          philipthompson Philip Thompson added a comment -

          Currently it is impossible to do a query such as SELECT * FROM table WHERE globally_indexed_column = ? AND unindexed_column = ? ALLOW FILTERING. Carl Yeksigian suggests that should work. What other ALLOW FILTERING options should be valid? GI column AND 2i column? GI column and GI column?
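
          For reference, the query shapes in question, written against a hypothetical users table with a GI on email, a second GI on phone, a 2i on country, and age left unindexed:

          SELECT * FROM users WHERE email = ? AND age = ? ALLOW FILTERING;      -- GI column + unindexed column
          SELECT * FROM users WHERE email = ? AND country = ? ALLOW FILTERING;  -- GI column + 2i column
          SELECT * FROM users WHERE email = ? AND phone = ? ALLOW FILTERING;    -- GI column + GI column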

          jbellis Jonathan Ellis added a comment -

          I see. Sounds reasonable.

          jbellis Jonathan Ellis added a comment -

          On the one hand, all queries that would be valid w/o GI should still be valid – with similar caveats to 2i, i.e., filtering on 2 GI columns won't actually be faster than 1 GI and 1 non-indexed-but-included column, until we have actual multi-column indexes.

          On the other hand, GI introduces a new way to end up with a query that is much slower than expected – if you filter on a column that is not included in the GI, then you need to fall back to a seq scan on the base table.

          pateljay3001 Jay Patel added a comment -

          Glad to see lot of activities on this ticket recently.
          +1 for the write-only/read-write index switch. This can simplify multiple Ops tasks, including implementing "online" index rebuild.

          From the code, it looks like currently all de-norm/include columns are added as regular columns in the GI index table. Later on (as a different ticket?), how about allowing the user to define some include columns as "sort columns", which we can add as clustering columns in the index table to support sorting and range queries (on the sort columns)? However, the sort columns need to be defined upfront, and adding/removing sort columns can require rebuilding the index. Or is there any better idea to support ordering/range queries?
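
          A hedged sketch of the index-table layout this suggests, with one included column promoted to a clustering "sort column" (all names hypothetical):

          CREATE TABLE products_by_category_idx (
              category   text,     -- indexed value (partition key of the index table)
              price      decimal,  -- included column defined up front as a sort column
              product_id uuid,     -- partition key of the base table
              name       text,     -- ordinary included column
              PRIMARY KEY (category, price, product_id)
          );

          SELECT product_id, name FROM products_by_category_idx
              WHERE category = 'books' AND price < 20;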

          We've built a similar generic GI framework at the client side which supports ordering/range queries/multi-columns/collections/etc. But eventually it makes sense to use C* GI capabilities once they are available. To contribute to these GI efforts, should we wait until the first cut gets merged to master, or can we fork from ticket/6477-2?

          Question: Regarding Benedict's concurrent update suggestion (01/Apr/14), I just want to confirm that it will work for concurrent updates in the case of a multi-DC setup with local quorum. I think the index may be inaccurate initially, but will eventually get corrected as the primary data replicates across data centers. Currently, we're solving this concurrency & stale-index issue with async read-repair for the index (sort of similar to what C* does to fix replicas).

          Thanks!

          carlyeks Carl Yeksigian added a comment -

          I've pushed up a few updates to the branch. There was an issue with clustering columns which Philip Thompson pointed out, which has been resolved; an index on a clustering column works properly now.

          Currently, ALLOW FILTERING queries are disallowed. We can change that behavior later, but it should be a follow-on ticket. Also disabled is any index involving collections, either as the target or as an included column. This should also be added, but it will require a substantial rewrite, and it makes sense to hold off until the major rewrite after CASSANDRA-8099.

          Jay Patel: I think it makes sense to wait until this gets committed. Especially with the storage engine refactor, there may come significant changes to the global indexes, so any work would need to be redone later.

          It currently doesn't have any sort of read-repair because it doesn't work at the level of mutation applications. It also doesn't have a proper repair built in; we can use the Builder from the creation part to add a rebuild, but we would have to drop the data currently stored to make sure we aren't returning previous results.

          slebresne Sylvain Lebresne added a comment - - edited

          Let's recall that the main problem here is how to keep the index consistent with the original table. And that's typically a problem if, say, 2 clients simultaneously update the same column to 2 different values: we need to make sure that we end up with only whichever of those updates wins in the index.

          Since for global indexes we know we'll have to do a read before write, what has been suggested here is to do that on replicas, at which point we can serialize concurrent updates locally to make sure things end up consistent. Now, we could do that on every replica but this has a few downsides:

          1. every replica will update the index and we'll do RF times too many index updates.
          2. once a replica has done its read and computed the update for the data table and the index table, we want to put both of those in a batch mutation to avoid inconsistencies in case of failures. This makes writes more expensive and thus the duplication of work all that less desirable.

          To avoid that duplication, one possibility is to reuse the same technique we use for counters: have the coordinator push the update to one random replica, and have that one replica do the read before write and push everything (data and index updates) through a batchlog mutation.

          The currently linked branch doesn't do all of that yet so it'll have to be added before we can commit this.

          On top of this, I think that we'll need 2 other things that are not handled yet by the branch:

          • being able to index tables that have collections. Indexing collections, which is also not yet supported, can probably be left to a follow-up ticket.
          • make sure we hook the index rebuild with streaming so that when the data table is repaired we do repair the index too.

          Once those have been tackled, I think we can call it good for an initial version and let other improvements to follow-ups.

          carlyeks Carl Yeksigian added a comment -

          For counters, we can push to any random replica because we have a shard per replica and sum them at read time. For the global index update, we're trying to make sure that simultaneous updates are applied to the table and index in a serializable order. As such, it wouldn't really work to push the update to any random replica; if two replicas get pushed competing updates, we end up in the same situation. Instead, we would need to have a single replica that we push all index updates through, which would break if the replica dies for any reason.

          Also, for the complex tombstones, it would be very difficult to determine whether a grouping of updates indicates simultaneous updates or just a series of updates that happened to use the same values. For instance, if the user does 24 => 25 => 26 => 24 => 26 => 25, we would get a bunch of complex tombstones 24 => 25, 25 => 26, 26 => 24, 24 => 26, 26 => 25. When cleaning them up, it isn't going to be clear whether we have the case as described before where 24 is simultaneously updated to 25 and 26, or whether it was a loop of updates as described here.

          jbellis Jonathan Ellis added a comment - - edited

          we would need to have a single replica that we push all index updates through

          You're right, that doesn't work.

          it would be very difficult to determine whether a grouping of updates indicates simultaneous updates or just a series of updates that happened to use the same values

          I don't think that makes it not work. Each of those values is a separate partition. I think we get the right answer if we just take the highest-timestamped value for each partition and delete any "ghost" entries that may have been missed.
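
          To spell that out with the 24/25/26 example (hypothetical tables, with timestamps shown explicitly): suppose the base row starts at v = 24 and two clients race.

          UPDATE t USING TIMESTAMP 101 SET v = 25 WHERE k = 1;   -- also produces index entry (v=25, k=1) @ 101
          UPDATE t USING TIMESTAMP 102 SET v = 26 WHERE k = 1;   -- also produces index entry (v=26, k=1) @ 102

          -- The two entries live in different index partitions (v = 25 and v = 26).
          -- Taking the highest-timestamped value for the base row keeps (v=26, k=1);
          -- the stale entry is removed as a ghost, e.g. with its own timestamp so it
          -- cannot clobber a newer legitimate re-insert:
          DELETE FROM t_by_v USING TIMESTAMP 101 WHERE v = 25 AND k = 1;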

          slebresne Sylvain Lebresne added a comment -

          if two replicas get pushed competing updates, we end up in the same situation.

          You're right, I brain farted on that one.

          it would be very difficult to determine whether a grouping of updates indicates simultaneous updates

          It shouldn't be because we'll have the timestamps to decide the order.

          iamaleksey Aleksey Yeschenko added a comment -

          As discussed at NGCC, it might be a good idea to rename global indexes to materialized views, which seems to be a more appropriate, and less confusing, title.

          jjordan Jeremiah Jordan added a comment -

          +1 from me for calling it "Materialized Views"

          jbellis Jonathan Ellis added a comment -

          Materialized views is a much broader feature. I think users will be disappointed with the limitations if we try to call it that.

          (And DynamoDB has been using the GI term for a year and a half now, I'm not in favor of forking terminology without a clearly superior alternative.)

          iamaleksey Aleksey Yeschenko added a comment -

          I see it mostly used for the automated denormalization, tbh, and not for the KEYS part, so I'd argue that materialized views is a more fitting name.

          Plus, with the upcoming SASI changes, there would be a huge gap between what local indexes/materialized views offer, in terms of expressiveness, and calling both 'indexes' would bring on more confusion in the long term.

          Also, I personally couldn't care less about the names used by DDB. I'd rather stick closer to what SQL has, since people coming from SQL world, and not people coming from DynamoDB, are our target audience.

          jbellis Jonathan Ellis added a comment -

          calling both 'indexes' would bring on more confusion in the long term

          Disagreed. RDBMS users have lived with hash indexes vs btree indexes vs bitmap indexes with different abilities for a long time. (Notably hash indexes have exactly the limitations of global indexes.)

          MV otoh has a completely different feature set that GI doesn't even start to offer.

          Also, I personally couldn't care less about the names used by DDB. I'd rather stick closer to what SQL has, since people coming from SQL world, and not people coming from DynamoDB, are our target audience.

          My point is that whether you come from SQL or NoSQL, MV is the wrong choice.

          jbellis Jonathan Ellis added a comment -

          Here is the difference: the definition of MV is to materialize an arbitrary query. That is not what we offer here.

          slebresne Sylvain Lebresne added a comment -

          As a data point, my own preference would be to call them "denormalized views". As my preference would go to never do the "only keys", I can agree that calling them "index" is a bit misleading. But I can also agree that "materialized views" might confuse SQL users that might expect more from them than what we'll offer. So granted "denormalized views" doesn't reference an existing term but I see that as a feature. Plus denormalization is really what we're doing there so ...

          jbellis Jonathan Ellis added a comment -

          Denormalized vs materialized is just using an unusual synonym. The problematic term is the "view" part.

          (Call them materialized indexes or denormalized indexes if you like, but don't call them views.)

          iamaleksey Aleksey Yeschenko added a comment -

          As a data point, my own preference would be to call them "denormalized views". As my preference would go to never do the "only keys", I can agree that calling them "index" is a bit misleading. But I can also agree that "materialized views" might confuse SQL users that might expect more from them than what we'll offer. So granted "denormalized views" doesn't reference an existing term but I see that as a feature. Plus denormalization is really what we're doing there so ...

          Fair enough. I'd +1 that.

          jjordan Jeremiah Jordan added a comment -

          My main issue is that "Global Index" means nothing to me, and I work in this space. So I think using the word materialized or denormalized in the name would do a better job of making clear what they are. I don't really care what other words are used in the name.

          That being said, I don't see a problem with VIEW. While I agree that a "VIEW" in the SQL world has more utility, so does "SELECT". Seems to me we give you a subset of VIEW, just like we give you a subset of SELECT. I don't think it is confusing to not be able to materialize an arbitrary query that has joins and stuff in it, as we don't let you do that stuff in normal queries. And if we flesh this out over time with functions, composites, partial, etc, you get closer and closer to what you can do with a traditional VIEW.

          jkrupan Jack Krupansky added a comment -

          Why not call the feature "high cardinality index" since that's the use case it is focused on, right?

          My personal preference would be to have a "cardinality" option clause with option values like "low", "medium", "high", and "unique". The default being "low". A global index would be implied for "high" and "unique" cardinality.

          iamaleksey Aleksey Yeschenko added a comment -

          Why not call the feature "high cardinality index" since that's the use case it is focused on, right?

          As I see it, the focus is on making denormalization trivial, not on indexing - which is why I agree with Sylvain that we should drop the KEYS-only part, and why 'index' in the name is misleading.

          jbellis Jonathan Ellis added a comment -

          While I agree that a VIEW in the SQL world has more utility, so does SELECT

          Correct, but a view should still be able to represent whatever SELECT can! It's not reasonable to expect it to do more, but it's absolutely reasonable to expect it to match, because that's the definition.

          My personal preference would be to have a "cardinality" option clause with option values like "low", "medium", "high", and "unique".

          I don't think we're at the point where we can afford that high a level of abstraction.

          jbellis Jonathan Ellis added a comment -

          I'm warming up to the idea of calling it MV but only IF we're committed to fleshing it out to match SELECT.

          To start that can match our current envisioned functionality:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, x, y, z FROM users
          PRIMARY KEY (age, user_id)

          but next we need to add support for WHERE and UDF:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, substring(phone_number, 3) AS area_code, y, z FROM users
          WHERE area_code in (512, 513, 514)
          PRIMARY KEY (age, user_id)

          Building the view should take advantage of local indexes where applicable.

          Ideally we would support aggregation in MV as well. Not sure if that is feasible.

          iamaleksey Aleksey Yeschenko added a comment -

          Aggregation is probably unfeasible. Having the rest of it would be amazing. +1 to that.

          jbellis Jonathan Ellis added a comment -

          One advantage to MV is that people are somewhat more used to the MV lagging the underlying data. PG goes so far as requiring you to manually issue refresh commands. I think that makes them unusable, so I only mention that to illustrate an extreme – with that precedent, having us say "MV are eventually consistent" sounds quite reasonable!

          (Oracle has had self-updating MV forever. I'm not actually sure what kind of transactionality guarantees you get there.)

          jjirsa Jeff Jirsa added a comment -

          As an end user, MV + matching SELECT described above looks appealing, and would match what I would expect from hearing the name and being familiar with VIEWs in other databases.

          jkrupan Jack Krupansky added a comment -

          Oracle has lots of options for the REFRESH clause of the CREATE MATERIALIZED VIEW statement:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm

          Notes on that syntax:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm#i2064161

          Full MV syntax:
          http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm

          You can request that a materialized view be automatically refreshed when the base tables are updated using the "REFRESH ON COMMIT" option. The update transaction pauses while the views are updated - "Specify ON COMMIT to indicate that a fast refresh is to occur whenever the database commits a transaction that operates on a master table of the materialized view. This clause may increase the time taken to complete the commit, because the database performs the refresh operation as part of the commit process."

          You can also refresh on time intervals, on demand, or no refresh ever. Originally MV was known as SNAPSHOT - a one-time snapshot of a view of the base tables/query.

          Oracle has a FAST refresh, which depends on a MATERIALIZED VIEW LOG, which must be created for the base table(s). Otherwise a COMPLETE refresh is required.

          jkrupan Jack Krupansky added a comment -

          It would be helpful if someone were to update the description and primary use case(s) for this feature.

          My understanding of the original use case was to avoid the fan out from the coordinator node on an indexed query - the global index would contain the partition keys for matched rows so that only the node(s) containing those partition key(s) would be needed. So, my question at this stage is whether the intention is that the initial cut of MV would include a focus on that performance optimization use case, or merely focus on the increased general flexibility of MV instead. Would the initial implementation of MV even necessarily use a GI? Would local vs. global index be an option to be specified?

          Also, whether it is GI or MV, what guidance will the spec, doc, and training give users as to its performance and scalability? My concern with GI was that it works well for small to medium-sized clusters, but not with very large clusters. So, what would the largest cluster that a user could use a GI for? And also how many GI's make sense. For example, with 1 billion rows per node, and 50 nodes, and a GI on 10 columns, that would be... 1B * 50 * 10 = 500 billion index entries on each node, right? Seems like a bit much for a JVM heap or even off-heap memory. Maybe 500M * 20 * 4 = 40 billion index entries per node would be a wiser upper limit, and even that may be a bit extreme.

          mbroecheler Matthias Broecheler added a comment -

          I think the discussion around materialized views (which I would love to see in C* at some point) is distracting from what this ticket is really about: closing a hole in the indexing story for C*.

          In RDBMS (and pretty much all other database systems), indexes are used to efficiently retrieve a set of rows identified by their column values in a particular order, at the expense of write performance. By design, C* builds a 100% selectivity index on the primary key. In addition, one can install secondary indexes. Those secondary indexes are useful up to a certain selectivity %. Beyond that threshold, it becomes increasingly more efficient to maintain the index as a global distributed hash map rather than a local index on each node. And that's the hole in the indexing story, because those types of indexes must currently be maintained by the application.

          I am stating the obvious here to point out that the first problem is to provide the infrastructure to create that second class of indexes while ensuring some form of (eventual) consistency. Much like with 2i, once that is in place one can utilize the infrastructure to build other things on top - including materialized views which will need this to begin with (if the primary key of your materialized view has high selectivity).

          As for nomenclature, I agree that "global vs local" index is a technical distinction that has little to no meaning to the user. After all, they want to build an index to get to their data quickly. How that happens is highly secondary. Initially, it might make sense to ask the user to specify the selectivity estimate for the index (defaulting to low) and for C* to pick the best indexing approach based on that. In the future, one could utilize sampled histograms to help the user with that decision.

          iamaleksey Aleksey Yeschenko added a comment -

          Much like with 2i, once that is in place one can utilize the infrastructure to build other things on top - including materialized views which will need this to begin with (if the primary key of your materialized view has high selectivity).

          Sorry for a short reply. Just wanted to note that I see this being the other way around. Full-on materialized views are the infrastructure for 'global indexes', not the other way around.

          mbroecheler Matthias Broecheler added a comment -

          Sorry for a short reply. Just wanted to note that I see this being the other way around. Full-on materialized views are the infrastructure for 'global indexes', not the other way around.

          Do you mean that from an implementation perspective? Conceptually, I would argue it the other way around. When you build support for full-on materialized views, you still have to deal with the decision: is the primary key high or low selectivity, i.e. do I maintain this view locally or globally. And then, depending on that decision, you would utilize the 2i infrastructure (or something like it) or some other infrastructure for global maintenance of derived records.

          Plus, with full-on materialized views you are mixing in other challenges, like dealing with multiple columns (e.g. CASSANDRA-5402).
          So, conceptually at least, the initial atomic problem is how to consistently maintain derived row records on remote nodes. That would be enough to build global indexes on a single column. From there, this could be extended to full-on materialized view maintenance.

          iamaleksey Aleksey Yeschenko added a comment -

          Do you mean that from an implementation perspective?

          Yes.

          carlyeks Carl Yeksigian added a comment -

          At NGCC, there was some discussion about how we are going to handle eventual consistency in this design.

          Benedict's proposal earlier in this ticket, where each data replica performs a local read-before-write and pushes out the value to the index replicas, is the one which provides eventual consistency guarantees in line with users' expectations. In order to make sure that the values are consistent in the GI, it will also need to use a batch log at quorum to ensure that the update eventually gets applied.

          Here is the process that GI will take on mutation:

          • If the mutations will be affected by global indexes (either the target or an included column), coordinator creates batch log which will wait for a quorum to ack mutations before being deleted
          • Coordinator pushes mutations out to data replicas
          • Data replica checks if mutation touches global index
          • Data replica takes lock on global index row in order to make sure that the generated mutations are consistent
          • Data replica creates index mutation, blocks until that is complete
          • Data replica applies data mutation, acks mutation

          The ordering of the index mutation before the data mutation means that if we fail at the index mutation, we will generate the same values the next time this mutation is applied.

          We will send to exactly one index replica: the one whose position in the ring relative to the index token matches our own position relative to the base token.

          jbellis Jonathan Ellis added a comment -

          A couple questions:

          1. What is the thinking behind using a customized batchlog vs the existing one?
          2. Is each replica making GI updates, or just one?
          3. What do we do to GI on repair of the base data table?
          carlyeks Carl Yeksigian added a comment - - edited
          1. It is going to be the same mechanism, but we don't want to use the same consistency as what the insert is. This way, we can ensure that at least one node has seen all of the updates, and thus we can generate the correct tombstone based on the previous values; we are trying to make the dependency between the data table and the index table redundant, so we need to make sure a quorum is involved in the write
          2. Each replica makes a GI update independently, based on the data that it has, which means that we might issue updates for an older update that hasn't made it to all of the replicas yet. To cut down on the amount of work that the indexes do, a pretty easy optimization is to just send the index mutation to the index replica that the data node will wait on instead of sending them to all of the index replicas
          3. If we ever get into a situation where we have data loss in either the base table or the index table (both would likely go together), we would really need to run a rebuild, since there is no guarantee that extra data wouldn't be present in the index which isn't in the data table. Otherwise, we can repair the data and index tables independently, so that a repair issued on the data table should also repair all of the global index tables
          slebresne Sylvain Lebresne added a comment -

          I'm warming up to the idea of calling it MV but only IF we're committed to fleshing it out to match SELECT.

          +1. On top of having a clear path for added flexibility, having the underlying table directly exposed makes it clear that you can only access what you denormalize (this is obvious for an MV, less so with a GI) and simplifies the read path (the CQL parts of the read path).

          jbellis Jonathan Ellis added a comment -

          It also means you don't give up token-aware routing (CASSANDRA-8517).

          iamaleksey Aleksey Yeschenko added a comment -

          having the underlying table directly exposed makes it clear that you can only access what you denormalize (this is obvious for an MV, less so with a GI) and simplifies the read path (the CQL parts of the read path).

          That's one of the main reasons I want this to be fully MV, too. So, reiterating my +1.

          jkrupan Jack Krupansky added a comment -

          Still waiting for an updated description for the ticket. In particular, what specific use cases is this feature designed to handle well, and is it definitively expert-only, or will there be use cases that are safe for normal users. The key thing (ha ha!) is whether this feature will provide capabilities to make it much easier for people to migrate from SQL to Cassandra in terms of the denormalization process, and do it in a way that people can pick up easily in Data Modeling 101 training. A couple of examples would help a lot - like test cases.

          philipthompson Philip Thompson added a comment -

          We have some basic test cases written, but now that the plan is to do full MV instead of GI, those are all pretty stale, and probably won't help you.

          jkrupan Jack Krupansky added a comment -

          Is it fair to say that the primary technique for using this feature is to have one base table and n views of that table, each with a different selection of the base columns as the primary key of the view, with all rows selected, possibly projected differently but with different keys?

          Would it also be sensible to select a subset of rows? Although that might confuse some users who might think it would give them sophisticated ad hoc queries when in fact the query column values are fixed. For example, select all rows for a specific state. In this way, it doesn't offer what a global index would offer.

          jkrupan Jack Krupansky added a comment -

          Back to the original description: will the revised MV approach address the high-cardinality issue? That may depend on what guidance the spec offers for how data modelers should set up the primary key columns in terms of partition (or routing!) columns vs. clustering columns.

          Is the basic concept that although the selected rows might be scattered across multiple nodes in the base table, the goal is that they would cluster together on a single node for the MV table based on careful specification of partition key columns in the MV?

          carlyeks Carl Yeksigian added a comment -

          MVs allow us to route queries on high-cardinality non-partition columns to a single node rather than having to fan out on read. The reason this was renamed is that we want to avoid a fan-out after we get back the set of keys, so we store the data we want to query alongside the keys, which makes this closer to a materialized view than a global index. In addition, we want to make sure users understand why they can't select arbitrary columns from the table: they will only be able to use the columns they specify when creating the MV.

          We also want to take something that is done by many different users currently, which is denormalizing the data into new tables, and bring that in so that we can make sure that the edge cases are handled properly, and make it easier for users to model their data.

          Is it fair to say that the primary technique for using this feature is to have one base table and n views of that table

          Yes, as many tables as the users currently create and use themselves, they should be able to create materialized views for.

          Would it also be sensible to select a subset of rows?

          This won't be in the initial version, but it is something we will probably add later, along with richer SELECT statement support.
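
          As a concrete illustration of the "one base table, n views" pattern described above, here is a sketch using the CREATE MATERIALIZED VIEW syntax proposed later in this ticket; the table, view, and column names are hypothetical, not taken from the patch:

          CREATE TABLE users (
              user_id uuid PRIMARY KEY,
              name text,
              state text,
              birth_year int
          );

          CREATE MATERIALIZED VIEW users_by_state AS
          SELECT state, user_id, name, birth_year FROM users
          PRIMARY KEY (state, user_id);

          CREATE MATERIALIZED VIEW users_by_birth_year AS
          SELECT birth_year, user_id, name, state FROM users
          PRIMARY KEY (birth_year, user_id);

          -- each view is partitioned on its own key, so a read such as
          SELECT name, birth_year FROM users_by_state WHERE state = 'TX';
          -- can be served by a single replica set instead of fanning out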

          jkrupan Jack Krupansky added a comment -

          When exactly would population of the MV occur? What refresh options would initially be supported? Would population/refresh begin instantly when the MV is created, by default, or would an explicit command be required to begin population? Earlier I linked to the Oracle doc on MV, so a comparison to Oracle for refresh options might be nice, especially for users migrating from Oracle. Where would the state of refresh be stored, and how can a user monitor it? On each node of the base table?

          PostgreSQL doesn't seem to have as many options:
          http://www.postgresql.org/docs/9.3/static/sql-creatematerializedview.html

          With RF>1, which of the nodes containing a given token would push an update to the MV? All of them? Presumably the push can be token-aware, so that each push only goes to RF=n nodes based on the PK of the MV insert row. Would a consistency level be warranted for the push? Would there be hints as well? And repair of an MV if the rate of updates of the base table overwhelms the update bandwidth of the (many) MVs for the base table?

          Any thoughts on throttling of the flow of updates from other nodes so that population of a MV does not overwhelm or interfere with normal cluster operation? What default, and what override? What would be a reasonable default, and what would be best practice advice for a maximum?

          brianmhess Brian Hess added a comment -

          I really don't need to re-ignite the nomenclature debate, but another name that might be more fitting with how we've talked about these tables in the past is DB2's name: Materialized Query Tables
          Anyone like or hate that name more?

          rustyrazorblade Jon Haddad added a comment - - edited

          Not a fan of materialized query tables personally. We frequently refer to the various tables that people need to create to query their data as materialized views, and people seem to understand immediately.

          carlyeks Carl Yeksigian added a comment -

          Pushed a new branch which includes the materialized view syntax Jonathan Ellis proposed:

          CREATE MATERIALIZED VIEW users_by_age AS
          SELECT age, user_id, x, y, z FROM users
          PRIMARY KEY (age, user_id)
          

          The branch also includes the consistency updates proposed earlier in this ticket, as well as initial support for collections.
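
          For context, a sketch of a base table this example view could be defined against, and a read the view would serve; the base table definition below is an assumption (only the column names come from the example above):

          CREATE TABLE users (
              user_id uuid PRIMARY KEY,
              age int,
              x text,
              y text,
              z text
          );

          -- with the view in place, reads can be keyed by age instead of user_id
          SELECT user_id, x, y, z FROM users_by_age WHERE age = 33;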

          Main outstanding issues:

          • When overloaded with inserts to a single partition key in the base, tombstones aren't always properly generated for the view; this inconsistency doesn't get corrected with more time since repair doesn't check view versus base
          • An insert to a collection produces a range tombstone which isn't translated properly against the view; only range tombstones covering the whole row are handled
          jbellis Jonathan Ellis added a comment -

          When overloaded with inserts to a single partition key in the base, tombstones aren't always properly generated for the view

          Because of a race on index maintenance, or because we're load shedding?

          carlyeks Carl Yeksigian added a comment -

          Because of a race on index maintenance, or because we're load shedding?

          It's a race - the base mutation shouldn't be applied if the view mutations (and tombstones) aren't. It is supposed to block until the view mutation is applied before applying the base mutation; a lock is being held by the mutating thread, so we shouldn't be doing other mutations at the same time.

          brianmhess Brian Hess added a comment -

          Can someone clarify whether, based on the current implementation, the Materialized View (or some of the Materialized Views, if there are more than one) can contain the mutation but the base table does not - or vice versa? Can you clarify what happens in each of the failure scenarios? It will be very important for us to know for sure when we discuss this with users/customers.

          carlyeks Carl Yeksigian added a comment -

          Can someone clarify whether, based on the current implementation, the Materialized View (or some of the Materialized Views, if there are more than one) can contain the mutation but the base table does not - or vice versa?

          That's something that is still under development, hence the comments above that it is an outstanding issue. We're trying very hard to ensure that the materialized view that we ship is very unlikely to have a mutation in the base table which isn't in the view.

          jkrupan Jack Krupansky added a comment -

          1. Has a decision been made on refresh modes? It sounds like the focus is on "always consistent", as opposed to manual refresh or one-time without refresh or on some time interval, but is that simply the default, preferred refresh mode, or the only mode that will be available (initially)?

          2. What happens if an MV is created for a base table that is already populated? Will the operation block while all existing data is propagated to the MV, or will that propagation happen in the background (in which case, is there a way to monitor its status and completion?), or is that not supported (initially)?

          carlyeks Carl Yeksigian added a comment -

          1. No decision has been made on whether other refresh modes will be added, but the focus has only been on an eventually consistent mode.
          2. The build happens in the background. In the under development branch, there is no way to monitor the progress other than looking at the table on each node which stores the MV build progress.

          carlyeks Carl Yeksigian added a comment -

          Update on this ticket:

          The consistency issue was caused by not using the batch log for the materialized view replica mutations. This meant that we weren't guaranteed that the materialized view would get all of the updates, including tombstones, which is why there were additional entries in the MV.

          Adding this fixed the consistency issue, but it lengthened the time to execute a materialized view mutation, which exacerbated another issue: lock contention for the partition. Because we need to make sure that neither the MV nor the base table is updated until we have finished figuring out the new mutations, trying to acquire that lock also backs up the mutation stage and causes the batchlog mutations to fail to run. Much like counters have their own stage, we've moved MV mutations to their own stage, as well as batchlog mutations. We also made the tasks reschedule themselves in the stage if they are waiting for a lock and cannot acquire it, so that the stage doesn't deadlock.

          These changes have improved MV so that we can overwrite a single key many times and get a consistent result back at the end of the run. We have a dtest which tests that scenario.

          Collections included in the view are properly handled now as well, so the range tombstones are created for the view properly; all of the types have now been verified by a unit test.
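
          For instance (hypothetical names), a view carrying a collection column from its base table, where an overwrite of the collection generates the range tombstone that now has to be translated for the view:

          CREATE TABLE users (
              user_id uuid PRIMARY KEY,
              state text,
              emails set<text>
          );

          CREATE MATERIALIZED VIEW users_by_state AS
          SELECT state, user_id, emails FROM users
          PRIMARY KEY (state, user_id);

          -- overwriting the whole collection writes a range tombstone over its old cells,
          -- which must be applied to the corresponding view row as well
          UPDATE users SET emails = {'a@example.com'} WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;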

          The next steps are to rebase this on top of 8099 for 3.0.

          Many thanks to T Jake Luciani for all of his help looking into these problems.

          jbellis Jonathan Ellis added a comment -

          > We also made the tasks reschedule themselves in the stage if they are waiting for a lock and cannot acquire it, so that the stage doesn't deadlock.

          Do you mean "so the stage can make progress on other updates while waiting" or do you mean actual deadlock?

          tjake T Jake Luciani added a comment -

          Do you mean "so the stage can make progress on other updates while waiting" or do you mean actual deadlock?

          Correct, so other mutations can make progress. If you filled up the mutation queue with mutations on the same key you would virtually deadlock all others.

          tjake T Jake Luciani added a comment -

          One minor requirement that wasn't spelled out is support for composite primary keys in the MV table.

          carlyeks Carl Yeksigian added a comment -

          Branch which is based off 8099 branch.

          T Jake Luciani has also added support for composite partition keys in the MV table.

          Joshua McKenzie - this is now ready to start review.

          aboudreault Alan Boudreault added a comment - - edited

          Carl Yeksigian Here is a small script to reproduce the issue discussed on HipChat: test-view-data.sh. Give that script a try and let me know if you can also reproduce the issue. You may have to run the script 2-3 times before it reproduces; I noticed that sometimes the count is OK.

          For the record, the issue is that my materialized view is not synchronized properly with the data in my main table. There are no hints or batchlog entries recorded.

          Output of the issue:

          MAIN TABLE COUNT
          Consistency level set to ALL.
          
           count
          -------
               8
          
          (1 rows)
          
          MV TABLE COUNT - node1
          Consistency level set to ALL.
          
           count
          -------
               7
          
          (1 rows)
           MAIN TABLE DATA
          Consistency level set to ALL.
          
           id | v
          ----+---
            5 | 5
            1 | 1
            8 | 8
            2 | 2
            4 | 4
            7 | 7
            6 | 6
            3 | 3
          
          (8 rows)
          
          MV TABLE DATA
          Consistency level set to ALL.
          
           v | id
          ---+----
           1 |  1
           8 |  8
           2 |  2
           4 |  4
           7 |  7
           6 |  6
           3 |  3
          
          (7 rows)
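
          For reference, a minimal schema consistent with the output above; this is inferred from the id/v columns shown and is not the actual contents of test-view-data.sh:

          CREATE TABLE ks.t (
              id int PRIMARY KEY,
              v int
          );

          CREATE MATERIALIZED VIEW ks.t_by_v AS
          SELECT v, id FROM ks.t
          PRIMARY KEY (v, id);

          -- the discrepancy above: id = 5 exists in the base table,
          -- but the corresponding view row (v = 5) is missing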
          
          
          aboudreault Alan Boudreault added a comment -

          While testing, I noticed that if we drop a column that is used by a materialized view (PK), the view is dropped silently. It looks like it's the desired behavior since a log entry is written saying: " MigrationManager.java:381 - Drop table 'ks/users_by_state'".

          I just wanted to raise a suggestion: would it be better (and possible) to force the user to drop the materialized view explicitly before doing this operation, rather than dropping the MV silently? Not a strong opinion here; I'm just thinking it could be annoying for users to discover that some MVs have disappeared.

          carlyeks Carl Yeksigian added a comment -

          This is mainly historic from when it was going to be a global index, since 2i doesn't require dropping the indexes first. I'd be in favor of changing it back to not allowing dropping columns which are part of a materialized view. The same would apply to dropping a table which has materialized views on it.
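
          A sketch of the restriction being proposed; the names are illustrative (users_by_state appears in the log line quoted above) and the error text is made up, not taken from the patch:

          CREATE MATERIALIZED VIEW users_by_state AS
          SELECT state, user_id FROM users
          PRIMARY KEY (state, user_id);

          ALTER TABLE users DROP state;
          -- would be rejected, e.g. "Cannot drop column state: it is referenced by materialized view users_by_state"

          DROP TABLE users;
          -- likewise rejected while users_by_state still exists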

          Looking at the new node issue now.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          +1 on the "don't allow dropping columns that are part of a MV". While we're not an RDBMS, the parallel to treatment of FK's on a relational schema and dropping tables makes sense and should be the least surprising thing for users.

          aboudreault Alan Boudreault added a comment -

          +1 to the drop table too. make sense.

          iamaleksey Aleksey Yeschenko added a comment -

          That seems a bit too strict. I don't see why we wouldn't allow dropping a column that's not in a primary key in any of the MVs.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          I think the problem comes in when you have a complex data model with transparent views of a root data source; I could easily see someone dropping a column without realizing there was a specific MV w/that data in it that really needed it.

          A decent alternative may be introducing the CASCADES keyword to indicate that you're ok with it taking out the MV columns, similar to postgres and others.

          iamaleksey Aleksey Yeschenko added a comment -

          Do we have ALTER MV in this patch? I guess if we cannot (yet) remove columns from an MV, then the restriction on the base table makes sense, temporarily.

          carlyeks Carl Yeksigian added a comment -

          There is no ALTER MV in the patch, but I've created an issue for it (CASSANDRA-9736). I'll make the changes so that you can't drop a column or a table depended on by a MV.

          chbatey Christopher Batey added a comment - - edited

          I got a lot of these in my log when playing with the branch, however the views I created worked fine.

          WARN  [CompactionExecutor:111] 2015-07-07 10:57:09,023 MaterializedViewBuilder.java:189 - Materialized View failed to complete, sleeping 5 minutes before restarting
          org.apache.cassandra.exceptions.InvalidRequestException: Missing mandatory PRIMARY KEY part host_id
                  at org.apache.cassandra.cql3.statements.RequestValidations.invalidRequest(RequestValidations.java:199) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.RequestValidations.checkTrue(RequestValidations.java:63) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.RequestValidations.checkNotNull(RequestValidations.java:140) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.buildPartitionKeyNames(ModificationStatement.java:395) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.getMutations(ModificationStatement.java:752) ~[main/:na]
                  at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:727) ~[main/:na]
                  at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:293) ~[main/:na]
                  at org.apache.cassandra.db.SystemKeyspace.setMaterializedViewBuilt(SystemKeyspace.java:453) ~[main/:na]
                  at org.apache.cassandra.db.SystemKeyspace.finishMaterializedViewBuildStatus(SystemKeyspace.java:472) ~[main/:na]
                  at org.apache.cassandra.db.view.MaterializedViewBuilder.run(MaterializedViewBuilder.java:174) ~[main/:na]
                  at org.apache.cassandra.db.compaction.CompactionManager$13.run(CompactionManager.java:1364) [main/:na]
          
          carlyeks Carl Yeksigian added a comment -

          Christopher Batey Thanks for trying out the materialized views branch! I've pushed up a rebased branch against trunk (after 8099 was integrated) and have fixed the issue you found.

          iamaleksey Aleksey Yeschenko added a comment -

          Do we have any performance (cstar) numbers yet?

          aboudreault Alan Boudreault added a comment -

          Is it related to the issue I found when adding a node?

          aboudreault Alan Boudreault added a comment -

          I haven't done any performance tests at the moment. Carl Yeksigian, have you done some? If not, I guess I can do that this week.

          carlyeks Carl Yeksigian added a comment -

          Alan Boudreault: This was an unrelated issue - it was related to keeping track of whether the MV was built or not; I'm still looking at what is causing the inconsistency when a new node has been added. I have only seen it occur once in my testing, but it would be good to catch.

          Aleksey Yeschenko: I haven't done any cstar testing yet; have only done local stress. Would be good to run that, though I don't think we can run it through the web interface as we'll need to create the MV. /cc Ryan McGuire

          jkrupan Jack Krupansky added a comment -

          1. Are MV updates still eventually consistent (not guaranteed)?

          2. Is there any way for the app to assure that the MV updates have been completed to some desired CL?

          3. Will a repair of the base table assure that all MVs are consistent?

          4. Can a single MV be repaired to assure that it is consistent? (Especially since the data for a MV on a node will be derived from data on other nodes due to differences in the partition keys.)

          Great to see such an exciting new feature take shape!

          carlyeks Carl Yeksigian added a comment -
          1. MVs use the batchlog in order to provide eventual consistency
          2. MV updates follow the same CL as the base update
          3. No, repairs to the base table only assure that the base table is consistent; the same goes for a repair on an MV

          In the patch, there is no way to do a repair between the MV and the base table other than dropping and recreating the view.
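
          In other words, the only base-versus-view resync path in the patch is the blunt one. A sketch, with hypothetical names and assuming a DROP MATERIALIZED VIEW statement mirroring the CREATE syntax:

          DROP MATERIALIZED VIEW users_by_state;

          CREATE MATERIALIZED VIEW users_by_state AS
          SELECT state, user_id, name FROM users
          PRIMARY KEY (state, user_id);

          -- the view build then repopulates in the background from the (repaired) base table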

          JoshuaMcKenzie Joshua McKenzie added a comment -

          In the patch, there is no way to do a repair between the MV and the base table other than dropping and recreating the view.

          Would this be worth a follow-up ticket, or would the repair overhead / time frame be long enough that it makes sense just to recommend people drop/recreate MV after repair? If so, would it make sense to bolt that functionality as an option onto the repair process?

          carlyeks Carl Yeksigian added a comment -

          It should at the very least be a follow-up ticket; it deserves a discussion, including what the overhead will be, and whether it is worth it.

          jbellis Jonathan Ellis added a comment -

          Can't we wire up streamed updates to MV replication to make that unnecessary, the way we do with 2i?

          carlyeks Carl Yeksigian added a comment -

          That just ensures that the values which are currently in the base table will also be in the MV, but not that extraneous values present in the MV are removed. We can also repair the MV using the current repair process, which should fix the inconsistencies which streaming the updates to the MV would fix.

          sebastian.estevez@datastax.com Sebastian Estevez added a comment - - edited

          2is are a pain point during bootstraps and repairs because they are rebuilt after streaming. Will this be the case for MVs as well?
          Ideally, they could be streamed too; that is a much less intensive operation, both CPU- and IO-wise.

          JoshuaMcKenzie Joshua McKenzie added a comment -
          Overall:

          Needs comments. Class level, javadoc where appropriate, etc.

          CFMetaData
          • duplicate entry for comparison if triggers added in CFMetadata.triggers
          • nit: space after CFMetaData.getMaterializedViews function end
          MaterializedViewManager
          • buildIfRequired: scope of whether it's required or skipped isn't in this method or within view.build(), so it looks like it's just building (or at least submitting the builder to the CompactionManager)
          • You're using SystemKeyspace.setIndexRemoved to mark the MV removed in removeMaterializedView but not the parallel setIndexBuilt to set them as built.
          • allViews has unnecessary genericity
          • nits:
            • Some unused imports
            • Consider renaming reload to indicate what it's reloading (e.g. reloadViews, reloadChickens, reloadDolphins)
          MaterializedView
          • createForDeletionInfo:
            • Consider caching/precomputing the results of CFMetaData.hasComplexColumns rather than iterating through them twice on each call (once on CFMetaData.hascomplexColumns, again in for loop in createForDeletionInfo)
              • Perhaps hooking into addOrReplaceColumnDefinition, removeColumnDefinition and updating MV's with the data, keep a set inside MaterializedView and reference that
          • Don't need MaterializedView.reload() as it's just a passthrough to build and used in 1 place
          • Tighten up scoping on member variables - some package private that can be explicitly private
          • ctor: Duplication in the ctor in treatment of MVDefinition partition and clustering columns - could use a slight refactor to a function.
          • createTombstone / createComplexTombstone: refactor out building viewClusteringValues into method to remove duplication
          • targetPartitionKey: collapse spacing on:
            return viewCfs
               .partitioner
               .decorateKey(CFMetaData
                              .serializePartitionKey(viewCfs
                                                   .metadata
                                                   .getKeyValidatorAsClusteringComparator()
                                                   .make(partitionKey)));
            

            to something like:

            return viewCfs.partitioner.decorateKey(CFMetaData.serializePartitionKey(viewCfs.metadata
                                                                                    .getKeyValidatorAsClusteringComparator()
                                                                                    .make(partitionKey)));
            
          • createPartitionTombstonesForUpdates: Can simply return mutation @ end of function
          • createForDeletionInfo: unused parameter consistency
          • Inconsistent naming - using mutationUnits scattered throughout vs. LiveRowState
          • Rename query to queryMaterializedViewData or something similar that denotes what you're querying. Perhaps buildMVData
          • nits:
            • Some unused imports
            • ctor: Change nonPrimaryKeyCol to allPrimaryKeyCol as true, flip to false when non found so we're not assigning a double-negative to targetHasAllPrimaryKeyColumns
            • extra space in MaterializedView.createForDeletionInfo
            • ctor declaration fits on 1 line < 120 char, same for some other lines that have been wrapped
            • hte in javadoc for targetPartitionKey
          MaterializedViewDefinition
          • Consider caching sets of columns in the MV rather than pulling from CFMetaData and iterating. This would allow for faster lookup and simpler code than having to iterate across all members (MaterializedViewDefinition.selects for instance).
          • Consider using Sets of columns rather than Lists internally, would remove a lot of O(N) lookups and duplicate code logic scattered throughout the class (selects, renameColumn, etc)
          • selects(): if included.isEmpty() is apparently an indicator that it includes all. We should 1) comment that and 2) wrap a method around that behavior, e.g. 'boolean isAllInclusive()' that documents that behavior in the code.
          • nits:
            • I prefer copy constructors to a .copy() method and I think I've seen us err on the side of the copy constructor. I could be wrong here.
            • assert included != null and !isEmpty on ctor
            • extra whitespace in renameColumn @ top of method, @ end of method
          LiveRowState
          • The name conflates with LivenessInfo (at least for me) which we discussed on IRC. My biggest hurdle to grokking this class was un-plugging that and re-plugging the idea that it's really about materializing the data for the MaterializedView. Perhaps rename to something like TemporalRow, with LRSCell becoming TemporalCell? While LRSCell doesn't really have any temporal data above and beyond a general Cell, it does contain its own reconcile that takes temporality into account which makes sense.
          • Instead of addUnit / getExistingUnit in LiveRowState.Set, if going with the above these could be named addRow, getExistingRow
          • Think we can get rid of getInternedUnit entirely and just have a section in addUnit to initialize empty entries
          • Consider having LRSCell implement db.rows.Cell and adding a Conflicts.resolveCells(Cell left, Cell right, int nowInSec) method that passes through to Conflicts.resolveRegular. Simplifies a few different users of the method throughout the code-base.
          • Can do away w/PrivateResolver vs. Resolver. LRSCell being private should prevent non-class anonymous instantiations of the interface.
          • Various unused methods and constructors
          • Comment on whether or not retention of ordering is important in baseSlice as you could do away w/interim ByteBuffer array if not. I believe it is, but worth clarifying.
          • nit: whitespace inconsistencies in values(...)

          I definitely have more to review but figured I'd get the feedback I have thus far into here. I'm considering doing subsequent reviews as a branch I push to github w/comments inline (similar to what Benedict's been doing lately), as putting this quantity of feedback in Jira comments is a burden for both you and me - thoughts?

          thobbs Tyler Hobbs added a comment -

          I'm considering doing subsequent reviews as a branch I push to github w/comments inline (similar to what Benedict's been doing lately) with feedback as putting this quantity of feedback in Jira comments is a burden for both you and I - thoughts?

          We should probably discuss this on the dev ML, but here are my thoughts. It's way more convenient to comment on GitHub, but it's also nice to keep the discussion in Jira so that it's easier to understand past design decisions when you look at tickets. Maybe there's a nice middle ground?

          JoshuaMcKenzie Joshua McKenzie added a comment -

          it's also nice to keep the discussion in Jira so that it's easier to understand past design decisions when you look at tickets

          I was thinking about that. We absolutely don't want to lose discussion history; I was thinking of naming these types of branches <ticketnum>_review and having a policy of never deleting _review branches so they could be referenced for historical purposes. That's not a particularly official/formal approach, however, with no safeguards other than convention, and it also depends on GitHub not archiving or removing old branches, space constraints there, etc.

          The difficulty and inconvenience of transposing review information into a Jira comment is secondary to the fact that this approach takes my comments out of the immediate context of what I'm thinking, requiring a translation from me to here and from here back to whoever reads it; my worry is that there's more chance of something getting lost in translation.

          But yeah - dev ML seems the way to go for this topic.

          jkrupan Jack Krupansky added a comment - - edited

          I don't see updated CQL.textile for CREATE MV on the branch. Coming soon?

          Also, the comment for CREATE MV in CQL.g does not quite match the actual syntax:

          1. Missing the IF NOT EXISTS clause.
          2. Has parentheses around the <columns> list, but SELECT does not have that.
          3. Unclear whether AS or functions are supported in the column name list, but selectStatement would certainly allow that.
          4. Has FROM (<columnName>), which should be FROM <CF>, I think.
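
          For comparison, the syntax the branch appears to accept, combining the example posted earlier in this ticket with the IF NOT EXISTS clause mentioned in point 1 (a sketch of intent, not the final CQL.textile wording):

          CREATE MATERIALIZED VIEW IF NOT EXISTS users_by_age AS
          SELECT age, user_id, x, y, z FROM users
          PRIMARY KEY (age, user_id)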

          aboudreault Alan Boudreault added a comment -

          This morning, I did some initial performance testing to get a first overview. I tested the following scenarios:

          • Writing without MV
          • Writing with 1 MV
          • Writing with 3 MV
          • Writing with 5 MV

          Results: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=cassandra-mv-benchmark-2015-07-08.log&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=341.77&ymin=0&ymax=89079.1

          Briefly, with 1 MV created, which produces more mutations internally, we got 33% of the initial performance. Surprisingly, I don't see any further performance decrease with 3 and 5 MVs, as I would have expected. This leads us to think that it might be caused by a bottleneck (CPU? IO? some internal locks?).

          My next step is to profile what's going on to better understand these results.

          tjake T Jake Luciani added a comment -

          Alan Boudreault can you compare with counters and secondary indexes for some relative performance numbers?

          aboudreault Alan Boudreault added a comment -

          Sure, will do today.

          iamaleksey Aleksey Yeschenko added a comment -

          And, if possible, a comparison with manual denormalisation (for the case where we don't need a read before write and have all the columns known up front to perform both writes).

          aboudreault Alan Boudreault added a comment -

          Okay, I've been working on these comparisons but haven't been able to provide useful results due to an issue I hit. I am doing my benchmarks on ec2 with a cluster of 3 nodes. Basically, I can get realistic and useful results with C* stock (no MV) and C* with a Secondary Index (between 70000 and 85000 op/s). When it comes to testing C* with 1 MV, I get many, many WriteTimeoutExceptions, which results in a performance of 100 operations per second. I have been able to reproduce that 100 op/s locally using a 3-node cluster. The issue doesn't seem to be present when using a single-node cluster.

          I've profiled one of the nodes and it looks like most of the time is spent in io.netty.channel.epoll.EpollEventLoop.epollWait() (around 75% of the time).

          Here's a yourkit snapshot of the first node of the cluster.
          http://dl.alanb.ca/CassandraDaemon-cluster-3-nodes-2015-07-10.snapshot.zip

          I've attached my users.yaml profile that I am using for testing: users.yaml

          Here's the materialized view creation statement:

          CREATE MATERIALIZED VIEW perftesting.users_by_first_name AS SELECT * FROM perftesting.users PRIMARY KEY (first_name);
          

          Here's the stress command I've been using:

          cassandra-stress user profile=/path/to/users.yaml ops\(insert=1\) n=5000000 no-warmup -pop seq=1..200M  no-wrap -rate threads=200 -node 127.0.0.1,127.0.0.2,127.0.0.3
          

          Let me know if I am doing anything wrong or if I can provide anything else to help. I'll provide the benchmarks as soon as I have a workaround for this issue.

          aboudreault Alan Boudreault added a comment -

          Using Carl's branch 6477-rebase

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Thus far, the implementation looks pretty solid. Fairly self-documenting (pending some of the naming issues we discussed above). I have more code to go through, but here is some more feedback in the meantime.

          Questions:

          1. Currently we're submitting the MaterializedViewBuilder to the standard executor in CompactionManager. Is it logical to have these 2 operations share a thread pool and resources?
          2. Why the removal of updatePKIndexes from Cells.java.reconcile?

          Feedback:

          General
          • Why are concurrent_batchlog_writes and concurrent_materialized_view_writes both hard-coded to 32 and not in the yaml?
          AlterTableStatement
          • mv.included.isEmpty() as a check to see if all columns are included is counter-intuitive. Add a helper function mv.included.containsAll() (mentioned prior, may be addressed in subsequent commits)
          • announceMigration: materializedViewDrops is never initialized or used
          CreateMaterializedViewStatement
          • in CreateMaterializedViewStatement.announceMigration, while turning ColumnIdentifier.Raw into ColumnIdentifier, we allow <= 1 non-pk column in a MV partition key however the error message we log on multiple attempts reads "Cannot include non-primary key column '%s' in materialized view partition key". We should log that <= 1 are allowed instead. We should also document why this restriction is in-place in the code.
          • refactor out duplication w/building targetPartitionKeys and targetPartitionColumns w/nonPKTarget
          DropMaterializedViewStatement
          • In findMaterializedView, you don't need to iterate across all the members of cfm.getMaterializedViews().values() as it's a Map, you can just check for whether or not it contains a member at columnFamily() index.
          SSTableIterator
          • l252. Was this addressing a bug you uncovered during development or an accidental change? Your update is changing what we're comparing to indexes.size() by pre-incrementing currentIndexIdx before said comparison.
          MaterializedViewBuilder
          • MaterializedViewBuilder.getCompactionInfo isn't giving us particularly good information about the # Tokens built vs. total. We discussed this offline - needs some comments in the code as to why it's currently limited in this fashion.
          MaterializedViewUtils
          • Should probably add a comment explaining why baseNaturalEndpoints and viewNaturalEndpoints always have the same # of entries, so .get on baseIdx is safe. Otherwise it takes a lot of knowledge about the MV RF implementation to understand it (thinking about future developers here)
          AbstractReadCommandBuilder
          • In .makeColumnFilter() - Why the change to the temporary stack ptr for CFMetaData?
          • It's not immediately clear to me why you changed from the allColumnsBuilder to the selectionBuilder - could use some clarification on that (for me here, not necessarily comments on code)
          ColumnFamilyStore
          • Pull contents of initRowCache into init() -> rather than breaking into 2 separate methods, just have the 1 renamed w/MVM init in it
          Nits:
          • SingleColumnRestriction: unnecessary whitespace changes
          • AlterTableStatement:
            • extraneous whitespace in announceMigration
            • tab in announceMigration modified to break 120 char
          • CreateMaterializedViewStatement:
            • unused imports
            • Double whitespace between methods
          • DropMaterializedViewStatement: "Cannot drop non existing" should be "Cannot drop non-existent"
          • CompactionManager: Need space after submitMaterializedViewBuilder method close
          jkrupan Jack Krupansky added a comment -

          we allow <= 1 non-pk column in a MV partition key however the error message we log on multiple attempts reads "Cannot include non-primary key column '%s' in materialized view partition key". We should log that <= 1 are allowed instead.

          Wow, is that really true? It sounds like a crippling restriction. Is that simply a short-term expediency for the initial release or a hard-core long-term restriction?

          Just as an example if I had a table with name, address and id, with id as the primary key, I couldn't have an MV with just name or just address or just name and address as the partition key, right?
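
          For concreteness, a sketch of how I read that restriction (hypothetical table, using the syntax seen elsewhere in this ticket):

          CREATE TABLE users (id int PRIMARY KEY, name text, address text);

          -- a single non-PK column (name) in the view key: the allowed case
          CREATE MATERIALIZED VIEW users_by_name AS
              SELECT * FROM users PRIMARY KEY (name, id);

          -- name and address together in the view primary key: rejected today
          CREATE MATERIALIZED VIEW users_by_name_address AS
              SELECT * FROM users PRIMARY KEY ((name, address), id);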

          In particular, this restriction seems to preclude pure inverted index MVs - where the non-key content of the row is used to index the key for the row.

          Still waiting to read an updated CQL spec - especially any such limitations.

          jbellis Jonathan Ellis added a comment -

          A couple points from Christopher Batey's blog post:

          You can include any part of the original primary key in your MV primary key and a single column that was not part of your original primary key

          I think this is the same thing Jack is pointing out above. Is there a compelling technical reason behind this? We lose a lot of flexibility if we can't have multiple non-pk columns in either partition key or clustering.

          Any part of the original primary key you don't use will be added to the end of your clustering columns to keep it a one to one mapping

          IMO we should return an error and let them correct it rather than silently changing the definition from what was given.
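
          To make the silent rewriting concrete, a hypothetical illustration (table and view names invented):

          CREATE TABLE readings (sensor int, time timestamp, value double, PRIMARY KEY (sensor, time));

          -- the user asks for this...
          CREATE MATERIALIZED VIEW readings_by_value AS
              SELECT * FROM readings PRIMARY KEY (value, sensor);

          -- ...but because 'time' (part of the base primary key) was omitted, it is silently
          -- appended, so the view is effectively defined as PRIMARY KEY (value, sensor, time)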

          If the part of your primary key is NULL then it won't appear in the materialised view

          This also sounds like something we should fix or it will bite people in unpleasant ways.

          tjake T Jake Luciani added a comment -

          You can include any part of the original primary key in your MV primary key and a single column that was not part of your original primary key

          The reason is we don't have a way to enforce NOT NULL for non-PK columns. If you have two non-PK columns as part of the view PK, how can you update the view when the other one is NULL?

          I was thinking that if we added a NOT NULL syntax to CQL column definitions, like exists in SQL, we could probably support this.
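
          A rough sketch of what that could look like; purely hypothetical, since no NOT NULL column syntax exists in CQL today:

          -- hypothetical column constraint (not implemented)
          CREATE TABLE users (
              id int PRIMARY KEY,
              name text NOT NULL,
              address text NOT NULL
          );

          -- both columns could then safely appear in a view primary key
          CREATE MATERIALIZED VIEW users_by_name_address AS
              SELECT * FROM users PRIMARY KEY ((name, address), id);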

          iamaleksey Aleksey Yeschenko added a comment -

          But if we know there is an MV with certain columns, we can perform the validation even without NOT NULL, just inferring the constraint from the presence of the MV?

          jjordan Jeremiah Jordan added a comment - - edited

          But what about data inserted before the MV existed? Also what about UPDATEs/INSERTs that don't include all columns? Would you have to include it? What if it already has a value on disk? Why can't MV just deal with null as a valid value?

          tjake T Jake Luciani added a comment - - edited

          What I'm saying is we can't insert nulls into clustering or partition keys; we don't support it. So we can't put multiple non-PK columns into a materialized view PK (of course they can be in the non-PK columns of the view):

          cqlsh:test> create table test(foo text, bar text, baz text, PRIMARY KEY(foo,bar));
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( 'a', 'b', 'c');
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( 'a', null, 'c');
          InvalidRequest: code=2200 [Invalid query] message="Invalid null value for clustering key part bar"
          cqlsh:test> INSERT INTO test(foo, bar, baz) VALUES ( null, 'b', 'c');
          InvalidRequest: code=2200 [Invalid query] message="Invalid null value for partition key part foo"
          cqlsh:test> 
          
          
          iamaleksey Aleksey Yeschenko added a comment -

          We can disallow such updates (ones that don't have all the columns that are in a PK in one of the views). For data that's already there, we'll error out during the build, and have the operator correct it.

          tjake T Jake Luciani added a comment -

          and have the operator correct it.

          How can you correct it? Re-insert all the data? That doesn't sound like a good plan, especially considering you'd have a half-broken MV sitting there.

          iamaleksey Aleksey Yeschenko added a comment -

          I must be missing something.

          If you have just one base-table regular column in the MV's partition key (the current restriction), and issue a CREATE MATERIALIZED VIEW, and not all of the rows have that column present (it's null) - what happens to the rebuild then?

          tjake T Jake Luciani added a comment -

          and not all of the rows have that column present (it's null) - what happens to rebuild then?

          Well, by definition the MV row is then null because part of the PK is null. But in the existing case this is a 1:1 relationship, since we only allow one nullable column in the PK of the MV (when the base is null, the view is null). If we applied the same idea to multiple nullable fields, we would end up with a much more confusing situation.
          Consider building from inconsistent replicas:
          Let's say you put non-PK columns A and B into a materialized view. Replica 1 has all of column A, replica 2 has all of column B. The build would end up with no data in the MV. You would need to subsequently repair the data to build the MV.

          Also, in general I'm not sure ATM how to support things like multiple conflicting TTLs across non-PK columns.

          JoshuaMcKenzie Joshua McKenzie added a comment -

          Hate to interrupt this lively debate about null handling, but I have more feedback for Carl.

          Last set of style/lower-level feedback here. Going to start tracing the various paths and confirming that the current logic affords the guarantees we think it does, as well as looking into the schema a bit more.

          A couple of questions:

          1. How clear to users is it that CAS operations are disabled by adding a MV to a CF? Is this going to be documentation-specific or is this going to be something we pop up in CQL as well?
          2. In MaterializedViewLongTest, why are you ignoring all Throwables from the BatchLogManager.instance.startBatchlogReplay().get() call?
          MaterializedView
          • Consider renaming cfModifiesSelectedColumn. Maybe "updateAffectsView(AbstractPartitionData apd)"?
            • Same w/MaterializedViewManager.cfModifiesSelectedColumn. "cfModifies" is unclear - what's the cf being referred to here, the legacy "ColumnFamily" nomenclature that's no longer used, for what is actually a PartitionUpdate?
          DeletionTimeArray
          • Looks like an accidental change to isLive snuck in
          Keyspace
          • I have some concerns about the garbage/performance impact of a call to MaterializedViewManager.touchesSelectedColumn on each Keyspace.apply. While I'm fine letting MV performance optimization efforts be a follow-on effort (largely due to the fact that the MV code is well abstracted), I'd like some more testing / assurances that the places where we've modified the critical path don't introduce performance regressions.
          RowUpdateBuilder
          • Need to rephrase the following error message: "since no clustering hasn't been provided". I realize you just copied the one that was already in addListEntry, but may as well fix the double-negatives while we're here.
          SystemKeyspace
          • Consider renaming the schema values for MATERIALIZEDVIEW_BUILDS and BUILT_MATERIALIZEDVIEWS to be consistent w/other schema entries. Maybe follow COMPACTIONS_IN_PROGRESS w/MATERIALIZED_VIEW_BUILDS_IN_PROGRESS = "materializedviews_builds_in_progress"? The current values don't immediately differentiate themselves from one another to me.
          TruncateVerbHandler
          • We should probably move MV truncation into cfs.truncateBlocking as MV are children of the CFS, rather than have it being a part of the VerbHandler.
          StorageProxy
          • Need clearer differentiation between the 2 wrapBatchResponseHandler methods rather than just a comment. If initiating writes is a side-effect of the method, its name should reflect that.
          • I'm also not sure that comment is accurate as the difference between the 2 methods currently appears to be only around naturalEndpoint determination
          MaterializedViewLongTest
          • ttlTest looks like it should be a regular unit test rather than a long
          MaterializedViewTest
          • Comment on why you're swallowing exceptions
          • Comment tests to indicate the use-cases they're testing
          • Do we really need to sleep in 1000ms increments on a CREATE MATERIALIZED VIEW statement? (This is more a question on expected performance w/a base CF w/100 records than anything)
          • testAllTypes is going to atrophy when new types are introduced
          MaterializedViewUtilsTest
          • Needs comments
          nits
          • Unused imports in Keyspace.java
          • Double-whitespace in Keyspace.apply
          • Unused import in RowUpdateBuilder.java
          • Double-whitespace in StorageProxy.mutateAtomically
          • Remove left over debug logging in StorageProxy.mutateAtomically
          • MaterializedViewLongTest: double whitespacing, unused import
          • MaterializedViewTest: unused imports, double whitespace
          • MaterializedViewUtilsTest: unused imports
          Things I'd like to see performance tested
          • Pre-MV vs. Post-MV patch, regular read/write/mixed throughput and latency
          • Pre-MV vs. Post-MV patch, batchlog stress

          Given the changes to non-MV queries on the critical path, I think it's important we at least confirm we're not introducing any obvious performance regressions to those query types when excluding the MV code-paths from apply calls.

          jbellis Jonathan Ellis added a comment -

          Re null handling: as I said initially, just leaving out data for a null value is broken. It's not good for a single column, and it's worse for multiple columns. [It sounds like you are saying it treats (A, null) == (null, B) == null which is not what we want either.]

          What if we decided that MV partition key will always be encoded as a tuple? Does that get around the "partition key cannot be null" problem if we have tuple(null) instead?

          (I don't see how NOT NULL works except as an extension of INSERTS ONLY, because otherwise you either have to introduce possibly racy read-before-write or you have to require specifying the NOT NULL column on every UPDATE.)
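
          A hypothetical illustration of that last point, assuming a 'name' column were declared NOT NULL because a view keys on it:

          -- the column would have to be re-supplied on every UPDATE, even when it isn't changing
          UPDATE users SET address = '1 Main St' WHERE id = 42;                  -- would be rejected
          UPDATE users SET address = '1 Main St', name = 'alice' WHERE id = 42;  -- required form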

          tjake T Jake Luciani added a comment -

          just leaving out data for a null value is broken.

          I disagree. For example how do you take the token on a null partition key? (assuming your partition key is a non-PK column)

          What if we decided that MV partition key will always be encoded as a tuple?

          It doesn't solve the above problem and if you used it for clustering keys the query syntax would be non-standard (you can't query by base name anymore)

          I don't see how NOT NULL works except as an extension of INSERTS ONLY

          It wouldn't be any different than how we enforce it for the PK today (if a column is a PK column, it can't be null).
          The only restriction is you can't alter the table later to remove a column's NOT NULL. But I'm not sure how tombstones would work; you could only update to a new value.

          jbellis Jonathan Ellis added a comment -

          how do you take the token on a null partition key

          There are a number of options. Make it the first value in the token space, or the same as the token of an empty blob, or the same as the token for a zero-element tuple, for instance. It doesn't really matter since token collisions are acceptable now.

          if you used it for clustering keys the query syntax would be non-standard

          I'm proposing that the encoding should be internal, not exposed to the user. Fundamentally we need to distinguish that (A, null) and (null, B) are different things. Whether we use a tuple or some custom encoding to do that I don't care.

          It wouldn't be any different than how we enforce it for PK today

          Which is exactly the problem: the syntax itself enforces that the PK is specified, because that is how the row being updated is identified. But with this we'd have to require the user to add ", not_null_column=X" as part of the SET clause for every UPDATE.

          slebresne Sylvain Lebresne added a comment -

          On the subject of null in primary keys, I'll first note that post-8099 we do support null for clustering keys internally, so we can (and should) use that.

          Partition keys are arguably a bit more of a problem. My own preference would be to actually do CASSANDRA-9796, as I think we should do that sooner or later anyway and we could make null supported there in a somewhat reasonable way.

          I'm proposing that the encoding should be internal, not exposed to the user.

          But MVs are currently directly exposed as read-only tables. So if we start doing something that is internal only and not exposed to the user, this will get hacky pretty quickly, forcing us to intercept the queries in SelectStatement and do whatever translation is necessary. And I can guarantee we'll regret that for years to come. Which is why I would much prefer a general solution à la CASSANDRA-9796.

          iamaleksey Aleksey Yeschenko added a comment -

          And I can guarantee we'll regret that for the years to come. Which is why I would much prefer a general solution à la CASSANDRA-9796.

          This is my strong preference as well.

          tjake T Jake Luciani added a comment -

          I'll first note that post-8099 we do support null for clustering keys internally, so we can (and should) use that.

          Oh good! That'll be easy then for clustering keys. (we should note that in 3.0 docs)

          Which is why I would much prefer a general solution à la CASSANDRA-9796.

          Yeah, our concern was that trying to support something only allowed for MV tables would end up biting us.

          I'm still not sure how handling a null partition key would work, though, in terms of tokens. If you have a sparse column across millions of rows in the base table, you would end up with a HUGE hotspot, since all NULL keys would end up on the same partition (token X).

          slebresne Sylvain Lebresne added a comment -

          I'm still not sure how handling a null partition key would work though in term of tokens. If you have a sparse column across millions of rows in the base table you would end up with a HUGE hotspot since all NULL keys would end up on the same partition (token X)

          We're saying it could work technically. Your point that it may not be a good idea is valid, but I could argue that creating hotspots by choosing poor partition keys is possible with or without nulls. Creating a MV where the partition key is, for instance, the gender of your users is equally going to create massive hotspots, but it's not like we can forbid people from doing it (and if we were somehow able to enforce NOT NULL (which we can't) and force people to use it, you can be sure some people would just end up using default values and thus generating the exact same problems; there is no such thing as stupid-proofing).

          If we fear that nulls make it easier to shoot yourself in the foot, then maybe we can find some kind of limitation that is not too restrictive but makes that foot-shooting less likely, but I suspect we'll just have to rely on having good best-practice docs.

          brianmhess Brian Hess added a comment -

          Moreover, if you were to handle NULLs in the application tier (say by the application using the string "NULL" or the INT -1 or whatever) you end up with exactly the same hot spot. This is a known issue for distributed systems in general (not just Cassandra), so I would note it and move on - just like we note that GENDER is a bad partition key today.

          tjake T Jake Luciani added a comment -

          Creating a MV where the partition key is for instance the gender of your users is equally going to create massive hotspots, but it's not like we can forbid people to do it.

          This is an obvious anti-pattern for a materialized view. NULL, however, is not something people think about, since it's implicitly the default value in Cassandra, especially when considering consistency issues in a Dynamo-style system. If you are writing to a table with RF=3 and only one replica has a value but the other two do not, then one MV replica will have a defined partition key and the other two replicas will have a NULL. Currently we don't allow this, so the NULLs are left out.

          carlyeks Carl Yeksigian added a comment -

          Currently, NULL is used to insert tombstones into the table:

                  // insert exactly the amount of tombstones that shouldn't trigger an exception
                  for (int i = 0; i < THRESHOLD; i++)
                      execute("INSERT INTO %s (a, b, c) VALUES ('key', 'column" + i + "', null);");
          

          This means that we will have to insert a NULL value for all unset columns, as well as all tombstoned columns, since regular columns are not explicitly null. In addition, we would have to insert new NULL values after a TTL expired.

          slebresne Sylvain Lebresne added a comment -

          NULL, however, is not something people think about since it's the default value implicitly for Cassandra.

          Again, I'm not trying to say that it's not something we should consider. I'm mainly saying that imo we should do CASSANDRA-9796 anyway so we don't have the internal limitation. Maybe we can impose limitations for the sake of protecting against foot-shooting, like requiring that at least one column of the MV partition key is a primary key column of the base table (without requiring all of them to be).

          If you are writing to a table with RF=3 and only one replica has a value but the other two do not then we will make one MV replica have a defined partition key and the other two replicas will have a NULL

          Not sure I follow. Nulls should just be "just another value" and there shouldn't really be any special casing for them. How is what you say different from writing a new value at RF=3 and only one replica getting the new value?

          tjake T Jake Luciani added a comment -

          How is what you say different from writing a new value at RF=3 and only one replica getting the new value?

          Because one isn't implicitly a hot spot

          slebresne Sylvain Lebresne added a comment -

          Because one isn't implicitly a hot spot

          Ok, gotcha: an MV table where the partition key is a single column that is not a PK from the base table will create implicit hot spots. I'm personally happy to disallow that case. That's still a very specific case: MV tables with multiple partition key columns where one of them is a PK in the base table (but not all of them are) don't have this problem, so the underlying question of how we support that is still open.

          we would have to insert new NULL values after a TTL expired

          Good point. I can definitely see how TTLs are a problem for including non-PK columns from the base table in a MV PK. That said, as far as I can tell that's a problem even with a single column that is not in the original PK. Yet, it's supposed to be supported by the patch. How does it work? What happens once that column expires?

          This means that we will have to insert a NULL value for all unset columns, as well as all tombstone'd columns, since regular columns are not explicitly null.

          That part I don't follow.

          carlyeks Carl Yeksigian added a comment -

          That said, as far as I can tell that's a problem even with a single column that is not in the original PK. Yet, it's supposed to be supported by the patch. How does it work? What happen once that column expires?

          Because that is the only column which is supported, we TTL the row in the view so that it is taken care of the same way as the base TTL. The difference here would be that there would also need to be a new value created, which is the NULL that the TTL will result in after it has expired. We would either need special logic to handle a TTL on compaction (and lose some eventual consistency), or generate a future-timestamped value for the NULL whenever we create a TTL, which would become valid when the TTL expires on the base and view columns.

          That part I don't follow.

          We don't have semantics for regular columns to be set to null; instead, that indicates that they are tombstones. We can't distinguish between an explicitly set NULL and one that happened because we either tombstoned the columns or never set the included columns, so we will have to use the NULL value for all unset and tombstoned columns.

          iamaleksey Aleksey Yeschenko added a comment -

          I've got an issue with how MVs are represented in schema, but, more user-visibly, a minor issue on the DDL side. Or, rather, two issues:

          1. We have CREATE MATERIALIZED VIEW, DROP MATERIALIZED VIEW, but we reuse ALTER TABLE to modify it. We should have ALTER MATERIALIZED VIEW instead.
          2. It's currently impossible to override things like compaction and compression parameters in CREATE MATERIALIZED VIEW. Users shouldn't need to issue a follow up ALTER MATERIALIZED VIEW to set the parameters. CREATE MATERIALIZED VIEW should support WITH.
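
          A sketch of what that DDL could look like (option values purely illustrative):

          CREATE MATERIALIZED VIEW users_by_name AS
              SELECT * FROM users PRIMARY KEY (name, id)
              WITH compaction = {'class': 'LeveledCompactionStrategy'};

          ALTER MATERIALIZED VIEW users_by_name
              WITH compaction = {'class': 'SizeTieredCompactionStrategy'};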

          slebresne Sylvain Lebresne added a comment -

          Ok, let me back up a minute here. If I understand correctly, the current semantics are that if a view has a non-PK column from the base table, then the MV only has an entry if that non-PK column is not null? In other words, given:

          CREATE TABLE t (k int, t int, v int, PRIMARY KEY (k, t));
          CREATE MATERIALIZED VIEW v AS SELECT * FROM t PRIMARY KEY (t, v, k);
          
          INSERT INTO t (k, t, v) VALUES (0, 0, 0) USING TTL 1000;
          UPDATE t SET v = null WHERE k = 0 AND t = 0;
          

          If you do:

          SELECT * FROM v;
          

          you get no results, even though the view was created with a SELECT * and there is a row in the base table. Is that how this currently works?

          Perhaps more importantly, is it how we want it to work?

          That is, doesn't it feel a bit inconsistent not to have any entry in the view? In particular, it means that if you query the view to find how many rows have t == 0, you won't get an accurate answer. Wouldn't it be better to get:

          SELECT * FROM v;
          t | v    | k
          -------------
          0 | null | 0
          

          Because that is the only column which is supported, we TTL the row in the view so that it is taken care of the same as the base TTL.

          Assuming we are fine with the semantic of "the view only has a row if the base table has a row and none of the non-PK columns of the base table that are in the MV PK are null" (which I'm not sure I am, and is an important decision for sure), then I think we'd be fine with more than one column on that front by simply using the min TTL of all the columns.

          carlyeks Carl Yeksigian added a comment -

          Aleksey Yeschenko: I added support to allow for changing the table options with CREATE MATERIALIZED VIEW, so it is possible to set those options when creating the view.

          Sylvain Lebresne:

          Is that how this currently works?

          Yes, this is how it currently works; if we have any NULLs in the view primary key, we do not insert that row into the view.

          Changing this would also imply that

          INSERT INTO t (k, t) VALUES (0, 0);
          

          would create a row in the view as well for (0, null, 0).

          Assuming we are fine with the semantic of "the view only has a row if the base table has a row and none of the non-PK columns of the base table that are in the MV PK are null" (which I'm not sure I am, and is an important decision for sure), then I think we'd be fine with more than one column on that front by simply using the min TTL of all the columns.

          It works this way under the assumption that we don't have to represent the null columns which are introduced when the currently TTL'd columns expire; if that changes, we'll have to redo all of the current TTL work, which is fine as long as we have both a clear idea of the work involved to introduce this, and a real use case we are trying to support.

          Overall, I'd be more surprised by a materialized view including rows for values that were never set, or that were specifically deleted, than by it not including those values.

          slebresne Sylvain Lebresne added a comment -

          Changing this would also imply that

          INSERT INTO t (k, t) VALUES (0, 0);
          

          would create a row in the view as well for (0, null, 0).

          Absolutely. And to be clear, all I'm saying is, we should make sure we're all in agreement and comfortable with the semantic we implement (and we should be clear on what that semantic is exactly). To have every row in the base table have a counterpart in the MV is something I would have expected. Are we sure we are fine with giving up that property? Or to put it another way, if we do give up that property, I'd like to make sure we understand why we are doing so, and I'm not all that clear on that. Is everyone except me convinced that a null for a column that is part of the MV primary key should mean the row is not in the MV?

          jbellis Jonathan Ellis added a comment -

          To have every row in the base table have a counterpart in the MV is something I would have expected.

          +10, this is the fundamental proposition that users will expect.

          brianmhess Brian Hess added a comment -

          A question about behavior here. If we do the following:

          INSERT INTO t (k, t, v) VALUES (0, 0, 0) USING TTL 1000;
          UPDATE t SET v = null WHERE k = 0 AND t = 0;
          INSERT INTO t (k, t) VALUES (1, 1) USING TTL 1000;
          

          Then, doing a

          SELECT * FROM t;
          

          would result in getting both rows: (0, 0, null) and (1, 1, null)

          However, if we have the view also have a null for the first row, so it would be (0, null, 0), wouldn't we also expect to see a row for (1, null, 1)?
          The fact that a cell was non-null and is now null shouldn't really have any different behavior than if it was null from the start.

          Another point - I think the primary use case will be to mimic query tables. That is, the MV will have the same columns in its primary key as the base table's, but with different partition keys and different orders for clustering columns. I could be wrong, but that seems like the biggest use case. And in that use case the keys are all non-null.
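          For example, a query-table-style view over the t table used earlier in this thread would just reorder the existing primary key columns (a sketch, using the same syntax as the examples above):

          CREATE TABLE t (k int, t int, v int, PRIMARY KEY (k, t));

          -- same primary key columns as the base table, with t promoted to partition key
          CREATE MATERIALIZED VIEW t_by_t AS SELECT * FROM t PRIMARY KEY (t, k);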

          slebresne Sylvain Lebresne added a comment -

          However, if we have the view also have a null for the first row, so it would be (0, null, 0), wouldn't we also expect to see a row for (1, null, 1)?
          The fact that a cell was non-null and is now null shouldn't really have any different behavior than if it was null from the start.

          Certainly. Having a column null is in CQL by definition the equivalent of the column not having a value. I'm not in any way suggesting we should make a distinction here.

          That is, the MV will have the same columns in its primary key as the base table's, but with different partition keys and different orders for clustering columns. I could be wrong, but that seems like the biggest use case. And in that use case the keys are all non-null.

          The question is whether this should be the only supported case. If you read the comments above, some people strongly disagree. So we're debating how it should work when that's not the case and how to support it. Everybody agrees the case you're describing is the easy case.

          tupshin Tupshin Harper added a comment -

          I find myself disagreeing with the hard requirement that all rows in the table must show up in the materialized views. While it would be nice, I believe that clearly documenting the limitation and providing a couple of reasonable choices is far preferable to handing the user enough rope to hang themselves.

          My suggestion:

          • Create a formal notion of NOT NULL columns in the schema that can be applied to a table, irrespective of any MV usage.
          • Columns that are NOT NULL would have the exact same restrictions as PK columns, namely that they need to be included in all inserts and updates (with the possible exception of LWT updates)
          • Document (and warn in cqlsh) the fact that if you create a MV with a PK using a nullable column from the table, then those values will not be in the view

          It seems to me that this is far less dangerous (and in many ways less surprising) than automatically creating a hotspot in the MV because lots of data with NULLs gets added.

          Now with 8099 supporting NULLs for clustering columns, this might only apply to columns that would be a partition key in the MV, and that seems appealing. But I can't talk myself into liking inserting nulls into a MV partition key.
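          For illustration, the NOT NULL idea proposed above could look like this (hypothetical syntax; NOT NULL on a regular column is not part of CQL):

          CREATE TABLE users (
          id uuid PRIMARY KEY,
          ssn text NOT NULL,   -- hypothetical: would have to be supplied on every INSERT/UPDATE
          name text
          );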

          jbellis Jonathan Ellis added a comment - - edited

          I don't think this is the right solution, because sometimes denormalizing the null as a partition key is the right thing to do. Which means we do need to support using columns that are not marked NOT NULL in the MV PK. Adding a warning in that case is appealing, but I think history shows that this won't actually help the people who need it. (See: everyone shooting themselves in the foot with ALLOW FILTERING queries.)

          So ultimately, if we're not going to disallow it entirely (and we shouldn't), then we really haven't gained very much. Especially if the band-aid is as ugly as this one.

          mbroecheler Matthias Broecheler added a comment -

          My 2 cents on the null discussion:
          It needs to be supported that rows are ignored in the view if their column values would lead to null values in the primary key of the view. Otherwise, view maintenance becomes impractical for sparsely populated tables. Imagine you have a large table of users but only 10% or so of those have an SSN. If you create a view by SSN and include the rows with a null SSN, you are creating a hotspot which makes it impractical.
          However, there are situations where including null values is perfectly reasonable under the assumption that the combination of the remaining primary key column values leads to a sufficiently high selectivity. Problem is, without the user giving us some sort of information the two cases won't be distinguishable.
          Hence, it may make sense to introduce a WITH NULL or WITHOUT NULL qualifier in the view definition to give the user control over the behavior, with WITHOUT NULL being the default option since it's safer (i.e. lower likelihood of hotspots).
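          For illustration, the proposed qualifier could read along these lines (hypothetical syntax, not implemented; table and column names are made up):

          -- WITHOUT NULL (the suggested default) would skip rows whose ssn is null
          CREATE MATERIALIZED VIEW users_by_ssn AS
          SELECT * FROM users
          PRIMARY KEY (ssn, id)
          WITHOUT NULL;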

          jbellis Jonathan Ellis added a comment -

          It needs to be supported that rows are ignored in the view if their column values would lead to null values in the primary key of the view. Otherwise, view maintenance becomes impractical for sparsely populated tables.

          This is easily supported syntactically without special directives:

          CREATE MATERIALIZED VIEW users_by_ssn AS
          SELECT ...
          FROM users
          WHERE ssn IS NOT NULL
          PRIMARY KEY (ssn, user_id)
          

          Support for WHERE clause in MV is on the roadmap.

          jbellis Jonathan Ellis added a comment -

          ... actually I think that is how we should deal with this for 3.0.

          For forwards compatibility, we should require WHERE x IS NOT NULL for all partition key columns.

          When we actually add support for null partition keys we can lift this requirement. And then we can extend WHERE to be more general as well.
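          Applied to the earlier example, where t is the view's partition key, the definition would then have to read something like this (sketch):

          CREATE MATERIALIZED VIEW v AS
          SELECT * FROM t
          WHERE t IS NOT NULL
          PRIMARY KEY (t, v, k);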

          jbellis Jonathan Ellis added a comment -

          (Created 9809 for full WHERE support.)

          mbroecheler Matthias Broecheler added a comment -

          +1

          slebresne Sylvain Lebresne added a comment -

          When we actually add support for null partition keys we can lift this requirement.

          I do want to note that there is a technical problem with allowing null in the MV primary keys (partition key or clustering columns alike), and that is TTLs. More precisely, if a column c in the MV PK has a TTL, we would need to update the MV once c expires, replacing the entry by one where c is now null. And a priori, I don't see an even remotely efficient way to deal with that (keeping in mind that we collect expired columns lazily, not when they expire).

          That is, I agree that requiring IS NOT NULL for all primary key columns is the safe option for 3.0, but just wanted to note that the TTL problem above seems to me like a big obstacle in removing that requirement.

          jbellis Jonathan Ellis added a comment -

          So you're saying that IS NOT NULL should treat TTL'd columns as if they were already null? That is a little confusing but I can live with it.

          slebresne Sylvain Lebresne added a comment -

          So you're saying that IS NOT NULL should treat TTL'd columns as if they were already null?

          I'm not; that would be terribly confusing (I personally couldn't live with it). I'm saying that if initially we only support the case when all MV PK columns have IS NOT NULL, then in the presence of TTL, a row should disappear from the MV as soon as the first MV PK column expires. So, as I said earlier, I'm pretty sure we can get by by TTLing the whole MV row with the smallest TTL of an MV PK column (Carl hinted at some problem with that suggestion, but I haven't understood his point. So we should discuss that objection, but 'a priori' I think that suggestion works).

          However, I'm also saying that I don't know how to handle that expiration case as soon as we don't have the IS NOT NULL condition on some of the columns, because in that case, when a column expires, we should replace the existing MV entry with a new one that has a different PK (one with a null for the newly expired column), and that's hard to do.
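          To illustrate the min-TTL idea from the first paragraph with the earlier schema (a sketch of the intended behaviour, not of the implementation):

          -- v is the only non-PK base column in the view PK, so the view row
          -- (t=0, v=0, k=0) would carry the same 1000s TTL and expire with it;
          -- with several TTL'd columns in the view PK, the view row would use
          -- the smallest of their TTLs
          INSERT INTO t (k, t, v) VALUES (0, 0, 0) USING TTL 1000;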

          jbellis Jonathan Ellis added a comment -

          Makes sense.

          carlyeks Carl Yeksigian added a comment -

          Carl hinted at some problem with that suggestion, but I haven't understood his point. So we should discuss that objection, but 'a priori' I think that suggestion works

          I was trying to explain the same issue that you are describing here – not how it currently works. It is currently implemented the way that you have described it (TTL on the MV PK).

          slebresne Sylvain Lebresne added a comment -

          I was trying to explain the same issue that you are describing here – not how it currently works.

          Cool then, sorry for the confusion. I guess the one small thing that is not implemented, though, is that you currently only allow a single column that is not a base table PK column in the MV PK, while we could allow as many as we want as long as we take the min TTL. Or is that also already implemented?

          jkrupan Jack Krupansky added a comment -

          What CL will apply when MV rows are deleted on TTL expiration?

          Presumably each of the replicas of the base table will have its TTL expire at roughly the same time, with each local change presumably triggering a delete in the MV; but the MV has replicas as well.

          Maybe ANY is a reasonable CL for the MV update on TTL expiration, since the app is not performing an explicit operation with explicit expectations.

          carlyeks Carl Yeksigian added a comment -

          Right now, it is doing the max instead of min, but I think that I had gotten them mixed up, so I'll make sure it is correct.

          TTL expiration is automatic, as the TTLs are known from the update and are just included with the MV update. The discussion was on how, after a TTL expires and we are persisting null values, it should be handled.

          slebresne Sylvain Lebresne added a comment -

          The discussion was on how, after a TTL expires and we are persisting null values, it should be handled.

          I'd argue there has been more than one discussion.

          All I want is to make sure we agree the following changes should be made to the patch of this ticket (if they haven't been done already):

          1. lift the limitation of only "a single column that was not part of your original primary key".
          2. force users to include all columns from the original PK in the MV PK. In other words, I agree with Jonathan's remark above that "we shouldn't silently change the definition from what was given".
          3. require WHERE x IS NOT NULL for every column in the MV primary key.
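          Taken together, a definition over the earlier example table that satisfies points 2 and 3 would look roughly like this (sketch; point 1 simply means more than one such non-PK column would be accepted):

          CREATE MATERIALIZED VIEW v AS
          SELECT * FROM t
          WHERE t IS NOT NULL AND v IS NOT NULL AND k IS NOT NULL
          PRIMARY KEY (t, v, k);
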
          jbellis Jonathan Ellis added a comment -

          +1

          carlyeks Carl Yeksigian added a comment -

          +1, but I will address all of the comments from Josh first before making these changes.

          jkrupan Jack Krupansky added a comment -

          force users to include all columns from the original PK in the MV PK.

          I don't follow the rationale, and that seems over-limiting. For example, if my base table was id, name, and address, with id as the PK, I couldn't have an MV with just name or address as the PK according to this requirement, right?

          jjordan Jeremiah Jordan added a comment -

          I don't follow the rationale, and that seems over-limiting. For example, if my base table was id, name, and address, with id as the PK, I couldn't have an MV with just name or address as the PK according to this requirement, right?

          You have to have all the PK columns from the base table in the clustering of the MV so that you can track deletions/updates correctly.

          slebresne Sylvain Lebresne added a comment -

          For example, if my base table was id, name, and address, with id as the PK, I couldn't have an MV with just name or address as the PK according to this requirement, right?

          That's correct. We have no way of guaranteeing that your "just name or address" uniquely identifies a row in the base table (and most probably, it doesn't). So what would we do when multiple entries in the base table conflict in the MV? The only option would be to keep the most recent entry, but that would be confusing as hell and we don't want to do that (at the very least, not in that initial ticket). And while we could allow some form of aggregation in that case, that is also definitely at best a future feature.

          But more importantly, it's not really limiting at all. If you want an MV to search users by name, for instance, just put the name first in the MV PK. Having the rest of the columns in the PK won't prevent you from searching by name; it just guarantees that you see all the users that have that name, not just the last one updated.
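          Concretely, for the id/name/address example that would look something like this (a sketch; the column types are assumed):

          CREATE TABLE users (id uuid PRIMARY KEY, name text, address text);

          -- name leads the view PK so it can be queried by name; id keeps one view
          -- entry per base row
          CREATE MATERIALIZED VIEW users_by_name AS
          SELECT * FROM users
          WHERE name IS NOT NULL AND id IS NOT NULL
          PRIMARY KEY (name, id);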

          brianmhess Brian Hess added a comment -

          Just for clarity, Sylvain Lebresne - you'd have to make the name the partition key and the rest of the base table's primary key columns the clustering columns, correct?

          slebresne Sylvain Lebresne added a comment -

          [brianmhess] Yes typically.

          And to clarify, I'm not saying lifting that constraint is impossible or unthinkable, just that we shouldn't lift it before having made sure it's 1) actually useful and 2) not gonna limit us in the future.

          tjake T Jake Luciani added a comment - - edited

          We've made some performance related changes to the write path. I'll walk through the results.
          (thanks to Alan Boudreault for running these benchmarks)

          • Fixed timeout bug (this caused all the timeout exceptions we were seeing)
          • Removed the need for a Coordinator batchlog when running RF=1
          • Made the Base -> View batchlog update local only. Since each base replica is paired to a view replica we only need to guarantee durability of the batchlog on the local base node.
          • Made the base -> view mutations async. Once we write to the local batchlog, we don't care if the actual mutations are sent, it's best effort. So we can fire and forget these and update the base memtable.

          The performance hit is caused roughly 50% by the batchlog and 50% by the read-before-write.

          TL;DR: anything we can do to make the BL or reads faster in trunk will translate into improvements on MV writes.

          I'll start with the graph that I think is most relevant. RF=3 Batchlog vs RF=3 Materialized View

          http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=mv-2015-07-15%2Fcassandra-mv-vs-si-benchmark-rf3-batchlog.log&metric=op_rate&operation=3_user_logged&smoothing=1&show_aggregates=true&xmin=0&xmax=1092.08&ymin=0&ymax=20973.7

          50% slower than 2i. >60% slower than regular writes


          This next graph shows RF=1 results (These are not a realistic benchmark IMO but these are the numbers)

          http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=mv-2015-07-15%2Fcassandra-mv-vs-si-benchmark-rf1.log&metric=op_rate&operation=3_user&smoothing=1&show_aggregates=true&xmin=0&xmax=621.28&ymin=0&ymax=131803.1

          50% slower than 2i. >60% slower than regular writes

          tjake T Jake Luciani added a comment -

          I think a better test will be a BL of size 5 vs no BL with 5 MVs; will run that next.

          jbellis Jonathan Ellis added a comment -

          Removed the need for a Coordinator batchlog when running RF=1

          Why do we need this at all? Since replicas are in charge of updating the MV, normal hints should perform the same function as the batchlog, except without the performance hit in the normal case.

          slebresne Sylvain Lebresne added a comment -

          Why do we need this at all? Since replicas are in charge of updating the MV, normal hints should perform the same function as the batchlog, except without the performance hit in the normal case.

          Allow me to sum up how we deal with consistency guarantees, why we do it this way and why I don't think hints work. I'm sorry if that response is a bit verbose but as this is the most important thing of this ticket imo, I think it bears repeating and making sure we're all on the same page.

          The main guarantee we have to provide here is that MVs are eventually consistent with their base table. In other words, whatever failure scenarios we run into, we should never have an inconsistency that never gets resolved. The canonical example of why this is not a given: we have a column c = 2 in the base table that is also in an MV PK, and we have 2 concurrent updates, A (sets c = 3) and B (sets c = 4). Without any kind of protection, we could end up with the MV permanently having 2 entries, one for A and one for B, which is incorrect (the MV should eventually converge to the update that has the biggest timestamp, since that's what the base table will keep). To the best of my knowledge, there are 2 fundamental components to avoiding such permanent inconsistency in the currently written patch/approach:

          1. On each replica, we synchronize/serialize the read-before-write done on the base table. This guarantees that we won't have A and B racing on a single base-table replica. Or, in other words, if the same replica sees both updates (where "sees" means "does the read-before-write-and-update-MV-accordingly" dance), then it will properly update the MV. And since each base-table replica updates all MV-table replicas, it's enough that a single base-table replica sees both updates to guarantee eventual consistency of the MV. But we do need to guarantee that at least one such base-table replica sees both updates, and that's the 2nd component.
          2. To provide that latter guarantee, we first put each base-table update that includes MV updates in the batchlog on the coordinator, and we only remove it from the batchlog once a QUORUM of replicas has acknowledged the write (importantly, this is not dependent on the CL; eventual consistency must be guaranteed whatever CL you use). That guarantees us that until a QUORUM of replicas has seen the update, we'll keep replaying it, which in turn guarantees that for any 2 updates, at least one replica will have "seen" them both.

          Now, the latter guarantee cannot be provided by hints because we can't guarantee hint delivery in the face of failures. Typically, if I write a hint on a node and that node dies in a fire before that hint is delivered, it will never be delivered. We need a distributed hint mechanism, if you will, and that's what the batchlog gives us.

          slebresne Sylvain Lebresne added a comment -

          Made the base -> view mutations async. Once we write to the local batchlog, we don't care if the actual mutations are sent, it's best effort. So we can fire and forget these and update the base memtable.

          That's correct from an eventual consistency point of view, but I'm pretty sure this breaks the CL guarantees for the user. What we want is that if I write to the base table at CL.QUORUM and then read my MV at CL.QUORUM, then I'm guaranteed to see my previous update. But that requires that each replica do synchronous updates to the MV, and at the user's CL. In particular, writing a local batchlog is not enough since it doesn't give any kind of guarantee about the visibility of the update. See my next comment, though, on that local batchlog.

          Made the Base -> View batchlog update local only

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it? If my reasoning above is correct, the coordinator batchlog is enough to guarantee durability and eventual consistency because we will replay the whole mutation until a QUORUM of replicas acknowledges success.

          rustyrazorblade Jon Haddad added a comment -

          From a user's perspective, I agree with Sylvain that the MV should respect the CL. I wouldn't expect to do a write at ALL, then do a read and get an old record back.

          brianmhess Brian Hess added a comment -

          +1 I think that is the "promise" of the MV.

          benedict Benedict added a comment -

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it?

          The thing is, the coordinator-level batchlog write is quite expensive. It seems we've paired each node with one MV node, but here's an idea: why not also pair it with RF-2 (or 1, and only support RF=3 for now) partners, to whom the initial write must be propagated before it acknowledges? This could be done with a specialised batchlog write that goes to the local node and the paired MV node. That way, most importantly, we do not have to wait synchronously for the batchlog records to be written: if they're lost, then the corruption caused by their loss is also lost.

          tjake T Jake Luciani added a comment -

          I've actually never understood why we do a batchlog update on the base table replicas (and so I think we should remove it, even though that's likely not the most costly one). Why do we need it? If my reasoning above is correct, the coordinator batchlog is enough to guarantee durability and eventual consistency because we will replay the whole mutation until a QUORUM of replicas acknowledges success.

          Yes, if we error out when the base is unable to replicate to the view, then the second BL is redundant. However, there are a few reasons why we did what we did.

          1. Your availability is cut in half when you use an MV with these guarantees. Say I have a 5-node cluster with RF=3 and I want to write at CL.ONE. If I have an MV, I can no longer handle two down nodes, since the paired replica for the one base node might be down.
          2. The cost of replaying the coordinator BL is much higher than replaying the base to replica BL since the latter is 1:1.

          I do agree there is a disconnect in terms of consistency level when using the MV but the batchlog feature was written to handle this. We could support both approaches in terms of a new flag? Or are we willing to take a hit on availability?

          benedict Benedict added a comment -

          We could support both approaches in terms of a new flag?

          I think permitting faster operation with only "eventual consistency" guarantees is a good thing, since most users doing their own denormalisation probably get no better than that. A flag on construction?

          jbellis Jonathan Ellis added a comment -

          From a user's perspective, I agree with Sylvain that the MV should respect the CL. I wouldn't expect to do a write at ALL, then do a read and get an old record back.

          But the other side of that coin is, we're effectively promoting all operations to at least QUORUM regardless of what the user asked for...

          slebresne Sylvain Lebresne added a comment -

          But the other side of that coin is, we're effectively promoting all operations to at least QUORUM regardless of what the user asked for...

          We're not. In the description I made above, we need to wait on QUORUM response to remove from the batchlog, but we don't need to wait on QUORUM to respond to the user. Unless my reasoning is broken, we do respect the CL levels exactly as we should.

          jbellis Jonathan Ellis added a comment -

          1. Paired replica? What?

          2. Under what conditions does replica BL save you from replaying coordinator BL?

          jbellis Jonathan Ellis added a comment -

          Pedantically you are correct. Which is why I said "effectively" and not "literally."

          tjake T Jake Luciani added a comment -

          1. This goes way back to benedicts main concept https://issues.apache.org/jira/browse/CASSANDRA-6477?focusedCommentId=14039757&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14039757

          We have each replica on the base table send the mv update to a single mv replica. So replicas are "paired" 1:1

          2. Since the coordinator BL covers a QUORUM of all base replicas, which will always send to MV replicas, replaying it is a lot more work than only re-sending a failed base-to-view update.
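
          A minimal sketch of the 1:1 pairing idea (illustrative only, assuming both endpoint lists come back from the replication strategy in ring order; these names are not the actual code):

            import java.util.List;

            final class ViewPairingSketch {
                // Each base replica forwards view updates to exactly one view replica:
                // the one at the same position in the view's endpoint list.
                static String pairedViewReplica(List<String> baseEndpoints,
                                                List<String> viewEndpoints,
                                                String thisBaseReplica) {
                    int i = baseEndpoints.indexOf(thisBaseReplica);
                    return viewEndpoints.get(i % viewEndpoints.size());
                }
            }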

          slebresne Sylvain Lebresne added a comment -

          Pedantically you are correct. Which is why I said "effectively" and not "literally."

          Well, I mean, CL has always been more about "when do we answer the client" than "what amount of work we do internally". Every write is always written to every replica, for instance; the CL is just a matter of how long we wait before answering the client. I'm arguing this is exactly the case here too. Anyway, your "the other side of that coin" made it sound like we were doing something unusual regarding the CL, something that may not be desirable. I don't understand what that is, if that's the case.

          slebresne Sylvain Lebresne added a comment -

          I don't really think the cost of replaying the coordinator BL matters that much. We'll only replay it if less than a quorum of nodes answer a particular query, which should be pretty rare unless you have bigger problems with your cluster. And given the local BL has a cost on every write, even if small, I don't think that from a performance perspective a local BL is a win.

          That said, I hadn't seen we'd decided to go with pairing of base replica to MV replica. Doing so does justify a local BL (another option has always been to fan out to every MV replica, and since this ticket desperately misses a good description of what exact algorithm is actually implemented, I wasn't sure which option we went with).

          slebresne Sylvain Lebresne added a comment -

          why not also pair it with RF-2 (or 1, and only support RF=3 for now) partners, to whom it requires the first write to be propagated, without which it does not acknowledge? This could be done with a specialised batchlog write, that goes to the local node and the paired MV node.

          I think I get a vague idea of what you mean but I'm not fully sure (and I'm not fully sure it's practical).

          So let's first make sure I understand. Is the suggestion that to guarantee that if a base-table replica applies an update, then RF/2 other ones also do it, we'd send the update to all base table replicas "normally" (without coordinator batchlog), but each replica would 1) write the update to a local-only batchlog and 2) forward the update to RF/2 other base table replicas?

          tjake T Jake Luciani added a comment -

          We'll only replay it if less than a quorum of nodes answer a particular query, which should be pretty rare unless you have bigger problems with your cluster.

          That said, I hadn't seen we'd decided to go with pairing of base replica to MV replica.

          If we replicate to every MV replica from every base replica the write amplification gets much worse causing more timeouts. So it makes sense to have replication paired.

          I do think waiting for the MV updates to be synchronous will cause a lot more timeouts and write latency (on top of what we have now). But if it's optional then people can choose.
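
          For a rough sense of the amplification difference (assumed numbers: RF=3 for base and views, 2 views, counting one replica write per message): with 1:1 pairing each of the 3 base replicas forwards one update per view, i.e. 3 + 3*2 = 9 replica writes per client write; fanning out from every base replica to every view replica would instead be 3 + 3*3*2 = 21.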

          jbellis Jonathan Ellis added a comment -

          No, you're right. Synchronous MV updates are a terrible idea, which is more obvious when considering the case of more than one MV. In the extreme case you could touch every node in the cluster.

          jbellis Jonathan Ellis added a comment -

          If there are multiple MVs being updated, do they get merged into a single set of batchlogs? (I.e. Just one on coordinator, one on each base replica, instead of one per MV.)

          slebresne Sylvain Lebresne added a comment -

          So it makes sense to have replication paired.

          Sure, I didn't imply otherwise, I just wasn't aware we were doing it.

          I do think waiting for the MV updates to be synchronous will cause a lot more timeouts and write latency (on top of what we have now). But if it's optional then people can choose.

          Frankly, I'm pretty negative on adding such an option. I think there are some basic guarantees that shouldn't be optional, and the CL ones are amongst those. Making it optional will have people shoot themselves in the foot all the time. At the very least, I would ask that we don't include such an option on this ticket (there is enough stuff to deal with) and open a separate ticket to discuss it (one on which we'd actually benchmark things before assuming there will be timeouts).

          carlyeks Carl Yeksigian added a comment -

          There will be one BL entry on the coordinator, and one on each base replica; all MV updates will be batched together on those replicas.

          jkrupan Jack Krupansky added a comment -

          multiple MVs being updated

          It would be good to get a handle on what the scalability of MVs per base table is in terms of recommended best practice. Hundreds? Thousands? A few dozen? Maybe just a handful, like 5 or 10 or a dozen?

          I hate it when a feature like this gets implemented without scalability in mind and then some poor/idiot user comes along and tries a use case which is way out of line with the implemented architecture but we provide no guidance as to what the practical limits really are (e.g., number of tables - thousands vs. hundreds.)

          It seems to me that the primary use case is for query tables, where an app might typically have a handful of queries and probably no more than a few dozen even in extreme cases.

          In any case, it would be great to be clear about the design limit for number of MVs per base table - and to make sure some testing gets done to assure that the number is practical.

          And by design limit I don't mean a hard limit where more will cause an explicit error, but where performance is considered acceptable.

          Are the MV updates occurring in parallel with each other, or are they serial? How many MVs could a base table have before the MV updates effectively become serialized?

          tupshin Tupshin Harper added a comment -

          Just a reminder (since it was a loong time ago in this ticket), that we were going to target immediate consistency once we could leverage RAMP, and not before.

          carlyeks Carl Yeksigian added a comment -

          Tupshin Harper Because we are no longer implementing this as a non-denormalized global index, we don't have multiple partitions to read, so RAMP unfortunately won't solve problems in a materialized view.

          tjake T Jake Luciani added a comment -

          Frankly, I'm pretty negative on adding such an option.

          But then why do we even offer the batchlog at all? Hand-rolled materialized views use it. And if you feel we should guarantee a consistency level, then you would never use a batchlog, since any timeout would mean you didn't achieve your consistency level and you must retry.

          If you are talking about just the UE then we could check the MV replica UP/Down status in the coordinator as well as the base.

          slebresne Sylvain Lebresne added a comment -

          The batchlog is to guarantee eventual consistency of the MV. The user CL is mostly to get the guarantee that you can see your writes. I don't understand why you'd think the former would prevent the latter.

          But anyway, I'll retract part of my previous comment. I'm not really that opposed to having the option of doing asynchronous MV updates. Rather I feel that 1) doing synchronous updates (which is equivalent to respecting the CL for the MV update too) should be the default because that's what people will expect (as exemplified by Jon and Brian's responses above) and 2) that we should discuss the addition of such an option in a followup ticket. The reason for the latter is that I think there are multiple ways to do this, and I typically don't like the idea of making it a table option at all, so I don't see the point of discussing this here.

          jbellis Jonathan Ellis added a comment -

          I disagree about making synchronous the default. As Jake points out that can kill your availability even on a single MV if you are unlucky with replica placement, and it's virtually guaranteed to kill it with many MV. I would go so far as to say that synchronous MV updates are not useful and we should not bother adding it.

          tjake T Jake Luciani added a comment -

          The batchlog is to guarantee eventual consistency of the MV. The user CL is mostly to get the guarantee that you can see your writes. I don't understand why you'd think the former would prevent the latter.

          Sylvain Lebresne the inline reply method broke

          Sorry, I was referring to the batchlog feature in general from a user perspective.

          slebresne Sylvain Lebresne added a comment -

          Well, I'm gonna guess that the absence of synchronous MV updates will surprise and make more than one user sad, but current time constraints force me to choose my battles carefully so I'm gonna leave it at that.

          rustyrazorblade Jon Haddad added a comment -

          It seems odd to me that as a user I'd ask for a certain consistency level and I would get a successful response that was essentially footnoted with "not really".

          Perhaps I was using batching incorrectly, but in my experience I found it useful to keep multiple views of my data up to date, and I was doing so at QUORUM because I needed to be strongly consistent. Not having it even as an option kills a lot of use cases.

          This next part is up for debate, but is based on the conversations and questions I've had with people at Cassandra Day. This feature is a big deal for people coming from the RDBMS world - probably more so than existing users. There's usually quite a bit of discussion around this topic. At my last talk, I brought up materialized views and people bit onto it like a dog with a bone. I feel like mandatory async is a strange caveat that would be unexpected for these people.

          iamaleksey Aleksey Yeschenko added a comment -

          As an AP database that prides itself for scalability, high availability, and decent write throughput, I don't see how we can ship with sync - a requirement that obliterates both availability and throughput - at least by default.

          Maybe there is a place for a second per-query CL parameter (the way we have with LWT queries), that would specify a CL for MV mutations (that would default to ANY), maybe not.
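
          For reference, the existing LWT precedent for a second per-query CL, as exposed by the DataStax Java driver (3.x API; the users table is just a placeholder):

            import com.datastax.driver.core.ConsistencyLevel;
            import com.datastax.driver.core.SimpleStatement;
            import com.datastax.driver.core.Statement;

            final class SerialClExample {
                static Statement example() {
                    // One CL for the normal commit phase, a second one for the Paxos phase.
                    return new SimpleStatement("UPDATE users SET name = 'x' WHERE id = 1 IF EXISTS")
                            .setConsistencyLevel(ConsistencyLevel.QUORUM)
                            .setSerialConsistencyLevel(ConsistencyLevel.SERIAL);
                }
            }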

          tupshin Tupshin Harper added a comment - - edited

          OK, so let me summarize my view of the conflicting viewpoints here

          1. If the MV shares the same partition key (and only reorders the partition based on different clustering columns), then the problem is relatively easy. Unfortunately the general consensus is that a common case will be to have different partition keys in the MV than the base table, so we can't support only that easy case.
          2. If the MV has a different partition key than the base table, then there are inherently more nodes involved in fulfilling the entire request, and we have to address that case.
          3. As T Jake Luciani and Jonathan Ellis say, the more nodes involved in a query, the higher the risk of unavailability if the MV is updated synchronously.
          4. Some use cases expect synchronous updates (as argued by Jon Haddad and Brian Hess).
          5. But other use cases definitely do not. I think it is absurd to say that just because a table has a MV, every write should care about the MV. Even more absurd to say that adding an MV to a table will reduce the availability of all writes to the base table.

          Given all of those, the conclusion that both sync and async forms are necessary seems totally unavoidable.

          Ideally, I'd like to see an extension of what Aleksey Yeschenko proposed above but be much more thorough and flexible about it.

          If each request were able to pass multiple consistency-level contracts to the coordinator, each one could represent the expectation for a separate callback at the driver level.
          e.g. A query to a table with a MV could express the following compound consistency levels.

           {LQ, LOCAL_ONE{DC3,DC4}, LQ{MV1,MV2}} 

          That would tell the coordinator to deliver three separate notifications back to the client. One when LQ in the local dc was fulfilled. Another when at least one copy was delivered to each of DC3 and DC4, and another when LQ was fulfilled in the local dc for MV1 and MV2.

          and yes, you would need more flexible syntax that could express per-dc per table consistency, e.g.

          LQ{DCs:DC3,DC4,VIEWS:MV1,MV2}

          I realize that this is a very far-fetched proposal, but I wanted to throw it out there as, imo, it reflects the theoretically best option that fulfills everybody's requirements. (and is also a very general mechanism that could be used in other scenarios).

          Short of that, I don't think there is any choice but to support both sync and async forms of writes to tables with MVs.

          One more point (not to distract from the above). With the current design of MVs, there will always be a risk of inconsistent reads (timeouts leaving data queryable in the primary table but not in one or more MVs) until the data is eventually propagated to the MV. While it would be at a high cost, RAMP would still be useful to provide read isolation in that scenario.

          benedict Benedict added a comment -

          OK, so I think we have a lot of competing goals that are not being discussed in a clearly delineated fashion:

          1. How do we ensure the MV is not corrupted?
          2. How do we maintain AP?
          3. How do we best honour user consistency level?
          4. How do we do it quickly?

          Unfortunately, goals (1) and (2) are directly opposed to each other. I hadn't originally envisaged having 20+ MV, but if that's on the cards, (1) really must reign supreme. Otherwise we will end up with a CL.LITERALLY_EVERYONE scenario.

          The problem here is that the coordinator-level batch log is almost useless, and in fact I think absolutely guaranteeing no MV corruption is nigh impossible. If the base table on a replica has been updated, but its paired MV has not, it doesn't matter what the coordinator does with replay, as the base replica will not apply any delta.

          So, I'm pretty sure we can drop the coordinator batch log. What I propose instead, is to always ensure an owner is the coordinator: if a non-owner coordinates, it just proxies it on to one (or more, if a response is slow) owning coordinators.

          Loosely speaking, it can then:

          1. write the mutation to the other base-replicas;
          2. perform its local read-before-write, and write to the local batchlog the total set of MV deltas it will apply (along with the base mutation)
          3. write to its paired replica for one of its MVs both the MV mutation and the whole batchlog of mutations, including mutations for the other base replicas
          4. once any of these respond (the other base replicas, or the MV replica), we're GTG as we have confidence that two or more base replicas will receive the mutation, and so we write to the remaining MV replicas

          The non-coordinator base-replicas just perform all of their work at once: they write their deltas (and self mutation) to the local batchlog, then write this batchlog to one of their MV replicas simultaneously to sending all of their updates. This is safe because we already know the coordinator has written this to their batchlog, and so our doing so is enough to reach QUORUM for commit.

          The main advantage of this is that we have no synchronous operations, but we still reasonably guarantee we eventually reach consistency - at least as well as we do currently (I'm not 100% familiar with how we write to the local batch log currently, but ensuring we write deltas at-once for all MVs is critical to correctness). The reason we do not need to perform any synchronous batchlog writes is that if, for whatever reason, we lose all of the nodes involved, then it does not matter that the batchlog records never made it: neither did the base mutations, nor any follow-on MV mutations. The slate is wiped clean.
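
          A very rough sketch of that flow (all names hypothetical; this is only the shape of the idea: the owner acts as coordinator, deltas go to a local batchlog, and there is no synchronous coordinator batchlog):

            import java.util.List;

            interface Replica {
                void sendBaseMutation(byte[] mutation);
                void sendViewMutationAndLog(byte[] viewDelta, byte[] fullLogEntry);
            }
            interface LocalBatchlog { byte[] append(byte[] baseMutation, byte[] allViewDeltas); }
            interface DeltaGenerator {
                byte[] forAllViews(byte[] baseMutation);
                byte[] forPairedView(byte[] baseMutation);
            }

            final class OwnerCoordinatorSketch {
                void coordinate(byte[] mutation, List<Replica> otherBaseReplicas,
                                Replica pairedViewReplica, LocalBatchlog localLog,
                                DeltaGenerator deltas) {
                    // 1. forward the mutation to the other base replicas (no coordinator batchlog)
                    for (Replica r : otherBaseReplicas) r.sendBaseMutation(mutation);
                    // 2. local read-before-write: compute every view delta this replica owes,
                    //    and log them atomically together with the base mutation
                    byte[] logEntry = localLog.append(mutation, deltas.forAllViews(mutation));
                    // 3. send the paired view replica its delta plus the whole log entry
                    pairedViewReplica.sendViewMutationAndLog(deltas.forPairedView(mutation), logEntry);
                    // 4. once any of the above acks, two nodes hold the log entry, so the
                    //    remaining view replicas can be written without further coordination
                    //    (ack handling omitted in this sketch)
                }
            }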

          I have to say, though: I'm more than a little worried about how repair factors into this equation. I kind of suspect we need to guarantee all of these logs are empty before we run repair, or we need to produce deltas on receipt of repaired data. This is true of all of the outlined approaches.

          rustyrazorblade Jon Haddad added a comment -

          Personally I think it would be fine as a setting on the view itself.

          jbellis Jonathan Ellis added a comment - - edited

          I'm still skeptical that sync MV updates are useful. I get that people would like to have them, but it's a CAP problem. You only get sync updates if you are willing to throw out availability very very quickly. Async replication to MV may be surprising, but UnavailableException for CL.ONE writes when a single machine goes down is more surprising.

          (FWIW I have consistently told people that our design goal is for MV to be eventually consistent with the base data and everyone has been fine with that.)

          So: we can keep sync as an option for the future, but for 3.0 we should not scope creep this. We have designed it as async which is the right thing for 90% of use cases and that is what we should ship.

          jshook Jonathan Shook added a comment - - edited

          If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic:

          • Logged batches are not commonly used to wrap a primary table with its query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case.
          • The same CL is generally used for all related tables.
          • Savvy users will do this with async with the same CL for all of these operations.

          So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort from the end user to maintain. I would like us to consider that where the implementation varies from this, there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it, because they are essentially doing this manually at present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults.

          It doesn't really seem necessary that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. As well, it does add potentially another hop, which I do have concerns about with respect to the above.

          rustyrazorblade Jon Haddad added a comment -

          May I then suggest that we add that as a clause to the CREATE MATERIALIZED VIEW statement, so it's 1) explicit and 2) is future proofed?

          brianmhess Brian Hess added a comment -

          Let's go back to the basic use case that this is supposed to replace/help/make better. The case where we want two query tables for the same data. That is, they have the same primary keys, but different partition keys (and clustering column orders).

          Today, I would do this by having a logged batch for the insert and that batch would insert into each of the two query tables. With this I get some data consistency guarantees. For example, if the client returns "success", I know that both of the inserts were accepted at the desired consistency level. So, if I did 2 writes at CL_QUORUM, and I receive a "success", then I know I can then do a CL_QUORUM read of either table and see the most recent data.

          However, with this "asynchronous" MV approach, I no longer get this behavior. I write to the base table at CL_QUORUM and get the "success" return. At that point, I can do a CL_QUORUM read from the base table and see the most recent insert. However, if I do a CL_QUORUM read from the MV, I have no guarantees at all.

          This approach does not address the basic situation that we are trying to cover.
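
          Concretely, the hand-rolled pattern being described looks roughly like this today (assuming the DataStax Java driver 3.x and two hypothetical query tables, users_by_id and users_by_email, that share a primary key but are partitioned differently):

            import com.datastax.driver.core.BatchStatement;
            import com.datastax.driver.core.ConsistencyLevel;
            import com.datastax.driver.core.Session;
            import com.datastax.driver.core.SimpleStatement;

            final class HandRolledQueryTables {
                static void insert(Session session, int id, String email, String name) {
                    BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
                    batch.add(new SimpleStatement(
                            "INSERT INTO users_by_id (id, email, name) VALUES (?, ?, ?)", id, email, name));
                    batch.add(new SimpleStatement(
                            "INSERT INTO users_by_email (email, id, name) VALUES (?, ?, ?)", email, id, name));
                    batch.setConsistencyLevel(ConsistencyLevel.QUORUM); // success => a QUORUM read sees it
                    session.execute(batch);
                }
            }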

          benedict Benedict added a comment -

          As well, it does add potentially another hop,

          No, it removes a hop. Several hops: Currently we write to the batch log synchronously before we do anything else. In fact we do it twice. These are all double-hops. I'm proposing removing all of them, by exploiting the fact that typically the coordinator is an owner, and so as soon as two owners have the mutations in their local batch logs we're about as good as we'll get that these mutations will be applied, and if they aren't going to be applied they also won't partially apply.

          To make a more general statement: I think there is perhaps a failure to appreciate the difficulty of the problem Sylvain was raising, around ensuring we do not end up with a corrupted MV that we have no idea how to repair. In fact, I very much doubt we are going to achieve this no matter how hard we try, and this is going to end up making us look pretty bad. If a user fails to keep their views in sync, that's one thing, but if we do it's another. I'm talking about even eventually consistent here. There are a lot of nook and cranny failure scenarios and that's before we consider repair. Which, as I say, I suspect completely ruins us. Ironically, of course

          jbellis Jonathan Ellis added a comment -

          That's because that's the special case where the user knows there is no earlier value to delete out of the MV. We can't assume that here until we have CASSANDRA-9779.

          If you're hand-rolling an MV that needs to deal with updated values you'll hit the exact same issues we're discussing here.

          That is if you think it through enough. I would guess that most people do not.

          jbellis Jonathan Ellis added a comment -

          No. If we do add a sync option it will not be the default so there are no forward compatibility issues.

          jshook Jonathan Shook added a comment -

          This goes directly to my point. It would be ideal if we simply allow users to simplify what they are already doing, with the least amount of "special" handling we can add to the mix. In terms of solving the problem in a way that users understand, we must strive to compose a solution from the already established primitives that we teach users about all the time. Any failure modes should be explained in those terms as well. Other approaches are likely to create more "special cases", which I think we all can agree are not good for anybody.

          jshook Jonathan Shook added a comment -

          The comment about adding a hop was with respect to what users would currently be doing to maintain multiple views of data. They don't expect that there is a proxy for their writes, no matter whether they are using async or not, batches or not.

          slebresne Sylvain Lebresne added a comment -

          We have designed it as async

          Unless I've missed some part of how "we designed it", that's actually the opposite. In fact, the reasoning I expressed in my previous comment, which to the best of my knowledge is the reasoning behind the design currently implemented, was relying on sync MV updates for correctness.

          I say "was" because all this probably doesn't really matter since Benedict has a point: with or without sync MV updates, there are holes in this design. Both because, to repeat what he says, replaying mutations from the coordinator batch log doesn't work if the mutation was applied on the base replica but not the paired MV, but also because repair can screw us up by propagating base-table updates behind our backs without going through the MV update code path. In other words, the design as implemented is not eventually consistent (and in the case of the repair problem, you don't even need any failure to get permanent inconsistencies, just unlucky timing).

          And I don't see an easy tweak to the implemented design that would fix both problems. Benedict's suggestion above of "batch-logging deltas" might work for the first problem, but we'd need to look at the details and convince ourselves it does work in all cases before even thinking of implementing it (at which point it will unlikely be a minor change to the current patch), and that still leaves the repair problem to solve. And so it kind of puts us back to the drawing board, and I think ideas around using cell reconciliation to generate fixes to MVs (as was initially considered) would also be worth reconsidering (I'm not saying I have anything close to a complete design in mind, and there is for sure a number of problems to solve in that case too, but it does feel like it would handle the repair problem natively if we can make it work, which is a plus).

          Anyway, I would humbly suggest we focus on that problem first, because unless we're willing to release MVs with "hopeful consistency" (which, to put it mildly, I'm not a fan), this might well compromise a 3.0 release.

          benedict Benedict added a comment - - edited

          +Inf to what Sylvain just said. We should definitely step back and reassess all the options.

          Now, I think it is probably sufficient for correctness to perform our delta generation on reconciliation of repaired with unrepaired data. Or, more simply: to treat all repaired data we receive as needing delta-generation performed on it. I don't think at first blush this is more or less onerous than would be required for any other approach, but I may be wrong. This definitely needs a lot of thought.

          tjake T Jake Luciani added a comment -

          If the base table on a replica has been updated, but its paired MV has not, it doesn't matter what the coordinator does with replay, as the base replica will not apply any delta.

          I think you two are missing something about the current implementation. This problem of updating the base but not the MV is why we added the second local batchlog for the Base to MV mutations.

          When the mutation comes into the base replica:

          • Get lock for partition
          • Generate the delta mutations for the paired view replica
          • Update the base replica commit log
          • Write the delta mutations for the paired MV replica to the local base BL
          • Fire the mutations to the paired MV replica async
          • Update the base memtable

          We know the data will eventually get to the paired MV because of the batchlog. That's why we can send the MV updates async.
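
          A sketch of those steps in code (hypothetical names only, not the actual storage-engine classes):

            import java.util.List;
            import java.util.concurrent.locks.Lock;

            interface ViewDeltaSource   { List<byte[]> readBeforeWrite(byte[] baseMutation); }
            interface LocalBatchlog     { void add(List<byte[]> viewDeltas); }
            interface CommitLog         { void add(byte[] mutation); }
            interface Memtable          { void apply(byte[] mutation); }
            interface PairedViewReplica { void sendAsync(List<byte[]> viewDeltas); }

            final class BaseReplicaApplySketch {
                void apply(byte[] baseMutation, Lock partitionLock, ViewDeltaSource views,
                           LocalBatchlog localLog, CommitLog commitLog, Memtable memtable,
                           PairedViewReplica paired) {
                    partitionLock.lock();                                      // serialize per partition
                    try {
                        List<byte[]> viewDeltas = views.readBeforeWrite(baseMutation); // generate deltas
                        commitLog.add(baseMutation);                           // base update is durable
                        localLog.add(viewDeltas);                              // deltas are durable locally
                        paired.sendAsync(viewDeltas);                          // fire to paired MV replica, async
                        memtable.apply(baseMutation);                          // finally apply to the base memtable
                    } finally {
                        partitionLock.unlock();
                    }
                }
            }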

          ------

          tjake T Jake Luciani added a comment -

          The repair bypassing the mutation stage is something we need to address with this patch. But it makes me think we need CASSANDRA-8911

          benedict Benedict added a comment -

          To clarify: my comment was meant to address both correctness and performance. From a correctness point of view it is approximately equivalent to what we have now, it just merges the batchlog work with other work to remove synchronous round trips.

          benedict Benedict added a comment -

          It looks like mutation based repair might well solve the repair problem. Although we might have to disable repair for MV replicas.

          Question is: can we wait for it?
