CASSANDRA-9928

Add Support for multiple non-primary key columns in Materialized View primary keys

      Description

      Currently we don't allow more than one non-primary-key column from the base table in an MV primary key. We should remove this restriction, assuming we continue filtering out nulls. Allowing nulls in the MV key columns would have a lot of multiplicative implications that we need to think through.
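
      For illustration, a minimal CQL sketch of the current restriction (table, view, and column names are invented for this example):

          CREATE TABLE base (p int PRIMARY KEY, a int, b int);

          -- Allowed today: a single non-PK column (a) in the view's primary key.
          CREATE MATERIALIZED VIEW mv_a AS
              SELECT * FROM base
              WHERE p IS NOT NULL AND a IS NOT NULL
              PRIMARY KEY (a, p);

          -- Rejected today: two non-PK columns (a, b) in the view's primary key.
          CREATE MATERIALIZED VIEW mv_ab AS
              SELECT * FROM base
              WHERE p IS NOT NULL AND a IS NOT NULL AND b IS NOT NULL
              PRIMARY KEY ((a, b), p);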


          Activity

          Carl Yeksigian added a comment -

          Benedict brought up some potential issues in a comment on CASSANDRA-6477:

          As far as multiple columns are concerned: I think we may need to go back to the drawing board there. It's actually really easy to demonstrate the cluster getting into broken states. Say you have three columns, A B C, and you send three competing updates a b c to their respective columns; previously all held the value _. If they arrive in different orders on each base-replica we can end up with 6 different MV states around the cluster. If any base replica dies, you don't know which of those 6 intermediate states were taken (and probably replicated) by its MV replicas. This problem grows exponentially as you add "competing" updates (which, given split brain, can compete over arbitrarily long intervals).

          This is where my concern about a "single (base) node" dependency comes in, but after consideration it's clear that with a single column this problem is avoided because it's never ambiguous what the old state was. If you encounter a mutation that is shadowed by your current data, you can always issue a delete for the correct prior state. With multiple columns that is no longer possible.

          I'm pretty sure the presence of multiple columns introduces other issues with each of the other moving parts.
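
          (For concreteness, our reading of Benedict's count: starting from (A, B, C) = (_, _, _), the three single-column updates a, b, c can be applied in any of 3! = 6 orders, and the distinct intermediate states are (a, _, _), (_, b, _), (_, _, c) after one update and (a, b, _), (a, _, c), (_, b, c) after two. Since these columns would be MV key columns, each intermediate state is a distinct MV row that a base replica may have pushed to its view replicas before dying, without the matching delete.)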

          When we implement this feature, we should make sure to also add jepsen tests for the possible problems.

          T Jake Luciani added a comment -

          This scenario where 3 nodes won't see each other's updates can't happen if we use the coordinator batchlog, since we guarantee at least a quorum of nodes will see the updates. Mentioning this for CASSANDRA-10230.

          Joel Knighton added a comment -

          Guaranteeing a quorum of nodes will see the updates does not solve the problem because supporting multiple non-primary key columns in the materialized view primary key introduces a sensitivity to the ordering of updates to these non-primary key columns.

          I think this is the simplest version of Benedict's example. Envision a cluster with a table with primary key P and columns A and B. Presently, all replicas contain an entry for P=1, A=1, B=1.

          Two concurrent updates are occurring - one setting A=2, and one setting B=2. One replica receives the update B=2, removes the MV entry for P=1, A=1, B=1, creates an MV entry for P=1, A=1, B=2, and then crashes with data loss. The remainder of the base replicas receive the update A=2; remove the MV entry for P=1, A=1, B=1; create an MV entry for P=1, A=2, B=1; receive the update B=2; remove the MV entry for P=1, A=2, B=1; and create an MV entry for P=1, A=2, B=2.

          When the base replica that lost data is repaired from the remaining base replicas, a delete for the entry P=1, A=1, B=2 on its paired view replica will never be generated.
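
          A minimal CQL rendering of this scenario (our sketch; the two-column view is hypothetical, since it is exactly what this ticket proposes to allow):

              CREATE TABLE base (p int PRIMARY KEY, a int, b int);

              -- Hypothetical: a view keyed on two non-PK columns, as proposed here.
              CREATE MATERIALIZED VIEW mv AS
                  SELECT * FROM base
                  WHERE p IS NOT NULL AND a IS NOT NULL AND b IS NOT NULL
                  PRIMARY KEY ((a, b), p);

              -- All replicas start with base row (p=1, a=1, b=1), view row ((1, 1), 1).
              UPDATE base SET a = 2 WHERE p = 1;  -- concurrent write 1
              UPDATE base SET b = 2 WHERE p = 1;  -- concurrent write 2
              -- A replica that applies only b=2 materializes view row ((1, 2), 1),
              -- then dies. The others apply a=2 then b=2 and end at ((2, 2), 1);
              -- no surviving replica ever saw ((1, 2), 1), so none of them can
              -- issue the tombstone that repair would need to remove it.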

          T Jake Luciani added a comment -

          of course! thx.

          Tyler Hobbs added a comment - edited

          One possible solution is to require that all non-PK columns that are in a view PK be updated simultaneously. T Jake Luciani mentioned possible problems from read repair, but it seems like with this restriction in place, any read repairs would end up repairing all non-PK columns at once.

          This would also solve (or avoid) CASSANDRA-10226.
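
          Under this restriction, a write that touches any of the view's non-PK key columns would have to set all of them in one statement, e.g. (illustrative):

              -- Accepted: both view key columns a and b are set together.
              UPDATE base SET a = 2, b = 2 WHERE p = 1;

              -- Rejected under the proposed restriction: only a is set.
              UPDATE base SET a = 2 WHERE p = 1;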

          Aleksey Yeschenko added a comment -

          Matthias Broecheler Is the limitation outlined by Tyler above still compatible with the use case you have in mind?

          Matthias Broecheler added a comment -

          Requiring that all non-PK columns that are indexed be updated at the same time is too much of a limitation and will be really hard for users to understand imho.

          Instead, I would propose that we introduce a rate-of-change limit on the columns that participate in an MV. If we can guarantee that there is a limit on the number of changes to those columns, then we can limit the number of distinct state permutations that may need to be considered in the scenarios above. In those cases, we would simply enumerate them and then delete all possible old MV states. This sounds expensive, but it would only be expensive when changes happen in fast succession or under extraordinary operational conditions - i.e. the cost should amortize well.

          As for the rate limit, it seems that this would be a rather arbitrary limitation, but if somebody changes their MV columns in rapid succession then they are pursuing an anti-pattern, and throwing an exception would be a better response than degrading system performance.

          T Jake Luciani added a comment -

          If we can guarantee that there is a limit on the number of changes to those columns, then we can limit the number of distinct state permutations that may need to be considered in the scenarios above.

          How do you propose we limit changes across the cluster and DCs? Tyler's suggestion is easy to guarantee without introducing some global rate limiting.

          Donovan Hsieh added a comment -

          Whatever the technical issues associated with the race conditions stated above, limiting a view to just one non-PK column, imho, makes MVs seriously handicapped. If this limitation is not removed, I can't see any serious real-world application using MVs effectively.

          Ariel Scarpinelli added a comment - edited

          Why not let people decide?
          If you implement Tyler Hobbs' solution, then people simply get warned: "non-PK columns participating in MV PKs will need to be updated together". It then becomes the user's responsibility to choose whether they prefer to be tied to that restriction, or to use a single column (which is tied to that restriction anyway, but since you always update columns in a minimum set of 1 ... :-D). The way it is currently implemented, you leave people no choice but to fabricate a "fake" column that concatenates values or so... which effectively translates into having to update in tandem anyway, while also adding complexity and repeated data.

          Even better, you can start with the restriction as a stepping-stone solution, and then relax it with the other implementation later. The current decision has simply been to leave this unattended for over a year, I would guess while discussing how to implement it.

          craig mcmillan added a comment -

          Currently I achieve this function by manually concatenating the extra keys I want in the MV into a single text key. It's roughly workable, but timeuuids can no longer be used to provide ordering, since they don't sort lexically.

          Tyler Hobbs' solution would formalize and improve upon what I, and presumably many others, are already having to do?
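
          A sketch of that concatenation workaround (our reconstruction; the schema and names are invented):

              CREATE TABLE events (
                  day text,
                  id uuid,
                  category text,
                  created timeuuid,
                  cat_created text,  -- app maintains category + ':' + created
                  PRIMARY KEY (day, id)
              );

              -- Only one non-PK column (cat_created) appears in the view key,
              -- so this passes the current restriction.
              CREATE MATERIALIZED VIEW events_by_cat_created AS
                  SELECT * FROM events
                  WHERE day IS NOT NULL AND id IS NOT NULL AND cat_created IS NOT NULL
                  PRIMARY KEY (day, cat_created, id);

              -- Within a partition, rows now cluster by the text value of
              -- cat_created, which compares lexically; the time-based ordering
              -- embedded in the timeuuid is lost.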

          mck added a comment -

          Bumping to fix version 4.x, as 3.11.0 is a bug-fix only release.
            ref https://s.apache.org/EHBy

          Fridtjof Sander added a comment -

          I seem to be the only one who doesn't understand where the actual difference from the single-column case is:
          Consider (p=1, a=1) with an index-MV on a and two updates a=2 and a=3. One base-replica receives a=2, deletes view entry (a=1, p=1) and inserts (a=2, p=1), then dies. Other base-replicas get a=3, delete (a=1, p=1) and insert (a=3, p=1). Now, how is (a=2, p=1) removed from the view replica that was paired with the dying base-node? I don't get what's different here. Or does my analog case miss the point?


            People

              Assignee: Unassigned
              Reporter: T Jake Luciani
              Votes: 4
              Watchers: 27
