Details

      Description

SSTables streamed during the repair process will first be written locally and afterwards either simply added to the pool of existing sstables or, in case of existing MVs or active CDC, replayed on a per-mutation basis:

      As described in StreamReceiveTask.OnCompletionRunnable:

      We have a special path for views and for CDC.

      For views, since the view requires cleaning up any pre-existing state, we must put all partitions through the same write path as normal mutations. This also ensures any 2is are also updated.

      For CDC-enabled tables, we want to ensure that the mutations are run through the CommitLog so they can be archived by the CDC process on discard.
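A minimal sketch of that branching on stream completion, using hypothetical helper names (hasMaterializedViews, isCdcEnabled, partitionsOf and toMutation are placeholders, not actual Cassandra APIs), could look like this:

// Hedged sketch only -- the real logic lives in StreamReceiveTask.OnCompletionRunnable.
void onStreamCompleted(ColumnFamilyStore cfs, Collection<SSTableReader> received)
{
    boolean requiresWritePath = hasMaterializedViews(cfs) || isCdcEnabled(cfs);

    if (!requiresWritePath)
    {
        // Fast path: the sstables join the live set as-is, keeping their metadata
        // (including repairedAt when they were streamed by an incremental repair).
        cfs.addSSTables(received);
        return;
    }

    // Slow path: every partition is replayed as a normal mutation so that view updates,
    // secondary indexes and the commit log/CDC are maintained. The repairedAt state of
    // the streamed data is lost on this path.
    for (SSTableReader sstable : received)
        for (Partition partition : partitionsOf(sstable))
            toMutation(partition).apply();
}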

Using the regular write path turns out to be an issue for incremental repairs, as we lose the repaired_at state in the process. Eventually the streamed rows will end up in the unrepaired set, in contrast to the rows on the sender side, which have been moved to the repaired set. The next repair run will stream the same data back again, causing rows to bounce back and forth between nodes on each repair.

See the linked dtest for steps to reproduce. An example of reproducing this manually using ccm can be found here.

          Activity

          pauloricardomg Paulo Motta added a comment -

The quickest (but a bit dirty) fix here would be to skip anti-compaction altogether when there is a mismatch for MV tables, forcing data to be re-compared at the next repair - and if they match, do anti-compaction.

The proper solution is to segregate repaired from unrepaired data in the memtable, and flush them to separate repaired/unrepaired sstables, but this would probably be a bit more involved.

          bdeggleston Blake Eggleston added a comment -

          skip anti-compaction altogether when there is a mismatch for MV tables

          I'm not sure how effective that would be in practice. In an active cluster, I'd expect the race between in flight mutations and flushes to usually result in at least a little bit of streaming.

          pauloricardomg Paulo Motta added a comment -

          I'm not sure how effective that would be in practice. In an active cluster, I'd expect the race between in flight mutations and flushes to usually result in at least a little bit of streaming.

Good point! How about keeping the streamed sstables, and having a special mutation.apply path that only writes to the commit log/CDC and applies MVs, while skipping application of the base table mutations? That seems simpler than keeping repaired state at the memtable, unless there are caveats I'm missing.
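A rough sketch of such an apply variant, assuming hypothetical helpers (streamedRepairApply, hasViews and applyViewUpdatesOnly are not existing Cassandra APIs), might look like this:

// Hedged sketch only: feed the commit log (so CDC can archive the data) and apply the
// MV updates, but never touch the base table memtable, because the streamed sstable
// itself is kept with its repairedAt metadata intact.
void streamedRepairApply(Mutation mutation, Keyspace keyspace)
{
    CommitLog.instance.add(mutation);                  // commit log / CDC still sees the data

    for (PartitionUpdate update : mutation.getPartitionUpdates())
    {
        if (hasViews(keyspace, update))                // assumption: per-table "has views" check
            applyViewUpdatesOnly(keyspace, update);    // hypothetical: read-before-write + MV mutations only
        // The base table update is intentionally skipped.
    }
}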

          spodxx@gmail.com Stefan Podkowinski added a comment -

What also puzzles me is that MVs are getting repaired just as usual tables would when running repair for a single keyspace. Once CFs have been retrieved for a KS, no filtering for MVs seems to happen and they'll be repaired just like regular tables. I'm not sure if that's intentional. There should really be just a single way to repair MVs: either rely on keeping them consistent with the base table and exclusively follow the mutation-based approach described in this ticket, or treat MVs just as regular tables during repairs. However, in the second case we'd have to make sure to always include all MVs of repaired tables in a repair session, in order to keep them consistent.

          tjake T Jake Luciani added a comment -

Once CFs have been retrieved for a KS, no filtering for MVs seems to happen and they'll be repaired just like regular tables. I'm not sure if that's intentional.

          This was intentional, though you really shouldn't repair the whole KS with MVs in it, but rather just the base tables. You can repair only the views in the case of a corrupt/lost sstable in the view since it would be faster (as it wouldn't go through the memtable).

          bdeggleston Blake Eggleston added a comment -

How about keeping the streamed sstables, and having a special mutation.apply path that only writes to the commit log/CDC and applies MVs

          That does seem like it would be the simplest way to do it. You'd just need to skip this loop. When those commit log entries get replayed on startup though, you'll have repaired data leak into unrepaired (or at least be duplicated there), causing data to be re-streamed on the next repair. You could avoid that with a 'cdc-only' flag on the commit log entry though.

          brstgt Benjamin Roth added a comment -

https://github.com/Jaumo/cassandra/tree/12905-3.0 (see 7ab84f86fbafeb02e1ed7fd75561c7e598e10a1d)
+
https://github.com/Jaumo/cassandra/tree/12905-3.X (see 7ab84f86fbafeb02e1ed7fd75561c7e598e10a1d)

The previous commit belongs to CASSANDRA-12905, but CASSANDRA-12888 depends on it.

          brstgt Benjamin Roth added a comment -

          Let me explain my patch:

Sending streams of tables with MVs (or CDC) through the regular write path has several major negative impacts:

          1. Bootstrap
          During a bootstrap, all ranges from all KS and all CFs that will belong to
          the new node will be streamed. MVs are treated like all other CFs and all
          ranges that will move to the new node will also be streamed during
          bootstrap.
          Sending streams of the base tables through the write path will have the
          following negative impacts:

          • Writes are sent to the commit log. Not necessary. When node is stopped
            during bootstrap, bootstrap will simply start over. No need to recover from
            commit logs. Non-MV tables won't have a CL anyway
• MV mutations will not be applied instantly but sent to the batch log.
            This is of course necessary during the range movement (if PK of MV differs
            from base table) but what happens: The batchlog will be completely flooded.
            This leads to ridiculously large batchlogs (I observed BLs with 60GB
            size), zillions of compactions and quadrillions of tombstones. This is a
            pure resource killer, especially because BL uses a CF as a queue.
          • Applying every mutation separately causes read-before-writes during MV
            mutation. This is of course an order of magnitude slower than simply
            streaming down an SSTable. This effect becomes even worse while bootstrap
            progresses and creates more and more (uncompacted) SSTables. Many of them
won't ever be compacted because the batchlog eats all the resources
            available for compaction
          • Streaming down the MV tables AND applying the mutations of the
base tables leads to redundant writes. Redundant writes are local if the PK of
            the MV == PK of the base table and - even worse - remote if not. Remote MV
            updates will impact nodes that aren't even part of the bootstrap.
• CDC should also not be necessary during bootstrap. A bootstrap is not a data change; it is a data relocation, and all data changes must already have been logged on the source node.

          2. Repair
The negative impact is similar to bootstrap, but ...

• Sending repairs through the write path will not mark the streamed tables
  as repaired. NOT doing so would instantly solve this issue, and much more
  simply than any other solution
• It will change the "repair design" a bit. Repairing a base table will
  not automatically repair the MV. But is this bad at all? To be honest, it was very hard for me to understand what I had to do to be sure
  that everything is repaired correctly. Recently I was told NOT to repair MV
  CFs but only to repair the base tables (see comment above from T Jake Luciani). This means one cannot just call
  "nodetool repair $keyspace" - this is complicated, not transparent and it
  sucks. I changed the behaviour in my own branch and ran the dtests for
  MVs. Two tests failed:
          • base_replica_repair_test of course fails due to the design change
• really_complex_repair_test fails because it intentionally times out
  the batch log. IMHO this is a bearable situation. It is comparable to
  resurrected tombstones when running a repair after GCGS has expired. You also
  would not expect this to be magically fixed. The GCGS default is 10 days and I
  can expect that anybody also repairs their MVs during that period, not only
  the base table. I'd suggest simply deleting these 2 tests. They prove nothing any more.

3. Rebuild + Decommission
Similar impacts as bootstrap + repair.

I rolled out these changes on our production cluster (including CASSANDRA-12905 + CASSANDRA-12984).
Before, I was not able to bootstrap a node with a load of roughly 280GB. Either it failed due to WTE (see 12905), or it completely flooded the logs with hint delivery failures (also fixed in 12905); and after having fixed that, the bootstrap didn't even finish within 24h, which is why I cancelled it.
After applying the mentioned changes, the bootstrap finished in under 5:30h. Repairs also seem to run quite smoothly so far, even though this does not fix CASSANDRA-12730, which is a different story.

Any thoughts on this?
Would anybody like to review it?

          brstgt Benjamin Roth added a comment -

          IMHO this is purely a matter of definition.

• No one says that a base table and its MVs have to be consistent with each other at all times.
• Cassandra generally promises eventual consistency, and that's how it should be with base tables and MVs.
• I MUST always repair my data before GCGS expires, so I have to repair base tables AND MVs. Whether I do it in one run or separately, the data will be consistent in the end.
• If you need absolutely consistent data, you need CL_QUORUM (R/W) or CL_ALL (R), no matter if you are querying a base table or an MV. And if you don't, it really does not matter whether it is your base table or your MV that is inconsistent.

To sum up:

• Treating them as regular tables solves a LOT of issues
• Increases transparency by applying the same principles to MVs and base tables
• Reduces special cases in code

I see more advantages than disadvantages.
          brstgt Benjamin Roth added a comment -

          Another fact of the day:

A usual repair (no outage or failures) of a KS with 2 tables and 1 MV was reduced from 11:15h to 6:15h by that patch.

          brstgt Benjamin Roth added a comment -

          Another example:

A repair of a KS with approx 1.7TB total on 8 nodes, with 7 base tables and 4 MVs, increased from roughly 18:30h to 23:30h. I would explain it like this:
This patch causes more validation work, as it now also validates MVs, not only base tables. Depending on the ratio of base tables to MVs and the extent of detected inconsistencies and repair streams, it is possible that this patch performs worse than before.
From my point of view it is still ok, because it will never perform worse than if all the MVs were normal tables. But if there are a lot of streams, e.g. due to a node failure recovery, this patch will perform much, much better than before.

          pauloricardomg Paulo Motta added a comment -

Thanks for your investigation Benjamin. As discussed on the mailing list and CASSANDRA-12905, we cannot ensure view consistency with sstable-based streaming of base/MVs in all scenarios, especially with repair, so I'm afraid this is not a viable solution just yet (it needs to be discussed further). An alternative is to provide an option for repair to allow repairing base and MV separately when you know what you are doing©.

While we work on improving MV streaming overhead in a separate ticket, I'd like to focus here on fixing the problem with MV/CDC repair stated in the description. What do you think of the previous proposal of just skipping the write path for base table mutations and keeping the sstables? While this is still a bit expensive, it will allow users to incrementally repair moderately-sized MVs, especially after CASSANDRA-12905.

          Your observations from large scale MV streaming in the context of bootstrap will be pretty useful, would you mind adding them to the follow-up streaming improvement ticket for MVs?

          By the way, congrats for the baby!

          brstgt Benjamin Roth added a comment -

Hi Paulo,

          Thanks for the congrats!

          About your proposal to skip the base table mutations:
I haven't analyzed it thoroughly (no time, you know), but my intuition says that there will be race conditions and possible inconsistencies if you "pick" the base table mutation out of the lock phase. I guess that to ensure base table <> view replica consistency you'd have to lock the whole CF while streaming a single SSTable, so that the MV mutations are processed serially and no other base table mutations slip in from the mutation stage and mess with the consistency.
As far as I can see, base table apply, base read and MV mutations MUST be serialized (actually, that's why there's a lock). Otherwise you will have stale MV rows again. This is why I think this proposal won't work. Or did I miss the point?

          CDC:
This case should be quite simple. I think you don't need the write path at all and just have to write the incoming mutations to the commit log in addition to streaming the sstable. In the worst case (server crash), commit log replay leads to redundant and unrepaired entries, but this should be a rare and recoverable situation.

          What do you think?

          vroldanbetan victor added a comment -

Hi guys, would you discourage using MVs until this is fixed?

          brstgt Benjamin Roth added a comment -

It depends; there are known issues, mostly related to repair and
streaming. MVs basically work and do what you expect of them, but
maintenance jobs may be slow and/or painful. So the good old saying is
true: you can use them if you understand them and know what you are doing.
But don't expect them to be plug-and-play.

          vroldanbetan victor added a comment -

          Hey Benjamin,

          thanks for your feedback. At least you refer to "slow" and "painful", but not "massive failure / data loss", right?

Still, the degree of complexity involved is not clear to me. Is it more complex than handling "manually denormalized tables"? How much more, and in which aspects, if I may ask? I'm not a Cassandra expert, so I can't say I have a deeper understanding of tables than of MVs. From what I interpret of your comment, MVs seem to be at least more complex to operate in a production cluster than normal tables. From a developer perspective, they are very appealing and remove some burden from the code.

          Are you aware of any description on how to deal with these "painful" scenarios?

          Thanks a lot!

          brstgt Benjamin Roth added a comment - - edited

          Hi Victor,

          We use MVs in Production with billions of records without known data loss.
          Painful + slow refers to repairs and range movements (e.g. bootstrap +
decommission). Also (as mentioned in this ticket) incremental repairs don't
          work, so full repair creates some overhead. Until 3.10 there are bugs
          leading to write timeouts, even to NPEs and completely blocked mutation
          stages. This could even bring your cluster down. In 3.10 some issues have
          been resolved - actually we use a patched trunk version which is 1-2 months
          old.

          Depending on your model, MVs can help a lot from a developer perspective.
          Some cases are very resource intensive to manage without MVs, requiring
          distributed locks and/or CAS.
For append-only workloads, it may be simpler NOT to use MVs at the moment.
Such workloads aren't very complex anyway, and MVs won't help that much compared to the
problems that may arise with them.

          Painful scenarios: There is no recipe for that. You may or may not
          encounter performance issues, depending on your model and your workload.
          I'd recommend not to use MVs that use a different partition key on the MV
          than on the base table as this requires inter-node communication for EVERY
          write operation. So you can easily kill your cluster with bulk operations
          (like in streaming).

          At the moment our cluster runs stable but it took months to find all the
          bottlenecks, race conditions, resume from failures and so on. So my
recommendation: You can get it to work, but you need time and you should not
          start with critical data, at least if it is not backed by another stable
          storage. And you should use 3.10 when it is finally released or build your
own version from trunk. I would not recommend using < 3.10 for MVs.

          Btw.: Our own patched version does some dirty tricks, that may lead to
          inconsistencies in some situations but we prefer some possible
          inconsistencies (we can deal with) over performance bottlenecks. I created
          several tickets to improve MV performance in some streaming situations but
          it will take some time to really improve that situation.

          Does this answer your question?

          vroldanbetan victor added a comment -

          Hi Benjamin,

thanks for the overall awesomeness! Your response is very helpful!

Our use-case exploits the atomicity of MVs in a non-append-only scenario. Aside from less code to write in our application, it allows us to skip multi-partition batches for achieving atomicity. It's still not 100% clear to me, but the cluster seems to be exposed to less stress on atomicity/denormalization with MVs than with multi-partition batches (at least DataStax indicates there are performance gains compared with a manual denormalization scenario, not even counting manual denormalization with batches; see http://www.datastax.com/dev/blog/materialized-view-performance-in-cassandra-3-x). Is this assumption correct?

          >I'd recommend not to use MVs that use a different partition key on the MV
          >than on the base table as this requires inter-node communication for EVERY
          >write operation. So you can easily kill your cluster with bulk operations
          >(like in streaming).

          Excuse my ignorance, but isn't having a different partition key the point of denormalization on MVs (to have different read paths)? Would this node coordination be worse or the same on a multi-partition batch scenario?

          Given our system stores critical information, we've decided to skip MVs altogether until the feature becomes more "ops friendly" on production.

          Thanks a lot!
          Víctor

          spodxx@gmail.com Stefan Podkowinski added a comment -

Can you please take this kind of discussion to user@cassandra.apache.org? This is an issue tracker, and you're email-spamming everyone who subscribed to the ticket to get updates on the actual ticket progress. Thank you.

          brstgt Benjamin Roth added a comment -

          Hi Victor,

          1. Performance:
Performance can be better with MVs than with batches, but this depends on the read performance of the base table vs. the performance overhead of batches, which in turn depends on the batch size and the batchlog performance. An MV always incurs a read before write, so MV performance depends heavily on how that read performs. The final write operation of the MV update is fast, as it works like a regular (local) write.

          2. Partition Keys and remote MV updates
You are of course right that this may be a common use case. You have to use it carefully. Maybe the situation has already improved with some bugfixes; the last time I tried was some months ago. To be fair, I have to mention that back then there was a bug with a race condition that could deadlock the whole mutation stage. With "remote MVs" we ran into this situation very frequently during bootstraps (for example). This has to do with MV locks and probably the much longer lock time when the MV update is remote, leading to more lock contention. With remote MV updates, the current write request also depends on the performance of remote nodes. This can lead to write timeouts much sooner, as long as the (remote) MV update is part of the write request and not deferred. So again: maybe this situation has improved in the meantime, but I personally didn't require it, so I was able to use normal tables to "twist" the PK. We currently use MVs only to add a field to the primary key for sorting.

          brstgt Benjamin Roth added a comment - - edited

          I am about to hack a proof of concept for that issue.

          Concept:
Each mutation and each partition update has a "repairedAt" flag. This will be passed along through the whole write path, like MV updates and serialization for remote MV updates. Repaired and non-repaired mutations then have to be separated in memtables and flushed to separate SSTables. From what I can see, it should be easier to maintain one memtable each for repaired and non-repaired data than to track the repair state within a single memtable.
Passing the repair state to replicas isn't even necessary, as replicas should not be repaired directly anyway, so there is no need for a repairedAt state there.
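A rough sketch of that routing, assuming a hypothetical repairedAt carried on the update and a hypothetical second memtable per table (none of these helpers exist in the current code):

// Hedged sketch only -- repairedAt(), getOrCreateRepairedMemtable(), getUnrepairedMemtable()
// and put() are placeholders for the concept described above.
void applyToMemtable(PartitionUpdate update, ColumnFamilyStore cfs)
{
    boolean repaired = repairedAt(update) != 0;         // assumption: repairedAt travels with the update
    Memtable target = repaired
                    ? getOrCreateRepairedMemtable(cfs)  // created on demand, only during repairs
                    : getUnrepairedMemtable(cfs);       // the "normal" memtable, as today
    put(target, update);

    // On flush, the repaired memtable produces an sstable whose metadata carries a
    // repairedAt timestamp, while the unrepaired memtable flushes with repairedAt = 0.
}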

          My question is:
How important is the exact value of "repairedAt"? Is it possible to merge updates with different repair timestamps into a single memtable and finally flush them to an SSTable with repairedAt set to the latest or earliest repairedAt timestamp of all mutations in the memtable?
Or would that produce repair inconsistencies or something?

          Any feedback?

          spodxx@gmail.com Stefan Podkowinski added a comment -

          Before moving on in the discussion, I think Paulo Motta's suggestion still stands and deserves some more thoughts:

          As discussed on the mailing list and CASSANDRA-12905 we cannot ensure view consistency with sstable-based streaming of base/MVs in all scenarios, specially with repair, so I'm afraid this is not a viable solution just yet (needs to be further discussed). An alternative is to provide an option for repair to allow repairing base and MV separately when you know what you are doing©.

Actually, having to repair both the base and the view table to get back into a consistent state is something I would expect as a user, and something that would happen anyway during my regular repairs, whereas the mutation-based approach is probably unheard of by most users and totally unexpected, given how we handled anti-entropy in the past. Although we talk about "views", I think most users will understand that these are actually separate tables and should be handled as such from an operational perspective. So do we really want to introduce additional complexity that may bite us in the back at some point, instead of just relying on regular repairs?

          tjake T Jake Luciani added a comment -

          Just to restate the reason we need to put MV base table mutations through the regular write path (vs just streaming the MV and the base) from CASSANDRA-6477

Let's say you put non-PK columns A and B into a materialized view. Replica 1 has all of column A, replica 2 has all of column B. The build would end up with no data in the MV. You would need to subsequently repair the data to build the MV.

          spodxx@gmail.com Stefan Podkowinski added a comment -

          If you build from an inconsistent node and some missing data never makes it into a view, regular repairs alone won't be able to fix that. Is that your point? Although I see that this could happen, I'd instead still prefer to just tell the user to do a full repair before creating a view, if that would be a big issue.

          brstgt Benjamin Roth added a comment -

A repair must go through the write path except for some special cases. I also first had the idea to avoid it completely, but in discussion with Paulo Motta it turned out that this may introduce inconsistencies that could only be fixed by a view rebuild, because it leaves stale rows.
I know that all this stuff is totally counter-intuitive, but just "blindly" streaming all sstables (incl. MV tables) down is not correct. This is why I am trying to improve the mutation-based approach.

If the sstables of an MV get corrupted or lost, the only way to fix it is to rebuild that view again. There is no way (at least none I see atm) to consistently repair a view from other nodes.

          The underlying principle is:

• A view must always be consistent with its base table
• A view does not have to be consistent among nodes; that's handled by repairing the base table

That's also why you don't have to run a repair before building a view. Nevertheless, it would not help anyway, because you NEVER have a 100% guaranteed consistent state. A repair only guarantees consistency up to the point of the repair.

          The "know what you are doing" option is offered by CASSANDRA-13066 btw.
          In this ticket I also adopted the election of CFs (tables + mvs) when doing a keyspace repair depending if the MV is repaired by stream or by mutation.

          brstgt Benjamin Roth added a comment -

For a detailed explanation, here is an excerpt from that discussion:


          ... there are still possible scenarios where it's possible to break consistency by repairing the base and the view separately even with QUORUM writes:

Initial state:

Base replica A: {k0=v0, ts=0}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v1=k0, ts=0}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}

Base replica A receives write {k1=v1, ts=1}, propagates to view paired replica A and dies.

Current state is:

Base replica A: {k1=v1, ts=1}
Base replica B: {k0=v0, ts=0}
Base replica C: {k0=v0, ts=0}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v0=k0, ts=0}
View paired replica C: {v0=k0, ts=0}

Base replica B and C receives write {k2=v2, ts=2}, write to their paired replica. Write is successful at QUORUM.

Current state is:

Base replica A: {k1=v1, ts=1}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}
View paired replica A: {v1=k1, ts=1}
View paired replica B: {v2=k2, ts=2}
View paired replica C: {v2=k2, ts=2}

A returns from the dead. Repair base table:

Base replica A: {k2=v2, ts=2}
Base replica B: {k2=v2, ts=2}
Base replica C: {k2=v2, ts=2}

Repair MV:

View paired replica A: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica B: {v1=k1, ts=1} and {v2=k2, ts=2}
View paired replica C: {v1=k1, ts=1} and {v2=k2, ts=2}

So, this requires replica A to generate a tombstone for {v1=k1, ts=1} during repair of the base table.

          brstgt Benjamin Roth added a comment -

Maybe my former question got lost:

What effect does the repairedAt flag have on future repairs, other than that a non-zero value means the table has been repaired at some point?
          I am happy about any code references.

          spodxx@gmail.com Stefan Podkowinski added a comment -

The repairedAt value stored in each sstable's metadata indicates the time the sstable was repaired and nothing more. The basic idea behind tracking such a timestamp value is that once an sstable has been repaired, the contained data is consistent in a way that no node would miss any data such as tombstones, and therefore we won't have to repair this data ever again. This is what makes incremental repairs possible. As simple as the idea is, things start to become a bit tricky when we want to merge data, either by compaction or, in the case of this ticket, by applying mutations. The way compactions have been implemented is that we now have two pools of sstables that are compacted independently from each other: unrepaired and repaired data. Sstables within each pool can be compacted together just fine, and in the case of repaired data, the lowest timestamp of the compaction candidates is used for the output. However, the actual timestamp value currently doesn't really matter, as we just use it to track whether the sstable has been repaired or not. Future repairs may be executed based on unrepaired data only (incremental) or on both unrepaired and repaired data (full). Does this answer your question?
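To illustrate just the merge rule described above (assuming the convention that repairedAt == 0 means "unrepaired"), a compaction over sstables from the repaired pool could derive its output timestamp like this:

// Hedged sketch: the output repairedAt is the lowest repairedAt among the inputs;
// an empty input set is treated as unrepaired.
static long mergedRepairedAt(Collection<Long> inputRepairedAtValues)
{
    long min = Long.MAX_VALUE;
    for (long repairedAt : inputRepairedAtValues)
        min = Math.min(min, repairedAt);
    return min == Long.MAX_VALUE ? 0 : min;
}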

          brstgt Benjamin Roth added a comment -

Just perfect! That's EXACTLY what I wanted to know and it helps me to continue working on this ticket. I started some proof-of-concept work but it still needs some finalizing and exhaustive testing.

The concept is quite simple in theory (hopefully in reality, too):
Each table (CFS) may now have more than one active memtable, one each for unrepaired and repaired data (like the compaction pools). The repaired memtable does not have to be resident all the time, only during repairs, so my intention was to create it on demand and not to automatically re-create one after a flush. To keep things simple for a start, my intention was to apply flush behaviour to both memtables: either both are flushed or none is. Maybe this could be optimized in the future.

          tjake T Jake Luciani added a comment - - edited

          Proposing another (maybe simpler) solution to this problem:

          We currently replay the base table mutations through the write path and drop the streamed sstable.

          Instead, we could add the streamed base-table sstable, with its repairedAt flag, like any other repair.
          Then, just as we do now, replay the mutations through the write path, but with a mutation flag so that only the MVs are updated and not the base table.

          This means the MV wouldn't be incrementally repairable, but really you shouldn't need to repair the MVs unless there is data loss.
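
          A sketch of what such a flag-guarded replay could look like; the interface below is a hypothetical stand-in introduced for this example, not an existing Cassandra API:

              // hypothetical sketch: keep the streamed sstable (with its repairedAt)
              // and replay its data through a path that updates the views and the
              // commit log, but never re-writes the base table.
              interface ViewsOnlyUpdate
              {
                  void applyToViewsAndCommitLog();   // MV updates and CDC archiving
                  // deliberately no base-table write: the streamed sstable already carries that data
              }

              static void replayForViewsOnly(Iterable<ViewsOnlyUpdate> streamedUpdates)
              {
                  for (ViewsOnlyUpdate update : streamedUpdates)
                      update.applyToViewsAndCommitLog();
              }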

          Hide
          tjake T Jake Luciani added a comment -

          I think that idea won't fly, actually. The problem is that if you add the sstable first, the MV updates won't reflect the prior state. If you add it after the MV updates, there will be a window when the MV has data the base table does not. Maybe the latter isn't a deal breaker, but it leaves more room for problems.

          Hide
          brstgt Benjamin Roth added a comment -

          I also had this idea, but it won't work. It would totally break base <> MV consistency. Except: you could lock all involved partitions for the whole process. But that would create insanely long locks and extremely high contention.

          Hide
          brstgt Benjamin Roth added a comment -

          Btw.:
          My concept seems to work, but there is one question left:
          Why does a StreamSession create unrepaired SSTables?
          IncomingFileMessage
          => Creates RangeAwareSSTableWriter:97
          => cfs.createSSTableMultiWriter
          ...
          => CompactionStrategyManager.createSSTableMultiWriter:185

          Will it be marked as repaired later? If so, where/when?

          Why I ask:
          The received SSTable has the repairedFlag in RangeAwareSSTableWriter and its header, but it is lost when the SSTable is finished and returned as an SSTableReader.

          Hide
          brstgt Benjamin Roth added a comment -

          https://github.com/apache/cassandra/compare/trunk...Jaumo:CASSANDRA-12888
          Some dtest assertions: https://github.com/riptano/cassandra-dtest/compare/master...Jaumo:CASSANDRA-12888?expand=1
          Hide
          brstgt Benjamin Roth added a comment - - edited

          A review would be much appreciated. I don't know if Paulo Motta still wants to do the review. Please give me some feedback, thanks!

          Hide
          brstgt Benjamin Roth added a comment -

          Paulo Motta Have you been able to take a look at the patch yet? If not, maybe someone else wants to review it? It's been there for 2 months now.

          The patch introduces multiple (active) memtables per CF. This could also help in other situations like:
          https://issues.apache.org/jira/browse/CASSANDRA-13290
          https://issues.apache.org/jira/browse/CASSANDRA-12991

          Hide
          pauloricardomg Paulo Motta added a comment -

          Sorry for the delay here. The approach looks good, but the devil is in the details, so we need to be careful about introducing changes in the critical path.

          I will take a look this week.

          Hide
          brstgt Benjamin Roth added a comment -

          I am absolutely aware of that! That's why I also added some tests. All unit tests have run well so far. I also ran a bunch of probably related dtests, like the MV test suite, and those also looked good. Nevertheless, I don't want to rush you; take the time you need! I appreciate any feedback!

          Hide
          pauloricardomg Paulo Motta added a comment -

          First of all, thank you for your effort on this issue, and sorry for the delay. I had an initial look at the patch, and while I didn't spot any immediate flaws, it is not clear from the code alone whether this will have unintended repercussions in some other component (commit log archiving, memtable lifecycle/flushing, the read/write path), since it changes a core assumption: a given CF/table currently has a single active memtable (or at most two, during flush), whereas this introduces multiple active memtables. While in an ideal world correctness would be ensured by tests, we know well that our test coverage is far from ideal, so we need to make sure the approach is well discussed and all edge cases are understood before jumping into actual code.

          One way to facilitate communication on changes like this is a design doc. One is not generally required for small changes, but since we're modifying a core component it is important to spell out all the details, to make sure we're covering everything and to communicate the change to interested parties for feedback without exposing them to actual code. With that said, it would be great if you could prepare a short design doc with the proposed changes and their justification, how the memtable and flushing lifecycle is affected, the impacted areas, API changes/additions, and a test plan before we proceed.

          Overall I think you are on the right path but your approach needs some adjustments. See comments below:

          • I think explicitly managing the repaired memtable lifecycle from the repair job is less prone to errors/leaks than letting the memtable be created on demand when a mutation is applied and removed organically by automatic flush. My suggestion would be to create the memtable at the start of the repair job when necessary, append mutations to it, and then flush/remove the additional memtable at the end of the job.
          • In order for this to play along well with CASSANDRA-9143, the mutation would somehow need to include the repair session id, which would be used to retrieve the repairedAt from ActiveRepairService. In this arrangement we would keep a memtable per active repair session to ensure mutations from the same job are flushed to the same sstable. Tracker.getMemtableFor would fetch the right memtable for that repair session id, and return the ordinary unrepaired memtable if there is no memtable created for that repair session id (and probably log a warning); see the sketch after this list.
          • You modified flush to operate on all memtables of a given table, but we probably need to keep flush scoped to a single memtable, since we don't want to flush all memtables every time. For instance, if memtable_cleanup_threshold is reached, you will want to flush only the largest memtable, not all memtables of the table. Memtables will have different creation times, so you will only want to flush the expired memtable, not all memtables of the table. For repair you will only want to flush unrepaired memtables. Sometimes you will need to flush all memtables of a given table (such as for nodetool flush), but not every time. With this we keep the flush lifecycle pretty much as it is today: there will be at most 2 active memtables for unrepaired data and at most 2 active memtables for each repair session (one being flushed and maybe its replacement), which will facilitate understanding and reduce the surface for unintended side effects. All of these changes need to be spelled out in the design doc as well as javadoc to make sure they're clearly communicated.
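
          As a rough illustration of the per-session lookup suggested above (the names and signatures below are illustrative only, not the actual Tracker API):

              // illustrative sketch: one extra memtable per active repair session,
              // with plain writes falling back to the ordinary unrepaired memtable.
              import java.util.Map;
              import java.util.UUID;
              import java.util.concurrent.ConcurrentHashMap;

              class RepairAwareMemtables
              {
                  static class Memtable { /* stand-in for the real memtable */ }

                  private final Memtable unrepaired = new Memtable();
                  private final Map<UUID, Memtable> byRepairSession = new ConcurrentHashMap<>();

                  // created explicitly by the repair job when it starts
                  void registerRepairSession(UUID sessionId)
                  {
                      byRepairSession.put(sessionId, new Memtable());
                  }

                  // removed by the repair job at the end; the caller flushes it to a repaired sstable
                  Memtable finishRepairSession(UUID sessionId)
                  {
                      return byRepairSession.remove(sessionId);
                  }

                  // analogous to the suggested Tracker.getMemtableFor behaviour
                  Memtable memtableFor(UUID repairSessionId)
                  {
                      if (repairSessionId == null)
                          return unrepaired;
                      Memtable m = byRepairSession.get(repairSessionId);
                      if (m == null)
                      {
                          System.err.println("No memtable for repair session " + repairSessionId + ", using unrepaired");
                          return unrepaired;
                      }
                      return m;
                  }
              }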

          Marcus Eriksson, Blake Eggleston: does the above sound reasonable, or do you have any other suggestions or remarks? Marcus, is this more or less how you planned to add repaired memtables in CASSANDRA-8911, or did you have something else in mind?

          Hide
          pauloricardomg Paulo Motta added a comment -

          Another thing is that this will only be able to go into trunk/4.x, so we may need to think of a short-term workaround for the actual issue, such as disallowing incremental repairs for tables with MVs or CDC on 3.x, and add repaired memtables in a separate issue for 4.x.

          Hide
          bdeggleston Blake Eggleston added a comment -

          +1 on the design doc, or at least a comment thoroughly explaining the approach, since this seems to be a bit different from what was discussed above. Paulo, you're correct about #9143; we're going to need to preserve the pendingRepair value as well.

          At a high level though, I think the patch, and some of the solutions discussed, are approaching the problem from the wrong direction. Having to bend and repurpose storage system components like this to make streaming and repair work properly, to me, indicates there’s a problem in the MV implementation. Specifically, we shouldn't have to rewrite data that was just streamed. I think it would be better to focus on fixing those problems, instead of adding complexity to the storage layer to work around them. The contents of Mutations and UnfilteredRowIterators are pretty similar, decoupling the MV code from the write path should make it possible to apply the MV logic to sstable data without having to run it through the entire write path.
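
          A very rough sketch of that direction, using hypothetical stand-in interfaces rather than Cassandra's actual classes, just to show the shape of applying the MV logic to sstable data directly:

              // hypothetical sketch: feed streamed base-table partitions straight to
              // the (decoupled) view-update logic instead of converting them into
              // mutations and sending them through the entire write path.
              interface Partition { }                      // one base-table partition from the streamed sstable

              interface PartitionSource                    // e.g. a scanner over the streamed sstable
              {
                  Iterable<Partition> partitions();
              }

              interface ViewUpdateApplier                  // the MV logic, decoupled from the write path
              {
                  void apply(Partition basePartition);     // read pre-existing base state, write only to the views
              }

              static void applyViewUpdates(PartitionSource streamedSSTable, ViewUpdateApplier views)
              {
                  for (Partition p : streamedSSTable.partitions())
                      views.apply(p);
              }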

          Hide
          krummas Marcus Eriksson added a comment -

          I guess it looks pretty similar to what I wanted in CASSANDRA-8911

          I agree with Blake that this feels like the wrong approach, but I can't give a better suggestion since I have not touched the MV/CDC code much


            People

            • Assignee:
              brstgt Benjamin Roth
              Reporter:
              spodxx@gmail.com Stefan Podkowinski
              Reviewer:
              Paulo Motta
            • Votes:
              1
              Watchers:
              28
