Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 4.x
    • Component/s: CQL, Materialized Views
    • Labels: None

      Description

      Many common workloads are append-only: that is, they insert new rows but do not update existing ones. However, Cassandra has no way to infer this and so it must treat all tables as if they may experience updates in the future.

      If we added syntax to tell Cassandra about this (WITH INSERTS ONLY, for instance), then we could make a number of optimizations:

      • Compaction would only need to worry about defragmenting partitions, not rows. We could default to DTCS or similar.
      • CollationController could stop scanning sstables as soon as it finds a matching row.
      • Most importantly, materialized views wouldn't need to worry about deleting prior values, which would eliminate the majority of the MV overhead.
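
As a sketch, the proposal might look like the following. Note that the WITH INSERTS ONLY option is purely illustrative proposed syntax from this ticket; nothing here exists in Cassandra:

```sql
-- Hypothetical syntax: declare at table creation that rows are never updated.
CREATE TABLE events (
    id      timeuuid,
    source  text,
    payload blob,
    PRIMARY KEY (id)
) WITH INSERTS ONLY;

-- Under the proposed contract, each primary key is written exactly once;
-- updating an existing row would violate the contract.
INSERT INTO events (id, source, payload) VALUES (now(), 'sensor-1', 0x00);
```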

          Activity

          jbellis Jonathan Ellis added a comment -

          Since INSERT and UPDATE are semantically identical I don't think it's worth disallowing UPDATE on these tables. Instead, we would define our behavior to be: if you violate the INSERTS ONLY contract by updating existing rows, Cassandra will give you one of those versions back when you query it, but not necessarily the most recent. This allows us to preserve our optimizations while doing something reasonable in the face of a broken contract.

          snazy Robert Stupp added a comment -

          IMO it would be logical to disallow UPDATE for WITH INSERTS ONLY tables (and that's what WITH INSERTS ONLY implies).

          Would WITH INSERTS ONLY also mean restricting to primary keys without a clustering key?
          Maybe I didn't completely get it. What I'm thinking about is that one partition can still be split across the memtable + multiple sstables - which would conflict with the compaction/read-path optimizations. For example, if you have a table with PRIMARY KEY ((year, month, day), hour, minute, second) and several million INSERTs per day, it's likely that this will result in multiple sstables per day. I mean - I'm a bit afraid that partitions would have to get too tiny, with all the consequences (too many queries, not being able to insert from different clients for the same day).

          If such a WITH INSERTS ONLY table has no clustering key, even more optimizations might be possible (the key-cache key would not need the sstable ref in the key, but in the value - so we could do the key-cache lookup and skip the bloom-filter lookup on a hit).
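
The time-series shape described above would look like this (illustrative schema; the table and column names are made up for the example):

```sql
-- One partition per day; several million rows per day means the partition
-- will almost certainly be spread across the memtable and multiple sstables.
CREATE TABLE measurements (
    year   int,
    month  int,
    day    int,
    hour   int,
    minute int,
    second int,
    value  double,
    PRIMARY KEY ((year, month, day), hour, minute, second)
);
```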

          jbellis Jonathan Ellis added a comment -

          IMO it would be logical to disallow UPDATE for WITH INSERTS ONLY tables

          I guess I don't care if we disallow UPDATE; my larger point is that the normal semantics of INSERT allow UPDATE-like behavior, so we need to redefine those slightly to allow us to optimize.

          Would WITH INSERTS ONLY mean to also restrict to primary-keys without clustering-key?

          The MV optimization is still useful without that restriction, and you get partial benefits on the others depending on what kinds of read requests are served. So IMO we shouldn't do this for a first cut.

          iamaleksey Aleksey Yeschenko added a comment -

          IMO it would be logical to disallow UPDATE for WITH INSERTS ONLY tables (and that's what with INSERTs only says).

          Would be better to just change the syntax to something like NO OVERWRITES.

          kohlisankalp sankalp kohli added a comment -

          I think NO OVERWRITES or NO UPDATES is easier to understand, since you can overwrite data with INSERTs.
          We should also log errors if, during compaction, we find that data has updates.

          snazy Robert Stupp added a comment -

          Another possible optimization: with append-only tables we wouldn't need to write tombstones for NULL values.

          Yasuharu Yasuharu Goto added a comment -

          How about a WITH INSERTS ONLY option for individual columns?

          In our use case, we have mutable and immutable columns in a table, and we're currently indexing only the immutable columns manually.
          We'd be happy if this optimization could be applied to our app.

          maedhroz Caleb Rackliffe added a comment - edited

          Would it make sense to reject (or at least allow rejection of) an INSERT with only a partial row?

          aweisberg Ariel Weisberg added a comment -

          For tables that are marked append-only, it would be nice to have some best-effort warnings or feedback if updates do occur. Checking the memtable when writing might be cheap/free, and during compaction we can warn and log a conflict if an update is encountered. We could do the same thing on read.

          This would give people with a buggy application (or a bug in Cassandra) rapid feedback rather than silently giving them inconsistent results.

          rustyrazorblade Jon Haddad added a comment -

          This seems like it would cause cluster errors between the point where you altered a table and the point where you pushed new code into production. I'd -1 this based on the volume of problems it's likely to cause for people.

          krummas Marcus Eriksson added a comment -

          during compaction we can warn and log a conflict if an update is encountered.

          Repair would make this harder, since it overstreams - we would end up with the exact same data in several sstables.

          jbellis Jonathan Ellis added a comment -

          Easy enough to ignore merges of the same value. Same problem + solution in the case of timed-out updates that get retried.

          tjake T Jake Luciani added a comment - edited

          if you violate the INSERTS ONLY contract by updating existing rows, Cassandra will give you one of those versions back when you query it, but not necessarily the most recent.

          It sounds like you are saying there are no guarantees.

          I've given this some thought and I think the best approach in which we can syntactically "do something" is to combine this ticket with the idea Tyler Hobbs touched on in CASSANDRA-9928. This might be what you are describing we should do but I'll just restate it.

          One possible solution is to require that all non-PK columns that are in a view PK be updated simultaneously. T Jake Luciani mentioned possible problems from read repair, but it seems like, with this restriction in place, any read repairs would end up repairing all non-PK columns at once.

          Basically, this would add a mode where we only allow INSERTs of all columns every time. While this sounds restrictive, it also forces the user to confront the fact that making updates is conceptually/logistically hard, since we would kick out all client mutations that don't specify all columns. Sure, you could subvert this, but at least the server can alert the user that updating existing data, as they can with other tables, is hard.

          So the proposal is:

          • Add a table-level flag/syntax to mark that a table is INSERT ONLY (which can be altered if there's an emergency).
          • Reject any INSERTs/UPSERTs that do not specify all columns.
          • Possibly always return the earliest row if there is a conflict.
          • When writing to the memtable, we can add a putIfAbsent method to reject/ignore updates (to cover some minimal bases).
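
Under this proposal, the client-visible behavior might look like the following (hypothetical; the INSERT ONLY option and the rejection behavior are this ticket's proposal, not existing Cassandra semantics):

```sql
-- Table marked insert-only (proposed, illustrative syntax):
CREATE TABLE readings (
    id    timeuuid PRIMARY KEY,
    name  text,
    value double
) WITH INSERTS ONLY;

-- Accepted: all columns are specified.
INSERT INTO readings (id, name, value) VALUES (now(), 'temp', 21.5);

-- Rejected under the proposal: partial row (missing 'value').
INSERT INTO readings (id, name) VALUES (now(), 'temp');
```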
          tupshin Tupshin Harper added a comment -

          Basically, we are talking about frozen rows (as an analogy to frozen collections), and I am very much in favor of this. Many use cases, particularly IoT, would be able to use such an optimization while still benefiting from representing data in highly structured columns.

          slebresne Sylvain Lebresne added a comment -

          Shouldn't our first step be to validate that the optimization we have in mind actually makes a meaningful difference (without having to bend a benchmark too hard to show benefits)? It seems clear to me that this will add complexity from the user's point of view (it's a new concept that will either have good footshooting potential (if we just trust the user to insert only, without checking it) or be annoying to use (if we force all columns every time)), so it sounds to me like we would need to demonstrate fairly big performance benefits for this to be worth doing (keep in mind that once we add such a thing, we can't easily remove it, even if the improvement becomes obsolete).

          tl;dr, I don't love the whole idea, as I think it adds complexity from the user's point of view (don't get me wrong, if we could validate this at insert time, I'd be much more of a fan, but we can't), and I'm wondering, given DTCS and other optimizations we have internally, whether this really brings that much to the table.

          tjake T Jake Luciani added a comment -

          The big win is for MVs and probably 2i, since we don't need to read before write anymore.

          slebresne Sylvain Lebresne added a comment -

          The big win is for MV

          Well, that kind of brings up the point that "just" forcing the user to provide all columns actually doesn't work, because you still need to respect timestamps if there are multiple updates (and so "keep" the first hit), and in the case of MVs, you can't assume the update will have a higher timestamp than what's in store.

          So you do need to assume that there are no overwrites, and we can't validate that, which I don't love.

          tjake T Jake Luciani added a comment -

          But we can force the user to always send all the data and block any partial writes/updates.
          I guess my point was that I think that's enough, or the best we can do ATM. The alternative being proposed, I think, was to just state the contract.

          jjirsa Jeff Jirsa added a comment -

          Old stale ticket, but FWIW, re: Sylvain Lebresne's comment on DTCS (/TWCS) -

          This ticket would enable modifying CompactionController.getFullyExpiredSSTables() to be more closely aligned with what people expect it to do, eliminating the need for sstableexpiredblockers and the like. Performance-wise, it's probably a minor gain, but operationally it's probably a big net gain in terms of least surprise.

          michaelsembwever mck added a comment -

          Bumping to fix version 4.x, as 3.11.0 is a bug-fix-only release.
            ref https://s.apache.org/EHBy


            People

            • Assignee: Unassigned
            • Reporter: jbellis Jonathan Ellis
            • Votes: 9
            • Watchers: 40
